01-Getting_Started_with_Azure_ML Flashcards
Run Configuration
Defines the Python code execution environment for the script
E.g., sets a Conda environment with some default Python packages installed
from azureml.core.runconfig import RunConfiguration

# Create a new RunConfig object
experiment_run_config = RunConfiguration()
Script Configuration
Identifies the Python script file to be run in the experiment, and the environment in which to run it
from azureml.core import ScriptRunConfig

# Create a script config
src = ScriptRunConfig(source_directory=experiment_folder,
                      script='diabetes_experiment.py',
                      run_config=experiment_run_config)
How to setup Model Training
- Connect to your workspace (see the sketch after this list)
- Create a folder for the experiment files (data + training script)
- Create a training script
- Use an Estimator to Run the Script as an Experiment
- Register the Trained Model
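A minimal sketch of the first step (connecting to the workspace), assuming a config.json file downloaded from the Azure portal is in the working directory:
from azureml.core import Workspace

# Load the workspace from the saved config file (config.json)
ws = Workspace.from_config()
print('Ready to work with', ws.name)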
Training/Entry Scripts
# Import libraries
# Get the experiment run context
# Load the training data
# Separate features and labels
# Split data into training and test sets
# Create and train a model
# Score / predict with the model
# Evaluate the model
# Save the model to the experiment's outputs folder
# Complete the run (run.complete())
Estimator
You can run experiment scripts using a RunConfiguration and a ScriptRunConfig, or you can use an Estimator, which abstracts both of these configurations in a single object to run the training experiment.
An estimator runs a training script
Create an estimator
from azureml.train.estimator import Estimator

estimator = Estimator(source_directory=training_folder,
                      entry_script='diabetes_training.py',
                      compute_target='local',
                      conda_packages=['scikit-learn'])
# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)
Create and Run an Experiment
experiment = Experiment(workspace = ws, name = experiment_name)
# Run the experiment
run = experiment.submit(config=estimator)
RunDetails widget
As with any experiment run, you can use the RunDetails widget to view information about the run and get a link to it in Azure Machine Learning studio
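A minimal usage sketch, assuming a submitted run object from the experiment above:
from azureml.widgets import RunDetails

# Show the run details widget in the notebook
RunDetails(run).show()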
Retrieve the metrics and outputs from the Run object.
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)
Output
Regularization Rate 0.01
Accuracy 0.774
AUC 0.8483377282451863
azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/8_azureml.log
outputs/diabetes_model.pkl
Register a Trained Model
Note that the outputs of the experiment include the trained model file (diabetes_model.pkl).
You can register a model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl',
                   model_name='diabetes_model',
                   tags={'Training context': 'Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'],
                               'Accuracy': run.get_metrics()['Accuracy']})
Create a Parameterized Training Script
You can increase the flexibility of your training experiment by adding parameters to your entry script, enabling you to repeat the same training experiment with different settings
import argparse

# Set regularization hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg
Use a Framework-Specific Estimator
You used a generic Estimator class to run the training script, but you can also take advantage of framework-specific estimators that include environment definitions for common machine learning frameworks. In this case, you’re using Scikit-Learn, so you can use the SKLearn estimator. This means that you don’t need to specify the scikit-learn package in the configuration.
from azureml.train.sklearn import SKLearn

# Create an estimator
estimator = SKLearn(source_directory=training_folder,
                    entry_script='diabetes_training.py',
                    script_params={'--reg_rate': 0.1},
                    compute_target='local')
Working with Data
Data is the foundation on which machine learning models are built. Managing data centrally in the cloud, and making it accessible to teams of data scientists who are running experiments and training models on multiple workstations and compute targets is an important part of any professional data science solution.
Datastore
In Azure ML, datastores are references to storage locations, such as Azure Storage blob containers. Every workspace has a default datastore - usually the Azure storage blob container that was created with the workspace.
If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.
You can use local data files to train a model, but when running training workloads automatically on cloud-based compute, it makes more sense to store the data centrally in the cloud and ingest it into the training script wherever it happens to be running.
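A hedged sketch of listing the workspace datastores and registering an additional Azure blob container as a custom datastore; the datastore, container, and account names and key below are placeholders:
from azureml.core import Datastore

# List the datastores registered in the workspace, marking the default
default_ds = ws.get_default_datastore()
for ds_name in ws.datastores:
    print(ds_name, '- Default =', ds_name == default_ds.name)

# Register an additional blob container as a datastore (placeholder values)
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data-container',
                                                  account_name='azure_storage_account',
                                                  account_key='<storage-access-key>')

# Optionally make it the workspace default
# ws.set_default_datastore('blob_data')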
Upload Data to a Datastore
You can upload files from your local file system to a datastore so that it will be accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.
default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'], # Upload the diabetes csv files in /data
                        target_path='diabetes-data/', # Put it in a folder path in the datastore
                        overwrite=True, # Replace existing files of the same name
                        show_progress=True)
Train a Model from a Datastore
When you uploaded the files in the code cell above, note that the code returned a data reference.
The data reference can be used to download the contents of the folder to the compute context where the data reference is being used
Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to mount the datastore location and read data directly from the data source.
The entry script (via Estimator/experiment) will load the training data from the data reference passed to it as a parameter
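A hedged sketch of creating the data_ref used in the parameters below, assuming the default datastore and the diabetes-data folder uploaded earlier:
# Get a reference to the 'diabetes-data' folder in the default datastore, configured to
# download its contents to the compute context where the script runs
data_ref = default_ds.path('diabetes-data').as_download(path_on_compute='diabetes_data')
print(data_ref)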
# Set up the parameters
script_params = {
    '--regularization': 0.1,    # regularization rate
    '--data-folder': data_ref   # data reference to download files from the datastore
}

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target='local')

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)
Data reference
A data reference provides a way to pass the path to a folder in a datastore to a script, regardless of where the script is being run, so that the script can access data in the datastore location.
The data reference can be used to download the contents of the folder to the compute context where the data reference is being used
Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to mount the datastore location and read data directly from the data source.
Datasets
While you can read data directly from datastores, Azure Machine Learning provides a further abstraction for data in the form of datasets.
A dataset is a versioned reference to a specific set of data that you may want to use in an experiment.
Datasets can be tabular or file-based.
It’s easy to convert a tabular dataset to a Pandas dataframe, enabling you to work with the data using common Python techniques.
Create a Tabular Dataset
from azureml.core import Dataset
# Get the default datastore
default_ds = ws.get_default_datastore()

# Create a tabular dataset from the path on the datastore
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))
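As noted above, a tabular dataset converts easily to a Pandas dataframe; a minimal sketch using the dataset just created:
# Display the first 20 rows as a Pandas dataframe
tab_data_set.take(20).to_pandas_dataframe()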
Create a File Dataset
In some machine learning scenarios you might need to work with unstructured data, or you may simply want to handle reading the data from files in your own code. To accomplish this, you can use a file dataset, which creates a list of file paths in a virtual mount point that you can use to read the data in the files.
# Create a file dataset from the path on the datastore
file_data_set = Dataset.File.from_files(path=(default_ds, 'diabetes-data/*.csv'))

# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)
Register Datasets
You can register datasets to make them easily accessible to any experiment being run in the workspace.
You can view and manage datasets on the Datasets page for your workspace in Azure ML Studio or via code.
# Register the tabular dataset
try:
    tab_data_set = tab_data_set.register(workspace=ws,
                                         name='diabetes dataset',
                                         description='diabetes data',
                                         tags={'format': 'CSV'},
                                         create_new_version=True)
except Exception as ex:
    print(ex)

# Register the file dataset
try:
    file_data_set = file_data_set.register(workspace=ws,
                                           name='diabetes file dataset',
                                           description='diabetes files',
                                           tags={'format': 'CSV'},
                                           create_new_version=True)
except Exception as ex:
    print(ex)

print('Datasets registered')
Train a Model from a Tabular Dataset
Now that you have datasets, you’re ready to start training models from them. You can pass datasets to scripts as inputs in the estimator being used to run the script.
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target='local',
                    inputs=[diabetes_ds.as_named_input('diabetes')], # Pass the Dataset object as an input...
                    pip_packages=['azureml-dataprep[pandas]'])       # ...so we need the dataprep package
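Inside the entry script, the named dataset input can then be read from the run context; a minimal sketch (Pandas support comes from the azureml-dataprep[pandas] package passed above):
from azureml.core import Run

# Get the experiment run context and read the input dataset as a dataframe
run = Run.get_context()
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()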
Train a Model from a File Dataset
When you’re using a file dataset, the dataset input passed to the script represents a mount point containing file paths. How you read the data from these files depends on the kind of data in the files and what you want to do with it.
You can use the Python glob module to create a list of files in the virtual mount point defined by the dataset, and read them all into Pandas dataframes that are concatenated into a single dataframe.
For large volumes of data, you’d generally use the as_mount method to stream the files directly from the dataset source; but when running on local compute, you need to use the as_download option to download the dataset files to a local folder.
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes file dataset")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target='local',
                    inputs=[diabetes_ds.as_named_input('diabetes').as_download(path_on_compute='diabetes_data')],
                    pip_packages=['azureml-dataprep[pandas]'])
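Inside the entry script, the downloaded file dataset input resolves to a local folder path; a hedged sketch of reading all of its CSV files with glob into a single dataframe:
import glob
import pandas as pd
from azureml.core import Run

# Get the path to the downloaded files from the named input
run = Run.get_context()
data_path = run.input_datasets['diabetes']

# Read all CSV files in the folder and concatenate them into one dataframe
all_files = glob.glob(data_path + "/*.csv")
diabetes = pd.concat((pd.read_csv(csv_file) for csv_file in all_files), sort=False)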
Working with Compute
When you run a script as an Azure Machine Learning experiment, you need to define the execution context for the experiment run. The execution context is made up of:
- The Python environment for the script, which must include all Python packages used in the script. The compute will require a Python environment with the necessary package dependencies installed
- The compute target on which the script will be run.
This could be the local workstation from which the experiment run is initiated, or a remote compute target such as a training cluster that is provisioned on-demand.
Define an Environment (Run Configuration)
When you run a Python script as an experiment in Azure Machine Learning, a Conda environment is automatically created to define the execution context for the script.
Azure Machine Learning provides a default environment that includes many common packages; including the azureml-defaults package that contains the libraries necessary for working with an experiment run, as well as popular packages like pandas and numpy.
You can also define your own environment and add packages by using conda or pip, to ensure your experiment has access to all the libraries it requires.
Example
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
# Create a Python environment for the experiment
diabetes_env = Environment("diabetes-experiment-env")
diabetes_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
diabetes_env.docker.enabled = True # Use a docker container

# Create a set of package dependencies (conda or pip as required)
diabetes_packages = CondaDependencies.create(conda_packages=['scikit-learn'],
                                             pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]'])

# Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_packages

print(diabetes_env.name, 'defined.')

# Register the environment
diabetes_env.register(workspace=ws)
Use in Estimator:
# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      inputs=[diabetes_ds.as_named_input('diabetes')],
                      script_params=script_params,
                      compute_target='local',
                      environment_definition=diabetes_env,
                      entry_script='diabetes_training.py')

# Create an experiment
experiment = Experiment(workspace=ws, name='diabetes-training')

# Run the experiment
run = experiment.submit(config=estimator)
Register the environment
Having gone to the trouble of defining an environment with the packages you need, you can register it in the workspace.
# Register the environment
diabetes_env.register(workspace=ws)
Run an Experiment on a Remote Compute Target
In many cases, your local compute resources may not be sufficient to process a complex or long-running experiment that needs to process a large volume of data; and you may want to take advantage of the ability to dynamically create and use compute resources in the cloud.
Azure ML supports a range of compute targets, which you can define in your workspace and use to run experiments, paying for the resources only when you use them.
In this case, we’ll run the diabetes training experiment on a compute cluster with a unique name of your choosing
You can do this by specifying the compute_target parameter in the estimator (you can set this to either the name of the compute target, or a ComputeTarget object.)
Example
from azureml.core.compute import ComputeTarget, AmlCompute

# Define and create the compute cluster
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      inputs=[diabetes_ds.as_named_input('diabetes')],
                      script_params=script_params,
                      compute_target=cluster_name, # Run the experiment on the remote compute target
                      environment_definition=registered_env,
                      entry_script='diabetes_training.py')
Creating an Azure Machine Learning Pipeline
You can perform the various steps required to ingest data, train a model, and register the model individually by using the Azure ML SDK to run script-based experiments.
However, in an enterprise environment it is common to encapsulate the sequence of discrete steps required to build a machine learning solution into a pipeline that can be run on one or more compute targets, either on-demand by a user, from an automated build process, or on a schedule.
Create Scripts for Pipeline Steps
Pipelines consist of one or more steps, which can be Python scripts, or specialized steps like an Auto ML training estimator or a data transfer step that copies data from one location to another. Each step can run in its own compute context.
Define Example Pipeline
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.train.estimator import Estimator
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create a PipelineData (data reference) for the model folder
model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())

estimator = Estimator(source_directory=experiment_folder,
                      compute_target=pipeline_cluster,
                      environment_definition=pipeline_run_config.environment,
                      entry_script='train_diabetes.py')

# Step 1, run the estimator to train the model
train_step = EstimatorStep(name="Train Model",
                           estimator=estimator,
                           estimator_entry_script_arguments=['--output_folder', model_folder],
                           inputs=[diabetes_ds.as_named_input('diabetes_train')],
                           outputs=[model_folder],
                           compute_target=pipeline_cluster,
                           allow_reuse=True)

# Step 2, run the model registration script
register_step = PythonScriptStep(name="Register Model",
                                 source_directory=experiment_folder,
                                 script_name="register_diabetes.py",
                                 arguments=['--model_folder', model_folder],
                                 inputs=[model_folder],
                                 compute_target=pipeline_cluster,
                                 runconfig=pipeline_run_config,
                                 allow_reuse=True)

print("Pipeline steps defined")
Prepare a Compute Environment for the Pipeline Steps
The pipeline will eventually be published and run on-demand, so it needs a compute environment in which to run.
You can use the same compute for all steps, but it's important to realize that each step runs independently, so you could specify different compute contexts for each step if appropriate.
Create and Run a Pipeline
First, you need to define the steps for the pipeline, and any data references that need to be passed between them, using a PipelineData object.
In this case, the first step must write the model to a folder that can be read from by the second step.
Since the steps will be run on remote compute (and in fact, could each be run on different compute), the folder path must be passed as a data reference to a location in a datastore within the workspace.
The PipelineData object
The PipelineData object is a special kind of data reference that is used to pass data from the output of one pipeline step to the input of another, creating a dependency between them.
Build the defined pipeline and run it as an experiment
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails
# Construct the pipeline
pipeline_steps = [train_step, register_step]
pipeline = Pipeline(workspace=ws, steps=pipeline_steps)
print("Pipeline is built.")

# Create an experiment and run the pipeline
experiment = Experiment(workspace=ws, name='diabetes-training-pipeline')
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)
print("Pipeline submitted for execution.")
RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()
You can also monitor pipeline runs in the Experiments page in Azure Machine Learning studio.
Publish a Pipeline
When you’ve created a pipeline and verified it works, you can publish it as a REST service
published_pipeline = pipeline.publish(name="Diabetes_Training_Pipeline",
                                      description="Trains diabetes model",
                                      version="1.0")
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)
Call a Pipeline
To use the endpoint, client applications need to make a REST call over HTTP. This request must be authenticated, so an authorization header is required. A real application would require a service principal with which to authenticate.
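A hedged sketch of calling the published endpoint, using interactive login authentication to obtain the authorization header (the experiment name is assumed to match the pipeline experiment above):
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Get an authorization header for the current interactive session
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

# Trigger the published pipeline via its REST endpoint
experiment_name = 'diabetes-training-pipeline'
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": experiment_name})
run_id = response.json()["Id"]
print(run_id)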
from azureml.pipeline.core import PipelineRun

# Get a reference to the pipeline run started via the REST endpoint
published_pipeline_run = PipelineRun(ws.experiments[experiment_name], run_id)
Azure ML Pipelines vs Azure DevOps Pipelines
You can use the Azure Machine Learning extension for Azure DevOps to combine Azure ML pipelines with Azure DevOps pipelines and integrate model retraining into a continuous integration/continuous deployment (CI/CD) process.
For example, you could use an Azure DevOps build pipeline to trigger an Azure ML pipeline that trains and registers a model, and when the model is registered it could trigger an Azure DevOps release pipeline that deploys the model as a web service, along with the application or service that consumes the model.
Register a Model
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl',
                   model_name='diabetes_model',
                   tags={'Training context': 'Inline Training'},
                   properties={'AUC': run.get_metrics()['AUC'],
                               'Accuracy': run.get_metrics()['Accuracy']})
Deploy a Model as a Web Service
- We’re going to create a web service to host this model, and this will require some code and configuration files; so let’s create a folder for those.
- The web service where we deploy the model will need some Python code to load the input data, get the model from the workspace, and generate and return predictions. We'll save this code in a scoring script that will be deployed to the web service
- init() -> # Loads the model when the service is loaded
- run(input_data) -> # Called when a request is received
- The web service will be hosted in an Azure Container Instance (ACI), and the container will need to install any required Python dependencies when it gets initialized. So we'll create a .yml file that tells the container host to install the required packages into the environment (see the sketch after this list)
- We'll deploy the container as a service named diabetes-service. The deployment process includes the following steps:
- Define an inference configuration (scoring environment), which includes the scoring and environment files required to load and use the model.
- Define a deployment configuration that defines the execution environment in which the service will be hosted - in this case, an Azure Container Instance.
- Deploy the model as a web service.
- Verify the status of the deployed service.
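A hedged sketch of generating the dependency .yml file referenced in the list above, assuming the same folder_name used for the scoring script:
from azureml.core.conda_dependencies import CondaDependencies

# Add the package the model depends on (azureml-defaults is included by default)
myenv = CondaDependencies()
myenv.add_conda_package('scikit-learn')

# Save the environment config as a .yml file in the deployment folder
env_file = folder_name + "/diabetes_env.yml"
with open(env_file, "w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)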
Example
from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig, Model

# Configure the scoring environment
inference_config = InferenceConfig(runtime="python",
                                   source_directory=folder_name,
                                   entry_script="score_diabetes.py",
                                   conda_file="diabetes_env.yml")

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service_name = "diabetes-service"
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(True)
print(service.state)
Score Script Example (real-time)
%%writefile $folder_name/score_diabetes.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load it
    model_path = Model.get_model_path('diabetes_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)
Consume ACI Web Service (SDK) (Real time inferencing)
With the service deployed, now you can consume it from a client application.
The code below uses the Azure ML SDK to connect to the containerized web service and use it to generate predictions from your diabetes classification model. In production, a model is likely to be consumed by business applications that do not use the Azure ML SDK, but simply make HTTP requests to the web service.
import json
This time our input is an array of two feature arrays
x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
[0,148,58,11,179,39.19207553,0.160829008,45]]
# Convert the array or arrays to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Call the web service, passing the input data
predictions = service.run(input_data=input_json)

# Get the predicted classes
predicted_classes = json.loads(predictions)

for i in range(len(x_new)):
    print("Patient {}".format(x_new[i]), predicted_classes[i])
Example Service Endpoint Uri
endpoint = service.scoring_uri
print(endpoint)
http://34733966-1951-4854-8c7c-1173ec0aae1b.northeurope.azurecontainer.io/score
Consume ACI Web Service (REST Endpoint) (Real time inferencing)
Now that you know the endpoint URI, an application can simply make an HTTP request, sending the patient data in JSON (or binary) format, and receive back the predicted class(es).
import requests
import json
x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
[0,148,58,11,179,39.19207553,0.160829008,45]]
# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Set the content type
headers = {'Content-Type': 'application/json'}

predictions = requests.post(endpoint, input_json, headers=headers)
predicted_classes = json.loads(predictions.json())

for i in range(len(x_new)):
    print("Patient {}".format(x_new[i]), predicted_classes[i])
Batch inferencing
Used to score a set of data as a batch, rather than handling individual requests in real time.
Create a Pipeline for Batch Inferencing
- Our pipeline will need Python code to perform the batch inferencing, so let’s create a folder where we can keep all the files used by the pipeline
- Now we’ll create a Python batch score/inference script to do the actual work, and save it in the pipeline folder
- Define Run Context with dependencies for the scoring script
- Define a ParallelRunStep config and a ParallelRunStep that calls the batch scoring script
- Create Pipeline including the ParallelRunStep
- Run the pipeline as an experiment
- Publish the Pipeline and use its REST Interface
Batch Scoring Script Example
%%writefile $experiment_folder/batch_diabetes.py
import os
import numpy as np
import joblib
from azureml.core import Model

def init():
    # Runs when the pipeline step is initialized
    global model
    # Load the model
    model_path = Model.get_model_path('diabetes_model')
    model = joblib.load(model_path)

def run(mini_batch):
    # This runs for each mini-batch
    resultList = []
    # Process each file in the batch
    for f in mini_batch:
        # Read the comma-delimited data into an array
        data = np.genfromtxt(f, delimiter=',')
        # Reshape into a 2-dimensional array for prediction (model expects multiple items)
        prediction = model.predict(data.reshape(1, -1))
        # Append the prediction to the results
        resultList.append("{}: {}".format(os.path.basename(f), prediction[0]))
    return resultList
ParallelRunStep
Enables the batch data to be processed in parallel and the results collated in a single output file
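A hedged sketch of a ParallelRunConfig and ParallelRunStep; the batch file dataset (batch_data_set), scoring environment (batch_env), inference compute cluster (inference_cluster), and pipeline folder (experiment_folder) are assumed names:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Output location where the collated results will be written
output_dir = PipelineData(name='inferences', datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script="batch_diabetes.py",
    mini_batch_size="5",          # number of files per mini-batch
    error_threshold=10,
    output_action="append_row",   # collate results into a single output file
    environment=batch_env,
    compute_target=inference_cluster,
    node_count=2)

parallelrun_step = ParallelRunStep(
    name='batch-score-diabetes',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('diabetes_batch')],
    output=output_dir,
    arguments=[],
    allow_reuse=True)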
Tuning Hyperparameters
There are many machine learning algorithms that require hyperparameters (parameter values that influence training, but can’t be determined from the training data itself).
For example, when training a logistic regression model, you can use a regularization rate hyperparameter to counteract bias in the model; or when training a convolutional neural network, you can use hyperparameters like learning rate and batch size to control how weights are adjusted and how many data items are processed in a mini-batch respectively.
The choice of hyperparameter values can significantly affect the performance of a trained model, or the time taken to train it; and often you need to try multiple combinations to find the optimal solution.
Example Hyperdrive Experiment
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails
Sample a range of parameter values
params = GridParameterSampling(
    {
        # There's only one parameter, so grid sampling will try each value - with multiple parameters it would try every combination
        '--regularization': choice(0.001, 0.005, 0.01, 0.05, 0.1, 1.0)
    }
)
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                          inputs=[diabetes_ds.as_named_input('diabetes')], # Pass the dataset as an input...
                          pip_packages=['azureml-sdk'], # ...so we need azureml-dataprep (it's in the SDK!)
                          entry_script='diabetes_training.py',
                          compute_target=training_cluster)

# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(estimator=hyper_estimator,
                              hyperparameter_sampling=params,
                              policy=None,
                              primary_metric_name='AUC',
                              primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                              max_total_runs=6,
                              max_concurrent_runs=4)

# Run the experiment
experiment = Experiment(workspace=ws, name='diabates_training_hyperdrive')
run = experiment.submit(config=hyperdrive)
Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()
Hyperdrive Experiments
Azure Machine Learning includes a hyperparameter tuning capability through Hyperdrive experiments.
These experiments launch multiple child runs, each with a different hyperparameter combination.
The run producing the best model (as determined by the logged target performance metric for which you want to optimize) can be identified, and its trained model selected for registration and deployment.
Hyperparameter Tuning - Determine the Best Performing Run
When all of the runs have finished, you can find the best one based on the performance metric you specified (in this case, the one with the best AUC).
for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Regularization Rate:', parameter_values)
from azureml.core import Model
# Register the best model
best_run.register_model(model_path='outputs/diabetes_model.pkl',
                        model_name='diabetes_model',
                        tags={'Training context': 'Hyperdrive'},
                        properties={'AUC': best_run_metrics['AUC'],
                                    'Accuracy': best_run_metrics['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print('\t', tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print('\t', prop_name, ':', prop)
    print('\n')
Automated Machine Learning
There are many kinds of machine learning algorithm that you can use to train a model, and sometimes it’s not easy to determine the most effective algorithm for your particular data and prediction requirements.
Additionally, you can significantly affect the predictive performance of a model by preprocessing the training data, using techniques such as normalization, missing feature imputation, and others. In your quest to find the best model for your requirements, you may need to try many combinations of algorithms and preprocessing transformations; which takes a lot of time and compute resources.
Azure Machine Learning enables you to automate the comparison of models trained using different algorithms and preprocessing options. You can use the visual interface in Azure Machine Learning studio or the SDK to leverage this capability. The SDK gives you greater control over the settings for the automated machine learning experiment, but the visual interface is easier to use. In this lab, you'll explore automated machine learning using the SDK.
Automated Machine Learning (SDK)
You don’t need to create a training script for automated machine learning, but you do need to
- Create the training and test data (split) and save it to a datastore (see the sketch after this list)
- Setup a Compute
- Configure the Auto ML Experiment
- Run an Automated Machine Learning Experiment
- Get the best model
- Register the best model
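A hedged sketch of the data-preparation step in the list above, producing the train_ds and test_ds used in the configuration below from the previously registered tabular dataset:
# Split the registered tabular dataset into training and validation datasets
diabetes_ds = ws.datasets.get("diabetes dataset")
train_ds, test_ds = diabetes_ds.random_split(percentage=0.7, seed=123)
print("Data ready!")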
Example Auto ML Experiment
To configure the automated machine learning experiment, you’ll need a run configuration that includes the required packages for the experiment environment, and a set of configuration settings that specifies how many combinations to try, which metric to use when evaluating models, and so on.
from azureml.train.automl import AutoMLConfig
automl_config = AutoMLConfig(name='Automated ML Experiment',
                             task='classification',
                             compute_target=training_cluster,
                             training_data=train_ds,
                             validation_data=test_ds,
                             label_column_name='Diabetic',
                             iterations=6,
                             primary_metric='AUC_weighted',
                             max_concurrent_iterations=2,
                             featurization='auto')

print("Ready for Auto ML run.")
Run an Automated Machine Learning Experiment
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails
print('Submitting Auto ML experiment...')
automl_experiment = Experiment(ws, 'diabetes_automl')
automl_run = automl_experiment.submit(automl_config)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)
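Once the run completes, a hedged sketch of retrieving and registering the best model; the outputs/model.pkl path within the best run is an assumption:
# Get the best run and the fitted model it produced
best_run, fitted_model = automl_run.get_output()
best_run_metrics = best_run.get_metrics()
print(best_run)
print('AUC_weighted:', best_run_metrics.get('AUC_weighted'))

# Register the best model (the model_path within the run's outputs is assumed)
best_run.register_model(model_path='outputs/model.pkl',
                        model_name='diabetes_model_automl',
                        tags={'Training context': 'Auto ML'})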
Interpreting Models
You can use Azure Machine Learning to interpret a model by using an explainer that quantifies the amount of influence each feature contributes to the predicted label.
There are many common explainers, each suitable for different kinds of modeling algorithm; but the basic approach to using them is the same.
Explainer
Quantifies the amount of influence each feature contributes to the predicted label. That is: how do the features in the data influence the prediction?
There are many kinds of explainer. In this example you’ll use a Tabular Explainer, which is a “black box” explainer that can be used to explain many kinds of model by invoking an appropriate SHAP model explainer.
Get an Explainer for our Model
Get a suitable explainer for the model from the Azure ML interpretability library
Get Global Feature Importance
The first thing to do is try to explain the model by evaluating the overall feature importance - in other words, quantifying the extent to which each feature influences the prediction based on the whole training dataset.
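A hedged sketch using a TabularExplainer, assuming the trained model, training features (X_train), feature names (features), and class names (labels) from the training code:
from interpret.ext.blackbox import TabularExplainer

# "Black box" explainer that works with many kinds of model
tab_explainer = TabularExplainer(model, X_train, features=features, classes=labels)

# Explain overall (global) feature importance using the training data
global_explanation = tab_explainer.explain_global(X_train)

# Print the features ranked by importance
global_importance = global_explanation.get_feature_importance_dict()
for feature, importance in global_importance.items():
    print(feature, ':', importance)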
Output
Pregnancies : 0.2194762749294642
Age : 0.10575947971825919
BMI : 0.09306316543787874
SerumInsulin : 0.06734976452903166
PlasmaGlucose : 0.05007378902962012
TricepsThickness : 0.021124772576803175
DiastolicBloodPressure : 0.016574790766927222
DiabetesPedigree : 0.016206788169148716
Get Local Feature Importance
So you have an overall view, but what about explaining individual observations? Let’s generate local explanations for individual predictions, quantifying the extent to which each feature influenced the decision to predict each of the possible label values.
In this case, it’s a binary model, so there are two possible labels (non-diabetic and diabetic); and you can quantify the influence of each feature for each of these label values for individual observations in a dataset. You’ll just evaluate the first two cases in the test dataset.
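A hedged sketch of local explanations for the first two test observations, reusing the tab_explainer and labels assumed above:
# Get local explanations for the first two observations in the test set
local_explanation = tab_explainer.explain_local(X_test[0:2])

# Feature names and importance values, ranked per class label and per observation
local_features = local_explanation.get_ranked_local_names()
local_importance = local_explanation.get_ranked_local_values()

for label_idx in range(len(local_features)):               # one entry per class label
    print('Support for', labels[label_idx])
    for obs_idx in range(len(local_features[label_idx])):  # one entry per observation
        print('\tObservation', obs_idx + 1)
        feature_names = local_features[label_idx][obs_idx]
        feature_values = local_importance[label_idx][obs_idx]
        for feat_idx in range(len(feature_names)):
            print('\t\t', feature_names[feat_idx], ':', feature_values[feat_idx])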
Output
Support for not-diabetic
Observation 1
SerumInsulin : 0.36925304330130265
Age : 0.2390809685204034
TricepsThickness : 0.025815337535141827
BMI : 0.012977411808708952
DiabetesPedigree : 0.002921802522673878
DiastolicBloodPressure : -0.015906526133378316
PlasmaGlucose : -0.036300469029731476
Pregnancies : -0.26441299709655025
----------
Total: 0.3334285714285707 Prediction: not-diabetic
Support for diabetic
Observation 1
Pregnancies : 0.26441299709655014
PlasmaGlucose : 0.03630046902973156
DiastolicBloodPressure : 0.015906526133378347
DiabetesPedigree : -0.002921802522673868
BMI : -0.012977411808708974
TricepsThickness : -0.025815337535141855
Age : -0.23908096852040375
SerumInsulin : -0.369253043301303
----------
Total: -0.3334285714285714 Prediction: not-diabetic
Adding Explainability to Azure ML Models Training Experiments
You can generate explanations for models trained outside of Azure ML; but when you use experiments to train models in your Azure ML workspace, you can generate model explanations and log them.
Train and Explain a Model using an Experiment Example
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Import Azure ML run library
from azureml.core.run import Run

# Import libraries for model explanation
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient
from interpret.ext.blackbox import TabularExplainer

# Get the experiment run context
run = Run.get_context()

# Load the diabetes dataset
print("Loading Data...")
data = pd.read_csv('diabetes.csv')

features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
labels = ['not-diabetic', 'diabetic']

# Separate features and labels
X, y = data[features].values, data['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)

# Calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
run.log('Accuracy', np.float(acc))

# Calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:,1])
run.log('AUC', np.float(auc))

# Note: files saved in the outputs folder are automatically uploaded into the experiment record
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes.pkl')

# Get an explanation for the model
explainer = TabularExplainer(model, X_train, features=features, classes=labels)
explanation = explainer.explain_global(X_test)

# Get an Explanation Client and upload the explanation
explain_client = ExplanationClient.from_run(run)
explain_client.upload_model_explanation(explanation, comment='Tabular Explanation')

# Complete the run
run.complete()
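After the experiment run submitted from the notebook has completed, a hedged sketch of downloading the uploaded explanation (run here refers to that submitted run, not the in-script run context):
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

# Get the feature explanation that the script uploaded to the run
client = ExplanationClient.from_run(run)
explanation = client.download_model_explanation()
feature_importances = explanation.get_feature_importance_dict()

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)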
Monitoring a Model - Enable Application Insights
When you’ve deployed a model into production as a service, you’ll want to monitor it to track usage and explore the requests it processes.
# Enable AppInsights
aci_service.update(enable_app_insights=True)
print(aci_service.state)
print('AppInsights enabled!')
Monitoring Data Drift
- Install the DataDriftDetector module
- Create a Baseline Dataset
- Create a Target Dataset
- Create a Data Drift Monitor
- Backfill the Monitor
- Analyze Data Drift
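A hedged sketch of those steps, assuming a registered baseline_data_set and target_data_set (the target needs a timestamp column or partition), an existing compute cluster named 'aml-cluster', and a feature list for the diabetes data:
from azureml.datadrift import DataDriftDetector
import datetime as dt

# Create a data drift monitor that compares the target dataset to the baseline
features = ['Pregnancies', 'Age', 'BMI']
monitor = DataDriftDetector.create_from_datasets(ws, 'mslearn-diabetes-drift',
                                                 baseline_data_set, target_data_set,
                                                 compute_target='aml-cluster',
                                                 frequency='Week',
                                                 feature_list=features,
                                                 drift_threshold=.3,
                                                 latency=24)

# Backfill the monitor over the past six weeks, then analyze the drift metrics
backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())
backfill.wait_for_completion()
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])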