01-Getting_Started_with_Azure_ML Flashcards

1
Q

Run Configuration

A

Defines the Python code execution environment for the script

E.g., sets a Conda environment with some default Python packages installed

from azureml.core.runconfig import RunConfiguration

# create a new RunConfig object
experiment_run_config = RunConfiguration()
2
Q

Script Configuration

A

Identifies the Python script file to be run in the experiment, and the environment in which to run it

from azureml.core import ScriptRunConfig

# Create a script config
src = ScriptRunConfig(source_directory=experiment_folder,
                      script='diabetes_experiment.py',
                      run_config=experiment_run_config)
3
Q

How to set up Model Training

A
  1. Connect to Your Workspace
  2. Create folder for experiment files (data + training script)
  3. Create a Training Script
  4. Use an Estimator to Run the Script as an Experiment
  5. Register the Trained Model
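
A minimal sketch of steps 1 and 2 (assuming a config.json file downloaded from the workspace; the folder name is illustrative):

import os
from azureml.core import Workspace

# Connect to your workspace using the downloaded config.json
ws = Workspace.from_config()
print('Ready to use workspace:', ws.name)

# Create a folder for the experiment files
experiment_folder = 'diabetes-training'
os.makedirs(experiment_folder, exist_ok=True)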
4
Q

Training/Entry Scripts

A
# import libs
# Get the experiment run context
# Load training data
# Separate features and labels
# Split data into training set and test set
# create and Train some model
# Score / predict model
# Evaluate
# Save the model to experiment folder
# Complete the run (run.complete())
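
A minimal sketch of the run-context portions of such a script (the metric name and value are illustrative):

from azureml.core import Run

# Get the experiment run context
run = Run.get_context()

# ... load data, train, and evaluate the model ...

# Log a metric so it appears in the run record
run.log('Accuracy', 0.774)

# Mark the run as complete
run.complete()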
5
Q

Estimator

A

You can run experiment scripts using a RunConfiguration and a ScriptRunConfig, or you can use an Estimator, which abstracts both of these configurations in a single object to run the training experiment.

An estimator runs a training script

6
Q

Create an estimator

A

from azureml.train.estimator import Estimator

estimator = Estimator(source_directory=training_folder,
                      entry_script='diabetes_training.py',
                      compute_target='local',
                      conda_packages=['scikit-learn']
                      )

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)
# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)
7
Q

Create and Run an Experiment

A

experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)
8
Q

RunDetails widget

A

As with any experiment run, you can use the RunDetails widget to view information about the run and get a link to it in Azure Machine Learning studio

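For example, assuming a submitted run object named run:

from azureml.widgets import RunDetails

# Show run details (status, metrics, link to the studio) in the notebook
RunDetails(run).show()
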
9
Q

Retrieve the metrics and outputs from the Run object.

A
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
        print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Output

Regularization Rate 0.01
Accuracy 0.774
AUC 0.8483377282451863

azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/8_azureml.log
outputs/diabetes_model.pkl

10
Q

Register a Trained Model

A

Note that the outputs of the experiment include the trained model file (diabetes_model.pkl).

You can register a model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})
11
Q

Create a Parameterized Training Script

A

You can increase the flexibility of your training experiment by adding parameters to your entry script, enabling you to repeat the same training experiment with different settings

# Set regularization hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg
12
Q

Use a Framework-Specific Estimator

A

You used a generic Estimator class to run the training script, but you can also take advantage of framework-specific estimators that include environment definitions for common machine learning frameworks. In this case, you’re using Scikit-Learn, so you can use the SKLearn estimator. This means that you don’t need to specify the scikit-learn package in the configuration.

# Create an estimator
estimator = SKLearn(source_directory=training_folder,
entry_script='diabetes_training.py',
                    script_params = {'--reg_rate': 0.1},
                    compute_target='local'
                    )
13
Q

Working with Data

A

Data is the foundation on which machine learning models are built. Managing data centrally in the cloud, and making it accessible to teams of data scientists who are running experiments and training models on multiple workstations and compute targets is an important part of any professional data science solution.

14
Q

Datastore

A

In Azure ML, datastores are references to storage locations, such as Azure Storage blob containers. Every workspace has a default datastore - usually the Azure storage blob container that was created with the workspace.

If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.

You can use local data files to train a model, but when running training workloads automatically on cloud-based compute, it makes more sense to store the data centrally in the cloud and ingest it into the training script wherever it happens to be running.

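For example, you can retrieve the default datastore and enumerate all datastores in the workspace:

# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, '- Default =', ds_name == default_ds.name)
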
15
Q

Upload Data to a Datastore

A

You can upload files from your local file system to a datastore so that they will be accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.

default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'], # Upload the diabetes csv files in /data
                        target_path='diabetes-data/', # Put it in a folder path in the datastore
                        overwrite=True, # Replace existing files of the same name
                        show_progress=True)

16
Q

Train a Model from a Datastore

A

When you upload files to the datastore (as in the previous card), the call returns a data reference.

The data reference can be used to download the contents of the folder to the compute context where the data reference is being used

Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to mount the datastore location and read data directly from the data source.

The entry script (via Estimator/experiment) will load the training data from the data reference passed to it as a parameter

# Set up the parameters
script_params = {
    '--regularization': 0.1, # regularization rate
    '--data-folder': data_ref # data reference to download files from datastore
}
# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target = 'local'
                   )
# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)
# Run the experiment
run = experiment.submit(config=estimator)
17
Q

Data reference

A

A data reference provides a way to pass the path to a folder in a datastore to a script, regardless of where the script is being run, so that the script can access data in the datastore location.

The data reference can be used to download the contents of the folder to the compute context where the data reference is being used

Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to mount the datastore location and read data directly from the data source.

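A minimal sketch of creating a data reference from a datastore path (the folder name is illustrative):

# Reference a folder in the datastore, downloading its contents to the compute
data_ref = default_ds.path('diabetes-data').as_download(path_on_compute='diabetes_data')
# ...or mount it instead when running on remote compute
# data_ref = default_ds.path('diabetes-data').as_mount()
print(data_ref)
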
18
Q

Datasets

A

While you can read data directly from datastores, Azure Machine Learning provides a further abstraction for data in the form of datasets.

A dataset is a versioned reference to a specific set of data that you may want to use in an experiment.

Datasets can be tabular or file-based.

It’s easy to convert a tabular dataset to a Pandas dataframe, enabling you to work with the data using common Python techniques.

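For example, assuming a tabular dataset like the one created in the next card:

# Load the tabular dataset into a Pandas dataframe
diabetes_df = tab_data_set.to_pandas_dataframe()
diabetes_df.head()
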
19
Q

Create a Tabular Dataset

A

from azureml.core import Dataset

# Get the default datastore
default_ds = ws.get_default_datastore()
#Create a tabular dataset from the path on the datastore 
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))
20
Q

Create a File Dataset

A

In some machine learning scenarios, you might need to work with unstructured data, or you may simply want to handle reading the data from files in your own code. To accomplish this, you can use a file dataset, which creates a list of file paths in a virtual mount point that you can use to read the data in the files.

# Create a file dataset from the path on the datastore 
file_data_set = Dataset.File.from_files(path=(default_ds, 'diabetes-data/*.csv'))
# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)
21
Q

Register Datasets

A

You can register datasets to make them easily accessible to any experiment being run in the workspace.

You can view and manage datasets on the Datasets page for your workspace in Azure ML Studio or via code.

# Register the tabular dataset
try:
    tab_data_set = tab_data_set.register(workspace=ws, 
                                        name='diabetes dataset',
                                        description='diabetes data',
                                        tags = {'format':'CSV'},
                                        create_new_version=True)
except Exception as ex:
    print(ex)
# Register the file dataset
try:
    file_data_set = file_data_set.register(workspace=ws,
                                            name='diabetes file dataset',
                                            description='diabetes files',
                                            tags = {'format':'CSV'},
                                            create_new_version=True)
except Exception as ex:
    print(ex)

print('Datasets registered')

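For example, to view the registered datasets via code:

from azureml.core import Dataset

# List each registered dataset and its latest version
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print(dataset.name, 'version', dataset.version)
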
22
Q

Train a Model from a Tabular Dataset

A

Now that you have datasets, you’re ready to start training models from them. You can pass datasets to scripts as inputs in the estimator being used to run the script.

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")
# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target = 'local',
                   # Pass the Dataset object as an input...                    
                  inputs=[diabetes_ds.as_named_input('diabetes')], 
                  pip_packages=['azureml-dataprep[pandas]'] 
)
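
Inside the entry script, the named input can then be retrieved from the run context and converted to a dataframe; a minimal sketch:

from azureml.core import Run

run = Run.get_context()
# The dataset passed as 'diabetes' is available through the run's input datasets
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()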
23
Q

Train a Model from a File Dataset

A

When you’re using a file dataset, the dataset input passed to the script represents a mount point containing file paths. How you read the data from these files depends on the kind of data in the files and what you want to do with it.

You can use the Python glob module to create a list of files in the virtual mount point defined by the dataset, and read them all into Pandas dataframes that are concatenated into a single dataframe.

For large volumes of data, you’d generally use the as_mount method to stream the files directly from the dataset source; but when running on local compute, you need to use the as_download option to download the dataset files to a local folder.

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes file dataset")
# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target = 'local',
                    inputs=[diabetes_ds.as_named_input('diabetes').as_download(path_on_compute='diabetes_data')], 
                    pip_packages=['azureml-dataprep[pandas]'] 
                   )
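
Inside the entry script, a minimal sketch of reading the downloaded files with glob (column handling omitted):

import glob
import os
import pandas as pd
from azureml.core import Run

run = Run.get_context()
# For a file dataset input, the named input resolves to the local folder path
data_path = run.input_datasets['diabetes']
# Read all CSV files in the folder into a single dataframe
all_files = glob.glob(os.path.join(data_path, '*.csv'))
diabetes = pd.concat((pd.read_csv(csv_file) for csv_file in all_files))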
24
Q

Working with Compute

A

When you run a script as an Azure Machine Learning experiment, you need to define the execution context for the experiment run. The execution context is made up of:

  • The Python environment for the script, which must include all Python packages used in the script. The compute will require a Python environment with the necessary package dependencies installed
  • The compute target on which the script will be run.

This could be the local workstation from which the experiment run is initiated, or a remote compute target such as a training cluster that is provisioned on-demand.

25
Q

Define an Environment (Run Configuration)

A

When you run a Python script as an experiment in Azure Machine Learning, a Conda environment is automatically created to define the execution context for the script.

Azure Machine Learning provides a default environment that includes many common packages; including the azureml-defaults package that contains the libraries necessary for working with an experiment run, as well as popular packages like pandas and numpy.

You can also define your own environment and add packages by using conda or pip, to ensure your experiment has access to all the libraries it requires.

Example

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create a Python environment for the experiment
diabetes_env = Environment("diabetes-experiment-env")
# Let Azure ML manage dependencies
diabetes_env.python.user_managed_dependencies = False 
# Use a docker container
diabetes_env.docker.enabled = True 
# Create a set of package dependencies (conda or pip as required)
diabetes_packages = CondaDependencies.create(conda_packages=['scikit-learn'], pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]'])
# Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_packages

print(diabetes_env.name, 'defined.')

# Register the environment
diabetes_env.register(workspace=ws)

Use in Estimator:

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      inputs=[diabetes_ds.as_named_input('diabetes')],
                      script_params=script_params,
                      compute_target = 'local',
                      environment_definition = diabetes_env,
                      entry_script='diabetes_training.py')
# Create an experiment
experiment = Experiment(workspace = ws, name = 'diabetes-training')
# Run the experiment
run = experiment.submit(config=estimator)
26
Q

Register the environment

A

Having gone to the trouble of defining an environment with the packages you need, you can register it in the workspace.

# Register the environment
diabetes_env.register(workspace=ws)
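
A registered environment can then be retrieved by name, or listed along with the other environments in the workspace:

from azureml.core import Environment

# Retrieve the registered environment
registered_env = Environment.get(ws, 'diabetes-experiment-env')

# List all environments available in the workspace
for env_name in Environment.list(workspace=ws):
    print('Name:', env_name)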
27
Q

Run an Experiment on a Remote Compute Target

A

In many cases, your local compute resources may not be sufficient to process a complex or long-running experiment that needs to process a large volume of data; and you may want to take advantage of the ability to dynamically create and use compute resources in the cloud.

Azure ML supports a range of compute targets, which you can define in your workspace and use to run experiments; paying for the resources only when using them.

In this case, we’ll run the diabetes training experiment on a compute cluster with a unique name of your choosing

You can do this by specifying the compute_target parameter in the estimator (you can set this to either the name of the compute target, or a ComputeTarget object.)

Example

compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      inputs=[diabetes_ds.as_named_input('diabetes')],
                      script_params=script_params,
                      compute_target = cluster_name, # Run the experiment on the remote compute target
                      environment_definition = registered_env,
                      entry_script='diabetes_training.py')
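
A common pattern (sketched here with a hypothetical cluster name) is to reuse the compute target if it already exists and create it only when needed:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = 'your-compute-cluster'
try:
    # Check for an existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, using it.')
except ComputeTargetException:
    # If it doesn't exist, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    training_cluster.wait_for_completion(show_output=True)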
28
Q

Creating an Azure Machine Learning Pipeline

A

You can perform the various steps required to ingest data, train a model, and register the model individually by using the Azure ML SDK to run script-based experiments.

However, in an enterprise environment it is common to encapsulate the sequence of discrete steps required to build a machine learning solution into a pipeline that can be run on one or more compute targets, either on-demand by a user, from an automated build process, or on a schedule.

29
Q

Create Scripts for Pipeline Steps

A

Pipelines consist of one or more steps, which can be Python scripts, or specialized steps like an Auto ML training estimator or a data transfer step that copies data from one location to another. Each step can run in its own compute context.


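A minimal sketch of a model-registration step script (the file name and model path are illustrative, matching the pipeline defined in the next card):

# register_diabetes.py
import argparse
import joblib
from azureml.core import Model, Run

# Get the folder written by the previous step
parser = argparse.ArgumentParser()
parser.add_argument('--model_folder', type=str, dest='model_folder')
args = parser.parse_args()

run = Run.get_context()

# Load the model produced by the training step and register it in the workspace
model_file = args.model_folder + '/model.pkl'
model = joblib.load(model_file)
Model.register(workspace=run.experiment.workspace,
               model_path=model_file,
               model_name='diabetes_model',
               tags={'Training context': 'Pipeline'})

run.complete()
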
30
Q

Define Example Pipeline

A

from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.train.estimator import Estimator

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")
# Create a PipelineData (Data Reference) for the model folder
model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())

estimator = Estimator(source_directory=experiment_folder,
                      compute_target=pipeline_cluster,
                      environment_definition=pipeline_run_config.environment,
                      entry_script='train_diabetes.py')

# Step 1, run the estimator to train the model
train_step = EstimatorStep(name = "Train Model",
                           estimator=estimator, 
                           estimator_entry_script_arguments=['--output_folder', model_folder],
                           inputs=[diabetes_ds.as_named_input('diabetes_train')],
                           outputs=[model_folder],
                           compute_target = pipeline_cluster,
                           allow_reuse = True)
# Step 2, run the model registration script
register_step = PythonScriptStep(name = "Register Model",
                                source_directory = experiment_folder,
                                script_name = "register_diabetes.py",
                                arguments = ['--model_folder', model_folder],
                                inputs=[model_folder],
                                compute_target = pipeline_cluster,
                                runconfig = pipeline_run_config,
                                allow_reuse = True)

print("Pipeline steps defined")

31
Q

Prepare a Compute Environment for the Pipeline Steps

A

The pipeline will eventually be published and run on-demand, so it needs a compute environment in which to run.

You can use the same compute for all steps, but it’s important to realize that each step runs independently, so you could specify different compute contexts for each step if appropriate.

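A minimal sketch, assuming a pipeline_cluster compute target and a registered_env environment already exist:

from azureml.core.runconfig import RunConfiguration

# Create a run configuration for the pipeline steps
pipeline_run_config = RunConfiguration()
# Use the pipeline compute target
pipeline_run_config.target = pipeline_cluster
# Assign the registered environment to the run configuration
pipeline_run_config.environment = registered_env
print('Run configuration created.')
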
32
Q

Create and Run a Pipeline

A

First you need to define the steps for the pipeline, and any data references that need to be passed between them, using a PipelineData object.

In this case, the first step must write the model to a folder that can be read from by the second step.

Since the steps will be run on remote compute (and in fact, could each be run on different compute), the folder path must be passed as a data reference to a location in a datastore within the workspace.

33
Q

The PipelineData object

A

The PipelineData object is a special kind of data reference that is used to pass data from the output of one pipeline step to the input of another, creating a dependency between them.

34
Q

Build the defined pipeline and run it as an experiment

A

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

# Construct the pipeline
pipeline_steps = [train_step, register_step]
pipeline = Pipeline(workspace = ws, steps=pipeline_steps)
print("Pipeline is built.")
# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'diabetes-training-pipeline')
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)
print("Pipeline submitted for execution.")

RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()

You can also monitor pipeline runs in the Experiments page in Azure Machine Learning studio.

35
Q

Publish a Pipeline

A

When you’ve created a pipeline and verified it works, you can publish it as a REST service

published_pipeline = pipeline.publish(name="Diabetes_Training_Pipeline",
                                      description="Trains diabetes model",
                                      version="1.0")
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)

36
Q

Call a Pipeline

A

To use the endpoint, client applications need to make a REST call over HTTP. This request must be authenticated, so an authorization header is required. A real application would typically authenticate with a service principal.

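A sketch of such a call, using interactive authentication for simplicity (the experiment name is illustrative); the returned run ID can then be used to retrieve the pipeline run, as in the line below:

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Get an authorization header (a service principal would be used in production)
interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

# Call the published pipeline's REST endpoint to start a run
experiment_name = 'diabetes-training-pipeline'
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={"ExperimentName": experiment_name})
run_id = response.json()["Id"]
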
published_pipeline_run = PipelineRun(ws.experiments[experiment_name], run_id)

37
Q

Azure ML Pipelines vs Azure DevOps Pipelines

A

You can use the Azure Machine Learning extension for Azure DevOps to combine Azure ML pipelines with Azure DevOps pipelines and integrate model retraining into a continuous integration/continuous deployment (CI/CD) process.

For example, you could use an Azure DevOps build pipeline to trigger an Azure ML pipeline that trains and registers a model, and when the model is registered it could trigger an Azure DevOps release pipeline that deploys the model as a web service, along with the application or service that consumes the model.

38
Q

Register a Model

A
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Inline Training'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})
39
Q

Deploy a Model as a Web Service

A
  1. We’re going to create a web service to host this model, and this will require some code and configuration files; so let’s create a folder for those.
  2. The web service where we deploy the model will need some Python code to load the input data, get the model from the workspace, and generate and return predictions. We’ll save this code in a scoring script that will be deployed to the web service
    • init() –> # Loads the model when the service is loaded
    • run(input_data) –> # Called when a request is received
  3. The web service will be hosted in an Azure Container Instance (ACI), and the container will need to install any required Python dependencies when it gets initialized. So we’ll create a .yml file that tells the container host to install these packages into the environment.
  4. We’ll deploy the container as a service named diabetes-service. The deployment process includes the following steps:
    • Define an inference configuration (scoring environment), which includes the scoring and environment files required to load and use the model.
    • Define a deployment configuration that defines the execution environment in which the service will be hosted. In this case, an Azure Container Instance.
    • Deploy the model as a web service.
    • Verify the status of the deployed service.

Example

from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig, Model

# Configure the scoring environment
inference_config = InferenceConfig(runtime= "python",
                                   source_directory = folder_name,
                                   entry_script="score_diabetes.py",
                                   conda_file="diabetes_env.yml")

deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)

service_name = "diabetes-service"

service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)

service.wait_for_deployment(True)
print(service.state)

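The .yml dependency file mentioned in step 3 can be generated from a CondaDependencies object; a minimal sketch, assuming the folder_name variable from the example above:

from azureml.core.conda_dependencies import CondaDependencies

# Add the packages the scoring script needs (Azure ML defaults are included automatically)
myenv = CondaDependencies()
myenv.add_conda_package('scikit-learn')

# Save the environment definition as a .yml file in the scoring folder
env_file = folder_name + "/diabetes_env.yml"
with open(env_file, "w") as f:
    f.write(myenv.serialize_to_string())
print('Saved dependency info in', env_file)
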
40
Q

Score Script Example (real-time)

A
%%writefile $folder_name/score_diabetes.py
import json
import joblib
import numpy as np
from azureml.core.model import Model
# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load it
    model_path = Model.get_model_path('diabetes_model')
    model = joblib.load(model_path)
# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)
41
Q

Consume ACI Web Service (SDK) (Real time inferencing)

A

With the service deployed, now you can consume it from a client application.

The code below uses the Azure ML SDK to connect to the containerized web service and use it to generate predictions from your diabetes classification model. In production, a model is likely to be consumed by business applications that do not use the Azure ML SDK, but simply make HTTP requests to the web service.

import json

# This time our input is an array of two feature arrays
x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
         [0,148,58,11,179,39.19207553,0.160829008,45]]

# Convert the array of arrays to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})
# Call the web service, passing the input data
predictions = service.run(input_data = input_json)
# Get the predicted classes.
predicted_classes = json.loads(predictions)

for i in range(len(x_new)):
    print("Patient {}".format(x_new[i]), predicted_classes[i])

42
Q

Example Service Endpoint Uri

A

endpoint = service.scoring_uri
print(endpoint)

http://34733966-1951-4854-8c7c-1173ec0aae1b.northeurope.azurecontainer.io/score

43
Q

Consume ACI Web Service (REST Endpoint) (Real time inferencing)

A

Now that you know the endpoint URI, an application can simply make an HTTP request, sending the patient data in JSON (or binary) format, and receive back the predicted class(es).

import requests
import json

x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
         [0,148,58,11,179,39.19207553,0.160829008,45]]

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})
# Set the content type
headers = { 'Content-Type':'application/json' }
predictions = requests.post(endpoint, input_json, headers = headers)
predicted_classes = json.loads(predictions.json())

for i in range(len(x_new)):
    print("Patient {}".format(x_new[i]), predicted_classes[i])

44
Q

Batch inferencing

A

Used to apply a model to a large volume of data asynchronously, processing the data as a batch rather than scoring individual requests in real time.

45
Q

Create a Pipeline for Batch Inferencing

A
  1. Our pipeline will need Python code to perform the batch inferencing, so let’s create a folder where we can keep all the files used by the pipeline
  2. Now we’ll create a Python batch score/inference script to do the actual work, and save it in the pipeline folder
  3. Define Run Context with dependencies for the scoring script
  4. Define a ParallelRunStep config and a ParallelRunStep that calls the batch scoring script
  5. Create Pipeline including the ParallelRunStep
  6. Run the pipeline as an experiment
  7. Publish the Pipeline and use its REST Interface
46
Q

Batch Score Script Example

A
%%writefile $experiment_folder/batch_diabetes.py
import os
import numpy as np
from azureml.core import Model
import joblib
def init():
    # Runs when the pipeline step is initialized
    global model
    # load the model
    model_path = Model.get_model_path('diabetes_model')
    model = joblib.load(model_path)
def run(mini_batch):
    # This runs for each batch
    resultList = []
    # process each file in the batch
    for f in mini_batch:
        # Read the comma-delimited data into an array
        data = np.genfromtxt(f, delimiter=',')
        # Reshape into a 2-dimensional array for prediction (model expects multiple items)
        prediction = model.predict(data.reshape(1, -1))
        # Append prediction to results
        resultList.append("{}: {}".format(os.path.basename(f), prediction[0]))
    return resultList
47
Q

ParallelRunStep

A

Enables the batch data to be processed in parallel and the results collated in a single output file

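A minimal sketch of defining one (the environment, compute, dataset, and folder names are assumptions):

from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Output location for the collated results
output_dir = PipelineData(name='inferences', datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory=experiment_folder,
    entry_script='batch_diabetes.py',
    mini_batch_size='5',
    error_threshold=10,
    output_action='append_row',
    environment=batch_env,
    compute_target=inference_cluster,
    node_count=2)

parallelrun_step = ParallelRunStep(
    name='batch-score-diabetes',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('diabetes_batch')],
    output=output_dir,
    arguments=[],
    allow_reuse=True)
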
48
Q

Tuning Hyperparameters

A

There are many machine learning algorithms that require hyperparameters (parameter values that influence training, but can’t be determined from the training data itself).

For example, when training a logistic regression model, you can use a regularization rate hyperparameter to counteract bias in the model; or when training a convolutional neural network, you can use hyperparameters like learning rate and batch size to control how weights are adjusted and how many data items are processed in a mini-batch respectively.

The choice of hyperparameter values can significantly affect the performance of a trained model, or the time taken to train it; and often you need to try multiple combinations to find the optimal solution.

49
Q

Example Hyperdrive Experiment

A

from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice
from azureml.widgets import RunDetails

# Sample a range of parameter values
params = GridParameterSampling(
    {
        # There's only one parameter, so grid sampling will try each value - with multiple parameters it would try every combination
        '--regularization': choice(0.001, 0.005, 0.01, 0.05, 0.1, 1.0)
    }
)

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")
# Create an estimator that uses the remote compute
hyper_estimator = SKLearn(source_directory=experiment_folder,
                          inputs=[diabetes_ds.as_named_input('diabetes')], # Pass the dataset as an input...
                          pip_packages=['azureml-sdk'], # ...so we need azureml-dataprep (it's in the SDK!)
                          entry_script='diabetes_training.py',
                          compute_target = training_cluster,)
# Configure hyperdrive settings
hyperdrive = HyperDriveConfig(estimator=hyper_estimator, 
                          hyperparameter_sampling=params, 
                          policy=None, 
                          primary_metric_name='AUC', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=6,
                          max_concurrent_runs=4)
# Run the experiment
experiment = Experiment(workspace = ws, name = 'diabates_training_hyperdrive')
run = experiment.submit(config=hyperdrive)

# Show the status in the notebook as the experiment runs
RunDetails(run).show()
run.wait_for_completion()

50
Q

Hyperdrive Experiments

A

Azure Machine Learning includes a hyperparameter tuning capability through Hyperdrive experiments.

These experiments launch multiple child runs, each with a different hyperparameter combination.

The run producing the best model (as determined by the logged target performance metric for which you want to optimize) can be identified, and its trained model selected for registration and deployment.

51
Q

Hyperparameter Tuning - Determine the Best Performing Run

A

When all of the runs have finished, you can find the best one based on the performance metric you specified (in this case, the one with the best AUC).

for child_run in run.get_children_sorted_by_primary_metric():
    print(child_run)

best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']

print('Best Run Id: ', best_run.id)
print(' -AUC:', best_run_metrics['AUC'])
print(' -Accuracy:', best_run_metrics['Accuracy'])
print(' -Regularization Rate:', parameter_values)

from azureml.core import Model

# Register best model
best_run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                        tags={'Training context':'Hyperdrive'},
                        properties={'AUC': best_run_metrics['AUC'], 'Accuracy': best_run_metrics['Accuracy']})
# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')
52
Q

Automated Machine Learning

A

There are many kinds of machine learning algorithm that you can use to train a model, and sometimes it’s not easy to determine the most effective algorithm for your particular data and prediction requirements.

Additionally, you can significantly affect the predictive performance of a model by preprocessing the training data, using techniques such as normalization, missing feature imputation, and others. In your quest to find the best model for your requirements, you may need to try many combinations of algorithms and preprocessing transformations; which takes a lot of time and compute resources.

Azure Machine Learning enables you to automate the comparison of models trained using different algorithms and preprocessing options. You can use the visual interface in Azure Machine Learning studio or the SDK to leverage this capability. The SDK gives you greater control over the settings for the automated machine learning experiment, but the visual interface is easier to use. In this lab, you’ll explore automated machine learning using the SDK.

53
Q

Automated Machine Learning (SDK)

A

You don’t need to create a training script for automated machine learning, but you do need to:

  1. Create the training and test data (split) and save it to a datastore
  2. Set up a Compute target
  3. Configure the Auto ML Experiment
  4. Run the Automated Machine Learning Experiment
  5. Get the best model
  6. Register the best model (see the sketch below)
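
A minimal sketch of the last two steps, assuming a completed automl_run:

# Get the best run and the fitted model it produced
best_run, fitted_model = automl_run.get_output()
best_run_metrics = best_run.get_metrics()
print('Best run AUC_weighted:', best_run_metrics['AUC_weighted'])

# Register the best model (the model path within the run's outputs is assumed)
best_run.register_model(model_path='outputs/model.pkl',
                        model_name='diabetes_model_automl',
                        tags={'Training context': 'Auto ML'})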
54
Q

Example Auto ML Experiment

A

To configure the automated machine learning experiment, you’ll need a run configuration that includes the required packages for the experiment environment, and a set of configuration settings that specifies how many combinations to try, which metric to use when evaluating models, and so on.

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(name='Automated ML Experiment',
                             task='classification',
                             compute_target=training_cluster,
                             training_data=train_ds,
                             validation_data=test_ds,
                             label_column_name='Diabetic',
                             iterations=6,
                             primary_metric='AUC_weighted',
                             max_concurrent_iterations=2,
                             featurization='auto'
                             )

print("Ready for Auto ML run.")

55
Q

Run an Automated Machine Learning Experiment

A

from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

print('Submitting Auto ML experiment...')
automl_experiment = Experiment(ws, 'diabetes_automl')

automl_run = automl_experiment.submit(automl_config)

RunDetails(automl_run).show()

automl_run.wait_for_completion(show_output=True)

56
Q

Interpreting Models

A

You can use Azure Machine Learning to interpret a model by using an explainer that quantifies the amount of influence each feature contributes to the predicted label.

There are many common explainers, each suitable for different kinds of modeling algorithm; but the basic approach to using them is the same.

57
Q

Explainer

A

Quantifies the amount of influence each feature contributes to the predicted label. That is: How do the features in the data influence the prediction?

There are many kinds of explainer. In this example you’ll use a Tabular Explainer, which is a “black box” explainer that can be used to explain many kinds of model by invoking an appropriate SHAP model explainer.

58
Q

Get an Explainer for our Model

A

Get a suitable explainer for the model from the Azure ML interpretability library

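A minimal sketch, assuming a trained model plus the training features and class labels:

from interpret.ext.blackbox import TabularExplainer

# Create a tabular "black box" explainer for the model
tab_explainer = TabularExplainer(model,
                                 X_train,
                                 features=features,
                                 classes=labels)
print(tab_explainer, 'ready')
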
59
Q

Get Global Feature Importance

A

The first thing to do is try to explain the model by evaluating the overall feature importance - in other words, quantifying the extent to which each feature influences the prediction based on the whole training dataset.

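A minimal sketch, assuming the tab_explainer from the previous card:

# Explain the model over the training data and rank overall feature importance
global_explanation = tab_explainer.explain_global(X_train)
global_importance = global_explanation.get_feature_importance_dict()
for feature, importance in global_importance.items():
    print(feature, ':', importance)
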
Output

Pregnancies : 0.2194762749294642
Age : 0.10575947971825919
BMI : 0.09306316543787874
SerumInsulin : 0.06734976452903166
PlasmaGlucose : 0.05007378902962012
TricepsThickness : 0.021124772576803175
DiastolicBloodPressure : 0.016574790766927222
DiabetesPedigree : 0.016206788169148716
60
Q

Get Local Feature Importance

A

So you have an overall view, but what about explaining individual observations? Let’s generate local explanations for individual predictions, quantifying the extent to which each feature influenced the decision to predict each of the possible label values.

In this case, it’s a binary model, so there are two possible labels (non-diabetic and diabetic); and you can quantify the influence of each feature for each of these label values for individual observations in a dataset. You’ll just evaluate the first two cases in the test dataset.

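A minimal sketch, again assuming tab_explainer and the test data:

# Explain the first two observations in the test set
local_explanation = tab_explainer.explain_local(X_test[0:2])

# Feature names and importance values, ranked per class and per observation
local_features = local_explanation.get_ranked_local_names()
local_importance = local_explanation.get_ranked_local_values()

for label_idx in range(len(local_features)):
    print('Support for', labels[label_idx])
    for obs_idx in range(len(local_features[label_idx])):
        print('\tObservation', obs_idx + 1)
        for feature, value in zip(local_features[label_idx][obs_idx], local_importance[label_idx][obs_idx]):
            print('\t\t', feature, ':', value)
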
Output

Support for not-diabetic
Observation 1
SerumInsulin : 0.36925304330130265
Age : 0.2390809685204034
TricepsThickness : 0.025815337535141827
BMI : 0.012977411808708952
DiabetesPedigree : 0.002921802522673878
DiastolicBloodPressure : -0.015906526133378316
PlasmaGlucose : -0.036300469029731476
Pregnancies : -0.26441299709655025
----------
Total: 0.3334285714285707 Prediction: not-diabetic

Support for diabetic
Observation 1
Pregnancies : 0.26441299709655014
PlasmaGlucose : 0.03630046902973156
DiastolicBloodPressure : 0.015906526133378347
DiabetesPedigree : -0.002921802522673868
BMI : -0.012977411808708974
TricepsThickness : -0.025815337535141855
Age : -0.23908096852040375
SerumInsulin : -0.369253043301303
----------
Total: -0.3334285714285714 Prediction: not-diabetic

61
Q

Adding Explainability to Azure ML Models Training Experiments

A

You can generate explanations for models trained outside of Azure ML; but when you use experiments to train models in your Azure ML workspace, you can generate model explanations and log them.

62
Q

Train and Explain a Model using an Experiment Example

A
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Import Azure ML run library
from azureml.core.run import Run
# Import libraries for model explanation
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient
from interpret.ext.blackbox import TabularExplainer
# Get the experiment run context
run = Run.get_context()
# load the diabetes dataset
print("Loading Data...")
data = pd.read_csv('diabetes.csv')
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
labels = ['not-diabetic', 'diabetic']
# Separate features and labels
X, y = data[features].values, data['Diabetic'].values
# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
# Train a decision tree model
print('Training a decision tree model')
model = DecisionTreeClassifier().fit(X_train, y_train)
# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
run.log('Accuracy', np.float(acc))
# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
run.log('AUC', np.float(auc))
os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes.pkl')
# Get explanation
explainer = TabularExplainer(model, X_train, features=features, classes=labels)
explanation = explainer.explain_global(X_test)
# Get an Explanation Client and upload the explanation
explain_client = ExplanationClient.from_run(run)
explain_client.upload_model_explanation(explanation, comment='Tabular Explanation')
# Complete the run
run.complete()
63
Q

Monitoring a Model - Enable Application Insights

A

When you’ve deployed a model into production as a service, you’ll want to monitor it to track usage and explore the requests it processes.

# Enable AppInsights
aci_service.update(enable_app_insights=True)
print(aci_service.state)
print('AppInsights enabled!')
64
Q

Monitoring Data Drift

A
  1. Install the DataDriftDetector module
  2. Create a Baseline Dataset
  3. Create a Target Dataset
  4. Create a Data Drift Monitor
  5. Backfill the Monitor
  6. Analyze Data Drift
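
A sketch of steps 4 and 5, assuming baseline and target datasets plus a compute cluster name (feature list and monitor name are illustrative):

from azureml.datadrift import DataDriftDetector
import datetime as dt

# Create the data drift monitor for a set of features
feature_list = ['Pregnancies', 'Age', 'BMI']
monitor = DataDriftDetector.create_from_datasets(ws, 'diabetes-drift-monitor',
                                                 baseline_data_set, target_data_set,
                                                 compute_target='your-compute-cluster',
                                                 frequency='Week',
                                                 feature_list=feature_list,
                                                 drift_threshold=.3,
                                                 latency=24)

# Backfill the monitor over the past six weeks and wait for the analysis to finish
backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())
backfill.wait_for_completion(wait_post_processing=True)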