Azure ML SDK Flashcards

1
Q

Register a Datastore

A
from azureml.core import Workspace, Datastore
ws = Workspace.from_config()
# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws, 
                                                  datastore_name='blob_data', 
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='123456abcde789…')
2
Q

Get default Datastore

A

ws.get_default_datastore()

3
Q

Tabular data from multiple csv files

A

from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)

4
Q

Register Tabular data

A

tab_ds = tab_ds.register(workspace=ws, name='csv_table')

5
Q

Retrieve Tabular data

A
`ws.datasets['csv_table']` 
or 
`Dataset.get_by_name(ws, 'csv_table')` 
or 
`tab_ds = Dataset.get_by_name(workspace=ws, name='csv_table', version=2)`
6
Q

Register Dataset as new version

A

Dataset.File.from_files(path=img_paths).register(workspace=ws, name='img_files', create_new_version=True)

7
Q

Dataset to pandas

A

df = tab_ds.to_pandas_dataframe()

8
Q

azureml.core.environment

A

Azure Machine Learning environments specify the Python packages, environment variables, and software settings around your training and scoring scripts.

9
Q

Run experiment with specific environment

A

from azureml.core import ScriptRunConfig, Experiment
from azureml.core.environment import Environment

exp = Experiment(name="myexp", workspace=ws)
# Instantiate environment
myenv = Environment(name="myenv")
# Add training script to run config
runconfig = ScriptRunConfig(source_directory=".", script="train.py")
# Attach compute target to run config
runconfig.run_config.target = "local"
# Attach environment to run config
runconfig.run_config.environment = myenv
# Submit run 
run = exp.submit(runconfig)
10
Q

PythonScriptStep

A

Runs a specified Python script.
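
A minimal construction sketch (step, script, and compute target names are assumptions):

from azureml.pipeline.steps import PythonScriptStep

# runs scripts/prep.py on the named compute target as a pipeline step
step = PythonScriptStep(name='prepare data',
                        source_directory='scripts',
                        script_name='prep.py',
                        compute_target='cpu-cluster')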

11
Q

DataTransferStep

A

Uses Azure Data Factory to copy data between data stores.

12
Q

DatabricksStep

A

Runs a notebook, script, or compiled JAR on a Databricks cluster.

13
Q

ParallelRunStep

A

Runs a Python script as a distributed task on multiple compute nodes.

14
Q

Passing data between pipeline steps

A
  1. Define a named OutputFileDatasetConfig object that references a location in a datastore. If no explicit datastore is specified, the default datastore is used.
  2. Pass the OutputFileDatasetConfig object as a script argument in steps that run scripts (see the sketch below).
  3. Include code in those scripts to write to the OutputFileDatasetConfig argument as an output or read it as an input.
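
A minimal sketch of the pipeline-side wiring, assuming two hypothetical scripts prep.py and train.py and a compute target named 'cpu-cluster':

from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# interim data location; no datastore given, so the default datastore is used
prepped_data = OutputFileDatasetConfig(name='prepped_data')

step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='prep.py',
                         compute_target='cpu-cluster',
                         # prep.py writes its output files to this folder
                         arguments=['--out-folder', prepped_data])

step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train.py',
                         compute_target='cpu-cluster',
                         # train.py reads the same location as an input
                         arguments=['--in-folder', prepped_data.as_input()])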
15
Q

Pass dataset data as a script argument

A

You can pass a tabular dataset as a script argument. When you take this approach, the argument received by the script is the unique ID of the dataset in your workspace. In the script, you can then get the workspace from the run context and use it to retrieve the dataset by its ID.

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds],
                                environment=env)

Script:

import argparse
from azureml.core import Run, Dataset

parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='dataset_id')
args = parser.parse_args()

run = Run.get_context()
ws = run.experiment.workspace
dataset = Dataset.get_by_id(ws, id=args.dataset_id)
data = dataset.to_pandas_dataframe()
16
Q

Pass dataset to script as a named input

A

In this approach, you use the as_named_input method of the dataset to specify a name for the dataset. Then in the script, you can retrieve the dataset by name from the run context's input_datasets collection without needing to retrieve it from the workspace. Note that if you use this approach, you still need to include a script argument for the dataset, even though you don't actually use it to retrieve the dataset.

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds.as_named_input('my_dataset')],
                                environment=env)

Script:

import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_id')
args = parser.parse_args()

run = Run.get_context()
dataset = run.input_datasets['my_dataset']
data = dataset.to_pandas_dataframe()
17
Q

Two ways to pass either a Tabular or File dataset to a script

A

1) Use a script argument for the dataset.
For a File dataset, you must specify a mode for the dataset argument, which can be as_download or as_mount.

2) Use a named input for the dataset.

18
Q

passing File dataset as_download

A

In most cases, you should use as_download, which copies the files to a temporary location on the compute where the script is being run.

ScriptRunConfig(source_directory='my_dir',
                script='script.py',
                arguments=['--ds', file_ds.as_download()],
                environment=env)

19
Q

passing File dataset as_mount

A

If you are working with a large amount of data for which there may not be enough storage space on the experiment compute, use as_mount to stream the files directly from their source.
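
Mirroring the as_download card, a sketch (directory and script names assumed):

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                # stream files from the source instead of copying them locally
                                arguments=['--ds', file_ds.as_mount()],
                                environment=env)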

20
Q

OutputFileDatasetConfig

A

The OutputFileDatasetConfig object is a special kind of dataset that:

  • References a location in a datastore for interim storage of data.
  • Creates a data dependency between pipeline steps.

You can view an OutputFileDatasetConfig object as an intermediary store for data that must be passed from a step to a subsequent step.

21
Q

Forcing all pipeline steps to run

A

pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)

22
Q

Allow pipeline step to be reused

A

step1 = PythonScriptStep(…, allow_reuse=True)

23
Q

continuous distribution types in hyperdrive

A

normal, uniform, lognormal, loguniform

24
Q

grid sampling in hyperdrive

A

Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space.
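
A sketch with two assumed hyperparameters (values illustrative):

from azureml.train.hyperdrive import GridParameterSampling, choice

# grid sampling requires every hyperparameter to be a discrete choice
param_sampling = GridParameterSampling({
    '--batch_size': choice(16, 32, 64),
    '--learning_rate': choice(0.001, 0.01, 0.1)
})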

25
Q

random sampling in hyperdrive

A

Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values as shown in the following code example.
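
A sketch mixing discrete and continuous expressions (parameter names and values illustrative):

from azureml.train.hyperdrive import RandomParameterSampling, choice, normal, uniform

param_sampling = RandomParameterSampling({
    '--batch_size': choice(16, 32, 64, 128),   # discrete
    '--learning_rate': normal(10, 3),          # continuous, normal distribution
    '--keep_probability': uniform(0.05, 0.1)   # continuous, uniform distribution
})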

26
Q

bayesian sampling in hyperdrive

A

Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection. Both discrete and continuous variables are possible.
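
A sketch (note that Bayesian sampling supports only the choice, uniform, and quniform expressions):

from azureml.train.hyperdrive import BayesianParameterSampling, choice, uniform

param_sampling = BayesianParameterSampling({
    '--batch_size': choice(16, 32, 64),
    '--learning_rate': uniform(0.005, 0.1)
})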

27
Q

bandit termination policy in hyperdrive

A

You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin.

azureml.train.hyperdrive.BanditPolicy(slack_amount=0.2,
                                      evaluation_interval=1,
                                      delay_evaluation=5)

28
Q

Median stopping policy in hyperdrive

A

A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs.

azureml.train.hyperdrive.MedianStoppingPolicy(evaluation_interval=1,
                                              delay_evaluation=5)

29
Q

Truncation selection policy in hyperdrive

A

A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval based on the truncation_percentage value you specify for X.

azureml.train.hyperdrive.TruncationSelectionPolicy(truncation_percentage=10, evaluation_interval=1, delay_evaluation=5)

30
Q

What is needed for running a hyperdrive experiment?

A

To run a hyperdrive experiment, you need to create a training script just the way you would do for any other training experiment, except that your script must:

  • Include an argument for each hyperparameter you want to vary.
  • Log the target performance metric. This enables the hyperdrive run to evaluate the performance of the child runs it initiates, and identify the one that produces the best performing model.

For example, the script sketched below trains a logistic regression model using a --regularization argument to set the regularization rate hyperparameter, and logs the accuracy metric with the name Accuracy.
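
A minimal sketch of such a script, assuming a local data.csv with feature columns and a binary 'label' column:

import argparse
import os
import joblib
import pandas as pd
from azureml.core import Run
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hyperparameter argument set by each hyperdrive child run
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01)
args = parser.parse_args()

run = Run.get_context()

data = pd.read_csv('data.csv')
X, y = data.drop(columns='label'), data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LogisticRegression(C=1/args.reg_rate, solver='liblinear').fit(X_train, y_train)

# log the target performance metric so hyperdrive can rank child runs
run.log('Accuracy', float(model.score(X_test, y_test)))

os.makedirs('outputs', exist_ok=True)
joblib.dump(model, 'outputs/model.pkl')
run.complete()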

31
Q

list all hyperdrive runs in order of performance

A

for child_run in hyperdrive_run.get_children_sorted_by_primary_metric():
    print(child_run)

32
Q

retrieve the best performing hyperdrive run

A

best_run = hyperdrive_run.get_best_run_by_primary_metric()

33
Q

use the ExplanationClient object to download the explanation

A

from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient

client = ExplanationClient.from_run_id(workspace=ws,
                                       experiment_name=experiment.experiment_name,
                                       run_id=run.id)
explanation = client.download_model_explanation()
feature_importances = explanation.get_feature_importance_dict()

34
Q

from interpret.ext.blackbox import MimicExplainer

A

An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based).
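
A construction sketch, assuming a fitted model, training features, and feature/class name lists:

from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import DecisionTreeExplainableModel

# the surrogate type should match the trained model's architecture (tree-based here)
explainer = MimicExplainer(model, X_train, DecisionTreeExplainableModel,
                           features=feature_names, classes=class_labels)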

35
Q

from interpret.ext.blackbox import TabularExplainer

A

TabularExplainer - An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture.
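
A construction sketch under the same assumptions:

from interpret.ext.blackbox import TabularExplainer

explainer = TabularExplainer(model, X_train,
                             features=feature_names, classes=class_labels)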

36
Q

from interpret.ext.blackbox import PFIExplainer

A

PFIExplainer - a Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance.
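
A construction sketch; note the extra requirement when explaining:

from interpret.ext.blackbox import PFIExplainer

explainer = PFIExplainer(model, features=feature_names, classes=class_labels)

# unlike the other explainers, PFI's explain_global also requires the true labels
global_explanation = explainer.explain_global(X_test, true_labels=y_test)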

37
Q

retrieve global feature importance and get feature importances

A

explainer.explain_global(X_train).get_feature_importance_dict()

38
Q

get local feature importance

A

To retrieve local feature importance from a MimicExplainer or a TabularExplainer, you must call the explain_local() method of your explainer, specifying the subset of cases you want to explain. Then you can use the get_ranked_local_names() and get_ranked_local_values() methods to retrieve dictionaries of the feature names and importance values, ranked by importance.
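
A sketch for the first 100 test observations (variable names assumed):

# explain a subset of cases
local_explanation = explainer.explain_local(X_test[0:100])

# feature names and importance values per observation, ranked by importance
local_features = local_explanation.get_ranked_local_names()
local_importance = local_explanation.get_ranked_local_values()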

39
Q

list_model_explanations()

A

Return a dictionary of explanation metadata such as id, data type, explanation method, model type, and upload time, sorted by upload time.

40
Q

Scheduling a pipeline for periodic intervals

A

To schedule a pipeline to run at periodic intervals, you must define a ScheduleRecurrence that determines the run frequency, and use it to create a Schedule.

from azureml.pipeline.core import ScheduleRecurrence, Schedule
daily = ScheduleRecurrence(frequency='Day', interval=1)
Schedule.create(…, recurrence=daily)

41
Q

Triggering a pipeline run on data changes

A

To schedule a pipeline to run whenever data changes, you must create a Schedule that monitors a specified path on a datastore.

from azureml.pipeline.core import Schedule

Schedule.create(…, datastore=training_datastore, path_on_datastore='data/training')

42
Q

scheduling a pipeline

A

After you have published a pipeline, you can initiate it on demand through its REST endpoint, or you can have the pipeline run automatically based on a periodic schedule or in response to data updates.
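
For the on-demand REST case, a sketch (experiment name assumed; interactive login shown for brevity):

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# calls to the endpoint must carry an Azure AD authentication header
auth_header = InteractiveLoginAuthentication().get_authentication_header()

response = requests.post(published_pipeline.endpoint,
                         headers=auth_header,
                         json={"ExperimentName": "my_experiment"})
run_id = response.json()["Id"]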

43
Q

steps needed to publish batch inference pipeline

A
  1. Register a model.
  2. Create a scoring script (sketched below) with two functions:
    - init(): Called when the pipeline is initialized.
    - run(mini_batch): Called for each batch of data to be processed.
  3. Create a pipeline with a ParallelRunStep.
  4. Publish using run.publish_pipeline().
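
A minimal scoring-script sketch, assuming a registered model named 'my_model' and CSV input files:

import os
import joblib
import numpy as np
from azureml.core import Model

def init():
    # called once when the pipeline step starts; load the model here
    global model
    model_path = Model.get_model_path('my_model')
    model = joblib.load(model_path)

def run(mini_batch):
    # called per batch of files; return one result row per input item
    results = []
    for file_path in mini_batch:
        data = np.genfromtxt(file_path, delimiter=',')
        prediction = model.predict(data.reshape(1, -1))
        results.append('{}: {}'.format(os.path.basename(file_path), prediction[0]))
    return results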
44
Q

steps needed to publish a real-time inference pipeline

A
  1. Register a trained model.
  2. Define an inference configuration:
    - A script to load the model and return predictions for submitted data.
    - An environment in which the script will run.
    Combine these in an azureml.core.model.InferenceConfig.
  3. Define a deployment configuration.
  4. Deploy the model using azureml.core.model.Model.deploy() (see the sketch below).
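
A sketch of steps 2-4 for an ACI deployment (entry script and service name assumed):

from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# score.py loads the model in init() and returns predictions in run()
inference_config = InferenceConfig(entry_script='score.py', environment=env)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name='my-service',
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)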
45
Q

defining a deployment config for real-time inference

A

Now that you have the entry script and environment, you need to configure the compute to which the service will be deployed. If you are deploying to an AKS cluster, you must create the cluster and a compute target for it before deploying:

from azureml.core.compute import ComputeTarget, AksCompute

cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(location='eastus')
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion(show_output=True)

With the compute target created, you can now define the deployment configuration, which sets the target-specific compute specification for the containerized deployment:

from azureml.core.webservice import AksWebservice

classifier_deploy_config = AksWebservice.deploy_configuration(cpu_cores=1,
                                                              memory_gb=1)

46
Q

ParallelRunConfig

A

The ParallelRunConfig class is used to specify configuration for the ParallelRunStep class. The ParallelRunConfig and ParallelRunStep classes together can be used for any kind of processing job that involves large amounts of data and is not time-sensitive, such as training or scoring. The ParallelRunStep works by breaking up a large job into batches that are processed in parallel. The batch size and degree of parallel processing can be controlled with the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.

To work with the ParallelRunStep class the following pattern is typical:

Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, number of nodes per compute target, and a reference to your custom Python script.

Create a ParallelRunStep object that uses the ParallelRunConfig object and defines the inputs and outputs for the step, along with the list of models to use.

Use the configured ParallelRunStep object in a Pipeline just as you would with pipeline step types defined in the steps package.
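
A sketch of the pattern (script, environment, compute, and dataset names are assumptions):

from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

parallel_run_config = ParallelRunConfig(
    source_directory='scripts',
    entry_script='batch_score.py',   # custom script with init() and run(mini_batch)
    mini_batch_size='5',             # for a FileDataset input, the number of files per batch
    error_threshold=10,
    output_action='append_row',
    environment=batch_env,
    compute_target=compute_target,
    node_count=4)

parallelrun_step = ParallelRunStep(
    name='batch-score',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('batch_data')],
    output=output_dir,
    allow_reuse=True)

pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])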

47
Q

parallel_run_step.txt

A

In ParallelRunConfig, if output_action='append_row': all values output by run() method invocations are aggregated into one unique file named parallel_run_step.txt that is created in the output location.

48
Q

ParallelRunConfig output_action == 'summary_only'

A

User script is expected to store the output by itself. An output row is still expected for each successful input item processed. The system uses this output only for error threshold calculation (ignoring the actual value of the row).

49
Q

run.get_details_with_logs()

A

Return the status details of the run with log file contents. (dict)

50
Q

Log a single numeric value with the same metric name repeatedly used (like from within a for loop)

A

for i in tqdm(range(-10, 10)):
    run.log(name='Sigmoid', value=1 / (1 + np.exp(-i)))
    angle = i / 2.0

Value in portal: Single-variable line chart

51
Q

Log an array of numeric values

A

run.log_list(name='Fibonacci', value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

Value in portal: single-variable line chart

52
Q

Log a row with 2 numerical columns repeatedly

A

run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle))
sines['angle'].append(angle)
sines['sine'].append(np.sin(angle))

Value in portal: Two-variable line chart

53
Q

Log table with 2 numerical columns

A

run.log_table(name='Sine Wave', value=sines)

Value in portal: Two-variable line chart

54
Q

Log image

A

run.log_image(name='food', path='./breadpudding.jpg', plot=None, description='dessert')

Use this method to log an image file or a matplotlib plot to the run. These images will be visible and comparable in the run record.

55
Q

azureml.pipeline.core.PipelineRun.wait_for_completion()

A

Wait for the completion of this pipeline run. Returns the status after the wait.