Azure ML SDK Flashcards
Register a Datastore
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='123456abcde789…')
Get default Datastore
ws.get_default_datastore()
Tabular data from multiple csv files
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
Register Tabular data
tab_ds = tab_ds.register(workspace=ws, name='csv_table')
Retrieve Tabular data
`ws.datasets['csv_table']` or `Dataset.get_by_name(ws, 'img_files')` or `img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)`
Register Dataset as new version
Dataset.File.from_files(path=img_paths).register(workspace=ws, name='img_files', create_new_version=True)
Dataset to pandas
df = tab_ds.to_pandas_dataframe()
azureml.core.environment
Azure Machine Learning environments specify the Python packages, environment variables, and software settings around your training and scoring scripts.
Run experiment with specific environment
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.environment import Environment
exp = Experiment(name="myexp", workspace = ws) # Instantiate environment myenv = Environment(name="myenv")
# Add training script to run config runconfig = ScriptRunConfig(source_directory=".", script="train.py")
# Attach compute target to run config runconfig.run_config.target = "local"
# Attach environment to run config runconfig.run_config.environment = myenv
# Submit run run = exp.submit(runconfig)
PythonScriptStep
Runs a specified Python script
DataTransferStep
Uses Azure Data Factory to copy data between data stores.
DatabricksStep
Runs a notebook, script, or compiled JAR on a databricks cluster
ParallelRunStep
Runs a Python script as a distributed task on multiple compute nodes
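Steps like these are assembled into a Pipeline object and submitted as an experiment. A minimal sketch with two PythonScriptStep steps (the script names, source directory, and 'cpu-cluster' compute target are hypothetical):
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Two hypothetical script steps that run on an existing compute target
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='prep.py',
                         compute_target='cpu-cluster')
step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train.py',
                         compute_target='cpu-cluster')

# Assemble the steps into a pipeline and submit it as an experiment
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
pipeline_run = Experiment(workspace=ws, name='training-pipeline').submit(pipeline)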
Passing data between pipeline steps
- Define a named OutputFileDatasetConfig object that references a location in a datastore. If no explicit datastore is specified, the default datastore is used.
- Pass the OutputFileDatasetConfig object as a script argument in steps that run scripts.
- Include code in those scripts to write to the OutputFileDatasetConfig argument as an output or read it as an input.
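A minimal sketch of that pattern (the step scripts, argument names, and 'cpu-cluster' target are hypothetical; no datastore is specified, so the default datastore is used):
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# Named intermediate location in the default datastore
prepped_data = OutputFileDatasetConfig('prepped_data')

# Step 1 writes to the location; step 2 reads it back as an input
step1 = PythonScriptStep(name='prep data', source_directory='scripts',
                         script_name='prep.py', compute_target='cpu-cluster',
                         arguments=['--out-folder', prepped_data])
step2 = PythonScriptStep(name='train model', source_directory='scripts',
                         script_name='train.py', compute_target='cpu-cluster',
                         arguments=['--in-folder', prepped_data.as_input()])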
Pass dataset data as a script argument
You can pass a tabular dataset as a script argument. When you take this approach, the argument received by the script is the unique ID of the dataset in your workspace. In the script, you can then get the workspace from the run context and use it to retrieve the dataset by its ID.
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds],
                                environment=env)
Script:
import argparse
from azureml.core import Run, Dataset

parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='dataset_id')
args = parser.parse_args()
run = Run.get_context()
ws = run.experiment.workspace
dataset = Dataset.get_by_id(ws, id=args.dataset_id)
data = dataset.to_pandas_dataframe()
Pass dataset to script as a named input
In this approach, you use the as_named_input method of the dataset to specify a name for the dataset. Then in the script, you can retrieve the dataset by name from the run context’s input_datasets collection without needing to retrieve it from the workspace. Note that if you use this approach, you still need to include a script argument for the dataset, even though you don’t actually use it to retrieve the dataset.
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds.as_named_input('my_dataset')],
                                environment=env)
Script:
import argparse
from azureml.core import Run

parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_id')
args = parser.parse_args()
run = Run.get_context()
dataset = run.input_datasets['my_dataset']
data = dataset.to_pandas_dataframe()
Two ways to pass either a Tabular or a File dataset to a script
1) Use a script argument for a dataset
For a File dataset, you must specify a mode for the file dataset argument, which can be as_download or as_mount.
2) Use a named input for a dataset
passing File dataset as_download
In most cases, you should use as_download, which copies the files to a temporary location on the compute where the script is being run.
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_download()],
                                environment=env)
passing File dataset as_mount
If you are working with a large amount of data for which there may not be enough storage space on the experiment compute, use as_mount to stream the files directly from their source.
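A sketch mirroring the as_download example above, assuming the same file_ds and env objects:
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_mount()],
                                environment=env)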
OutputFileDatasetConfig
The OutputFileDatasetConfig object is a special kind of dataset that:
- References a location in a datastore for interim storage of data.
- Creates a data dependency between pipeline steps.
You can view an OutputFileDatasetConfig object as an intermediary store for data that must be passed from a step to a subsequent step.
Forcing all pipeline steps to run
pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)
Allow pipeline step to be reused
step1 = PythonScriptStep(…, allow_reuse = True)
continuous distribution types in hyperdrive
normal, uniform, lognormal, loguniform
grid sampling in hyperdrive
Grid sampling can only be employed when all hyperparameters are discrete, and is used to try every possible combination of parameters in the search space.
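A minimal sketch with hypothetical hyperparameter names, all discrete:
from azureml.train.hyperdrive import GridParameterSampling, choice

param_space = {
    '--batch_size': choice(16, 32, 64),
    '--learning_rate': choice(0.001, 0.01, 0.1)
}
param_sampling = GridParameterSampling(param_space)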
random sampling in hyperdrive
Random sampling is used to randomly select a value for each hyperparameter, which can be a mix of discrete and continuous values as shown in the following code example.
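A minimal sketch mixing a discrete and a continuous hyperparameter (the argument names are hypothetical):
from azureml.train.hyperdrive import RandomParameterSampling, choice, normal

param_space = {
    '--batch_size': choice(16, 32, 64),
    '--learning_rate': normal(10, 3)
}
param_sampling = RandomParameterSampling(param_space)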
bayesian sampling in hyperdrive
Bayesian sampling chooses hyperparameter values based on the Bayesian optimization algorithm, which tries to select parameter combinations that will result in improved performance from the previous selection. Both discrete and continuous variables are possible.
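A minimal sketch with hypothetical argument names (Bayesian sampling supports the choice, uniform, and quniform expressions):
from azureml.train.hyperdrive import BayesianParameterSampling, choice, uniform

param_space = {
    '--batch_size': choice(16, 32, 64),
    '--learning_rate': uniform(0.05, 0.1)
}
param_sampling = BayesianParameterSampling(param_space)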
bandit termination policy in hyperdrive
You can use a bandit policy to stop a run if the target performance metric underperforms the best run so far by a specified margin.
azureml.train.hyperdrive.BanditPolicy(slack_amount = 0.2,
evaluation_interval=1,
delay_evaluation=5)
Median stopping policy in hyperdrive
A median stopping policy abandons runs where the target performance metric is worse than the median of the running averages for all runs.
azureml.train.hyperdrive.MedianStoppingPolicy(evaluation_interval=1,
delay_evaluation=5)
Truncation selection policy in hyperdrive
A truncation selection policy cancels the lowest performing X% of runs at each evaluation interval based on the truncation_percentage value you specify for X.
azureml.train.hyperdrive.TruncationSelectionPolicy(truncation_percentage=10, evaluation_interval=1, delay_evaluation=5)
What is needed for running a hyperdrive experiment?
To run a hyperdrive experiment, you need to create a training script just as you would for any other training experiment, except that your script must:
- Include an argument for each hyperparameter you want to vary.
- Log the target performance metric. This enables the hyperdrive run to evaluate the performance of the child runs it initiates, and identify the one that produces the best performing model.
For example, the following script trains a logistic regression model using a --regularization argument to set the regularization rate hyperparameter, and logs the accuracy metric with the name Accuracy.
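A minimal sketch of such a script, assuming a hypothetical data.csv with a label column:
import argparse
import pandas as pd
from azureml.core import Run
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hyperparameter argument that hyperdrive will vary
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01)
args = parser.parse_args()

run = Run.get_context()

# Load data and train with the supplied regularization rate
data = pd.read_csv('data.csv')
X, y = data.drop('label', axis=1).values, data['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression(C=1/args.reg_rate, solver='liblinear').fit(X_train, y_train)

# Log the target performance metric so hyperdrive can compare child runs
run.log('Accuracy', accuracy_score(y_test, model.predict(X_test)))
run.complete()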
list all hyperdrive runs in order of performance
for child_run in hyperdrive_run.get_children_sorted_by_primary_metric():
print(child_run)
retrieve the best performing hyperdrive run
best_run = hyperdrive_run.get_best_run_by_primary_metric()
use the ExplanationClient object to download the explanation
from azureml.contrib.interpret.explanation.explanation_client import ExplanationClient
client = ExplanationClient.from_run_id(workspace=ws,
experiment_name=experiment.experiment_name,
run_id=run.id)
explanation = client.download_model_explanation()
feature_importances = explanation.get_feature_importance_dict()
from interpret.ext.blackbox import MimicExplainer
An explainer that creates a global surrogate model that approximates your trained model and can be used to generate explanations. This explainable model must have the same kind of architecture as your trained model (for example, linear or tree-based).
from interpret.ext.blackbox import TabularExplainer
TabularExplainer - An explainer that acts as a wrapper around various SHAP explainer algorithms, automatically choosing the one that is most appropriate for your model architecture.
from interpret.ext.blackbox import PFIExplainer
PFIExplainer - a Permutation Feature Importance explainer that analyzes feature importance by shuffling feature values and measuring the impact on prediction performance.
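Creating an explainer follows a similar pattern for each type; a minimal sketch with TabularExplainer (the trained model, X_train, feature_names, and class_labels are hypothetical):
from interpret.ext.blackbox import TabularExplainer

tab_explainer = TabularExplainer(model,
                                 X_train,
                                 features=feature_names,
                                 classes=class_labels)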
retrieve global feature importance and get feature importances
explainer.explain_global(X_train).get_feature_importance_dict()
get local feature importance
To retrieve local feature importance from a MimicExplainer or a TabularExplainer, you must call the explain_local() method of your explainer, specifying the subset of cases you want to explain. Then you can use the get_ranked_local_names() and get_ranked_local_values() methods to retrieve dictionaries of the feature names and importance values, ranked by importance.
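A minimal sketch, assuming a tab_explainer created as above and a hypothetical X_test array:
# Explain the first five test cases
local_explanation = tab_explainer.explain_local(X_test[0:5])

# Feature names and importance values, ranked by importance
local_features = local_explanation.get_ranked_local_names()
local_importance = local_explanation.get_ranked_local_values()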
list_model_explanations()
Return a dictionary of explanation metadata such as id, data type, explanation method, model type, and upload time, sorted by upload time.
Scheduling a pipeline for periodic intervals
To schedule a pipeline to run at periodic intervals, you must define a ScheduleRecurrence that determines the run frequency, and use it to create a Schedule.
from azureml.pipeline.core import ScheduleRecurrence, Schedule
daily = ScheduleRecurrence(frequency='Day', interval=1)
Schedule.create(…, recurrence=daily)
Triggering a pipeline run on data changes
To schedule a pipeline to run whenever data changes, you must create a Schedule that monitors a specified path on a datastore.
from azureml.pipeline.core import Schedule
Schedule.create(…, datastore=training_datastore, path_on_datastore='data/training')
scheduling a pipeline
After you have published a pipeline, you can initiate it on demand through its REST endpoint, or you can have the pipeline run automatically based on a periodic schedule or in response to data updates.
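A minimal sketch of publishing from a completed pipeline run and then invoking the REST endpoint (pipeline_run and auth_header are assumed to exist already):
import requests

# Publish the pipeline from a completed run
published_pipeline = pipeline_run.publish_pipeline(name='training_pipeline',
                                                   description='Model training pipeline',
                                                   version='1.0')

# Initiate it on demand through its REST endpoint
response = requests.post(published_pipeline.endpoint,
                         headers=auth_header,
                         json={'ExperimentName': 'run_training_pipeline'})
run_id = response.json()['Id']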
steps needed to publish batch inference pipeline
- Register a model
- Create a scoring script
  - init(): Called when the pipeline is initialized.
  - run(mini_batch): Called for each batch of data to be processed.
- Create a pipeline with a ParallelRunStep
- Publish using run.publish_pipeline
steps needed to publish a real-time inference pipeline
- Register a trained model
- Define an inference configuration
  - A script to load the model and return predictions for submitted data.
  - An environment in which the script will be run.
  Combine these in an azureml.core.model.InferenceConfig.
- Define a deployment configuration
- Deploy model using azureml.core.model.Model.deploy()
defining a deployment config for real-time inference
Now that you have the entry script and environment, you need to configure the compute to which the service will be deployed. If you are deploying to an AKS cluster, you must create the cluster and a compute target for it before deploying:
from azureml.core.compute import ComputeTarget, AksCompute
cluster_name = 'aks-cluster'
compute_config = AksCompute.provisioning_configuration(location='eastus')
production_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
production_cluster.wait_for_completion(show_output=True)
With the compute target created, you can now define the deployment configuration, which sets the target-specific compute specification for the containerized deployment:
from azureml.core.webservice import AksWebservice
classifier_deploy_config = AksWebservice.deploy_configuration(cpu_cores = 1,
memory_gb = 1)
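With the deployment configuration in place, a minimal deployment sketch might look like this (the entry script, environment, and registered model name are hypothetical):
from azureml.core.model import InferenceConfig, Model

classifier_inference_config = InferenceConfig(entry_script='score.py',
                                              source_directory='service_files',
                                              environment=service_env)

model = ws.models['classification_model']
service = Model.deploy(workspace=ws,
                       name='classifier-service',
                       models=[model],
                       inference_config=classifier_inference_config,
                       deployment_config=classifier_deploy_config,
                       deployment_target=production_cluster)
service.wait_for_deployment(show_output=True)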
ParallelRunConfig
The ParallelRunConfig class is used to specify configuration for the ParallelRunStep class. The ParallelRunConfig and ParallelRunStep classes together can be used for any kind of processing job that involves large amounts of data and is not time-sensitive, such as training or scoring. The ParallelRunStep works by breaking up a large job into batches that are processed in parallel. The batch size and degree of parallel processing can be controlled with the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.
To work with the ParallelRunStep class, the following pattern is typical:
- Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, the number of nodes per compute target, and a reference to your custom Python script.
- Create a ParallelRunStep object that uses the ParallelRunConfig object, defines inputs and outputs for the step, and the list of models to use.
- Use the configured ParallelRunStep object in a Pipeline just as you would with pipeline step types defined in the steps package.
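A minimal sketch of that pattern (the environment, compute target, entry script, and input dataset are hypothetical):
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

output_dir = OutputFileDatasetConfig(name='inferences')

# How each mini-batch is processed
parallel_run_config = ParallelRunConfig(
    source_directory='batch_scripts',
    entry_script='batch_scoring_script.py',
    mini_batch_size='5',
    error_threshold=10,
    output_action='append_row',
    environment=batch_env,
    compute_target=inference_cluster,
    node_count=2)

# The step that uses the config, with its inputs and output
parallelrun_step = ParallelRunStep(
    name='batch-score',
    parallel_run_config=parallel_run_config,
    inputs=[batch_data_set.as_named_input('batch_data')],
    output=output_dir,
    allow_reuse=True)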
parallel_run_step.txt
In ParallelRunConfig, if output_action == 'append_row': all values output by run() method invocations will be aggregated into one unique file named parallel_run_step.txt that is created in the output location.
ParallelRunConfig output_action == 'summary_only'
The user script is expected to store the output itself. An output row is still expected for each successful input item processed. The system uses this output only for the error threshold calculation (ignoring the actual value of the row).
run.get_details_with_logs()
Return the status details of the run with log file contents. (dict)
Log a single numeric value with the same metric name repeatedly used (like from within a for loop)
for i in tqdm(range(-10, 10)):
    run.log(name='Sigmoid', value=1 / (1 + np.exp(-i)))
    angle = i / 2.0
Value in portal: Single-variable line chart
Log an array of numeric values
run.log_list(name=’Fibonacci’, value=[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])
Value in portal: single-variable line chart
Log a row with 2 numerical columns repeatedly
run.log_row(name='Cosine Wave', angle=angle, cos=np.cos(angle))
sines['angle'].append(angle)
sines['sine'].append(np.sin(angle))
Value in portal: Two-variable line chart
Log table with 2 numerical columns
run.log_table(name='Sine Wave', value=sines)
Value in portal: Two-variable line chart
Log image
run.log_image(name='food', path='./breadpudding.jpg', plot=None, description='dessert')
Use this method to log an image file or a matplotlib plot to the run. These images will be visible and comparable in the run record.
azureml.pipeline.core.PipelineRun.wait_for_completion()
Wait for the completion of this pipeline run. Returns the status after the wait.