Azure ML SDK Flashcards
Register a Datastore
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

# Register a new datastore
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='123456abcde789…')
Get default Datastore
ws.get_default_datastore()
Tabular data from multiple csv files
from azureml.core import Dataset

blob_ds = ws.get_default_datastore()
csv_paths = [(blob_ds, 'data/files/current_data.csv'),
             (blob_ds, 'data/files/archive/*.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
Register Tabular data
tab_ds = tab_ds.register(workspace=ws, name='csv_table')
Retrieve Tabular data
`ws.datasets['csv_table']` or `Dataset.get_by_name(ws, 'img_files')` or `img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2)`
Register Dataset as new version
Dataset.File.from_files(path=img_paths).register(workspace=ws, name='img_files', create_new_version=True)
Dataset to pandas
df = tab_ds.to_pandas_dataframe()
azureml.core.environment
Azure Machine Learning environments specify the Python packages, environment variables, and software settings around your training and scoring scripts.
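A minimal sketch of building an environment from conda/pip dependencies (package names are illustrative; ws is the workspace from earlier cards):

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create the environment and attach package dependencies (illustrative package names)
env = Environment(name='sklearn-env')
env.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['scikit-learn', 'pandas'],
    pip_packages=['azureml-defaults'])

# Optionally register it so it can be reused across runs
env.register(workspace=ws)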
Run experiment with specific environment
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.environment import Environment
exp = Experiment(name="myexp", workspace=ws)

# Instantiate environment
myenv = Environment(name="myenv")

# Add training script to run config
runconfig = ScriptRunConfig(source_directory=".", script="train.py")

# Attach compute target to run config
runconfig.run_config.target = "local"

# Attach environment to run config
runconfig.run_config.environment = myenv

# Submit run
run = exp.submit(runconfig)
PythonScriptStep
Runs a specified Python script
DataTransferStep
Uses Azure Data Factory to copy data between data stores.
DatabricksStep
Runs a notebook, script, or compiled JAR on a databricks cluster
ParallelRunStep
Runs a Python script as a distributed task on multiple compute nodes
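A minimal sketch of assembling steps like these into a pipeline and submitting it (step, script, and compute-target names are illustrative; ws is the workspace from earlier cards):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Define a step that runs a script on a named compute target (illustrative names)
step1 = PythonScriptStep(name='prepare data',
                         source_directory='my_dir',
                         script_name='prep.py',
                         compute_target='cpu-cluster')

# Combine the steps into a pipeline and submit it as an experiment run
pipeline = Pipeline(workspace=ws, steps=[step1])
pipeline_run = Experiment(workspace=ws, name='my-pipeline').submit(pipeline)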
Passing data between pipeline steps
- Define a named OutputFileDatasetConfig object that references a location in a datastore. If no explicit datastore is specified, the default datastore is used (see the sketch after this list).
- Pass the OutputFileDatasetConfig object as a script argument in steps that run scripts.
- Include code in those scripts to write to the OutputFileDatasetConfig argument as an output or read it as an input.
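A minimal sketch of the pipeline-side wiring (step, script, argument, and compute-target names are illustrative):

from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

# Intermediate location; with no destination specified, the default datastore is used
prepped_data = OutputFileDatasetConfig('prepped_data')

# Step 1 writes to the location
step1 = PythonScriptStep(name='prepare data',
                         source_directory='my_dir',
                         script_name='step1.py',
                         compute_target='cpu-cluster',
                         arguments=['--out-folder', prepped_data])

# Step 2 reads from it, creating the data dependency between the steps
step2 = PythonScriptStep(name='train model',
                         source_directory='my_dir',
                         script_name='step2.py',
                         compute_target='cpu-cluster',
                         arguments=['--in-folder', prepped_data.as_input()])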
Pass dataset data as a script argument
You can pass a tabular dataset as a script argument. When you take this approach, the argument received by the script is the unique ID for the dataset in your workspace. In the script, you can then get the workspace from the run context and use it to retrieve the dataset by its ID.
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds],
                                environment=env)
Script:
import argparse
from azureml.core import Run, Dataset

# Parse the dataset ID passed as a script argument
parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='dataset_id')
args = parser.parse_args()

# Get the workspace from the run context and retrieve the dataset by ID
run = Run.get_context()
ws = run.experiment.workspace
dataset = Dataset.get_by_id(ws, id=args.dataset_id)
data = dataset.to_pandas_dataframe()
Pass dataset to script as a named input
In this approach, you use the as_named_input method of the dataset to specify a name for the dataset. Then in the script, you can retrieve the dataset by name from the run context’s input_datasets collection without needing to retrieve it from the workspace. Note that if you use this approach, you still need to include a script argument for the dataset, even though you don’t actually use it to retrieve the dataset.
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', tab_ds.as_named_input('my_dataset')],
                                environment=env)
Script:
import argparse
from azureml.core import Run

# The argument is still required, but the dataset is retrieved by name instead
parser = argparse.ArgumentParser()
parser.add_argument('--ds', type=str, dest='ds_id')
args = parser.parse_args()

# Retrieve the dataset from the run context's input_datasets collection
run = Run.get_context()
dataset = run.input_datasets['my_dataset']
data = dataset.to_pandas_dataframe()
Two ways to pass either a Tabular or File dataset to a script
1) Use a script argument for a dataset
For a File dataset, you must specify a mode for the dataset argument, which can be as_download or as_mount.
2) Use a named input for a dataset
passing File dataset as_download
In most cases, you should use as_download, which copies the files to a temporary location on the compute where the script is being run.
script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_download()],
                                environment=env)
passing File dataset as_mount
If you are working with a large amount of data for which there may not be enough storage space on the experiment compute, use as_mount to stream the files directly from their source.
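As a sketch, the configuration mirrors the as_download example above with only the mode changed (file_ds and env are the same assumed objects):

script_config = ScriptRunConfig(source_directory='my_dir',
                                script='script.py',
                                arguments=['--ds', file_ds.as_mount()],
                                environment=env)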
OutputFileDatasetConfig
The OutputFileDatasetConfig object is a special kind of dataset that:
- References a location in a datastore for interim storage of data.
- Creates a data dependency between pipeline steps.
You can view an OutputFileDatasetConfig object as an intermediary store for data that must be passed from a step to a subsequent step.
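A sketch of the script side for the pipeline sketch earlier (argument and file names are illustrative): the producing step receives the argument as a folder path it can write into, and the consuming step receives a path it can read from.

# step1.py — write output into the OutputFileDatasetConfig folder
import argparse, os
parser = argparse.ArgumentParser()
parser.add_argument('--out-folder', type=str, dest='out_folder')
args = parser.parse_args()
os.makedirs(args.out_folder, exist_ok=True)
with open(os.path.join(args.out_folder, 'prepped.csv'), 'w') as f:
    f.write('col1,col2\n1,2\n')  # write whatever data the step produced

# step2.py — read the data back in the next step
import argparse, os
import pandas as pd
parser = argparse.ArgumentParser()
parser.add_argument('--in-folder', type=str, dest='in_folder')
args = parser.parse_args()
df = pd.read_csv(os.path.join(args.in_folder, 'prepped.csv'))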
Forcing all pipeline steps to run
pipeline_run = experiment.submit(train_pipeline, regenerate_outputs=True)
Allow pipeline step to be reused
step1 = PythonScriptStep(…, allow_reuse = True)