Set up an Azure Machine Learning Workspace Flashcards
What is an AZ ML workspace?
The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model.
What are the ways to create a workspace?
- Use the Azure portal for a point-and-click interface to walk you through each step.
- Use the Azure Machine Learning SDK for Python to create a workspace on the fly from Python scripts or Jupyter notebooks (see the sketch after this list).
- Use an Azure Resource Manager template or the Azure Machine Learning CLI when you need to automate or customize the creation with corporate security standards.
- If you work in Visual Studio Code, use the VS Code extension.
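For the SDK route, here is a minimal sketch of creating a workspace from Python; the workspace name, resource group, and region are placeholder assumptions:

from azureml.core import Workspace

# placeholder names and region; substitute your own values
ws = Workspace.create(name='myworkspace',
                      subscription_id='<subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2')

# write a config.json so later scripts can reconnect with Workspace.from_config()
ws.write_config()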
How to create a workspace?
To create a workspace, you need an Azure subscription. If you don’t have one, create a free account before you begin; you can try the free or paid version of Azure Machine Learning.
Sign in to the Azure portal by using the credentials for your Azure subscription.
In the upper-left corner of the Azure portal, select + Create a resource.
Use the search bar to find Machine Learning.
Select Machine Learning.
In the Machine Learning pane, select Create to begin.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace
What are the ways to create a workspace?
- Azure portal
- Azure CLI
- Azure REST API
- Azure Resource Manager template
How to register a blob container?
Use register_azure_blob_container().

import os
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

blob_datastore_name = 'azblobsdk'  # name of the datastore to register with the workspace
container_name = os.getenv("BLOB_CONTAINER", "")  # name of the Azure blob container
account_name = os.getenv("BLOB_ACCOUNTNAME", "")  # storage account name
account_key = os.getenv("BLOB_ACCOUNT_KEY", "")  # storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                         datastore_name=blob_datastore_name,
                                                         container_name=container_name,
                                                         account_name=account_name,
                                                         account_key=account_key)
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data#create-and-register-datastores
How to register an Azure file share?
# file_datastore_name, file_share_name, account_name, and account_key
# are defined the same way as in the blob container example above
file_datastore = Datastore.register_azure_file_share(workspace=ws,
                                                     datastore_name=file_datastore_name,
                                                     file_share_name=file_share_name,
                                                     account_name=account_name,
                                                     account_key=account_key)
How to get a specific datastore registered in the current workspace?
# Get a named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')
How to get a list of all datastores registered in a given workspace?
# List all datastores registered in the current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)
How to get the default datastore?
datastore = ws.get_default_datastore()
How to change the default datastore?
ws.set_default_datastore(new_default_datastore)
Which methods allow you to access datastores during scoring?
Batch prediction
What is the recommended dataset type for machine learning workflows?
We recommend FileDatasets for your machine learning workflows, since the source files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets
What is the FileDataset type?
A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed, and ready to use in training experiments, you can download or mount the files to your compute as a FileDataset object.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets
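A minimal sketch of creating a FileDataset with Dataset.File.from_files; the datastore path 'animals/**' is a placeholder assumption:

from azureml.core import Workspace, Dataset

workspace = Workspace.from_config()
datastore = workspace.get_default_datastore()

# reference every file under a folder in the datastore; the path is a placeholder
file_ds = Dataset.File.from_files(path=(datastore, 'animals/**'))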
What is the TabularDataset type?
A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without having to leave your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, .jsonl files, and from SQL query results.
How to create a TabularDataSet?
from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()

# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
How to set or infer column types?
from azureml.data.dataset_factory import DataType
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path, set_column_types={'Survived': DataType.to_bool()})
The parameter infer_column_types is only applicable for datasets created from delimited files.
How to create a dataset from a pandas dataframe?
To create a TabularDataset from an in-memory pandas dataframe, write the data to a local file, such as a .csv, and create your dataset from that file, as sketched below.
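A minimal sketch of that approach, reusing the datastore object from the earlier examples; the dataframe contents and paths are placeholder assumptions:

import pandas as pd
from azureml.core import Dataset

# placeholder dataframe standing in for your real data
df = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, 38.0]})
df.to_csv('train_local.csv', index=False)

# upload the local file to the datastore, then build the dataset from it
datastore.upload_files(files=['train_local.csv'], target_path='train-data/', overwrite=True)
train_ds = Dataset.Tabular.from_delimited_files(path=(datastore, 'train-data/train_local.csv'))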
How to share datasets with others across experiments in a workspace?
Use the register() method.
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')
How to set a new version for a certain dataset?
# create a new version of titanic_ds
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='new titanic training data',
                                 create_new_version=True)
What is an AZML compute instance?
An Azure Machine Learning compute instance is a managed cloud-based workstation for data scientists.
Compute instances make it easy to get started with Azure Machine Learning development as well as provide management and enterprise readiness capabilities for IT administrators.
Use a compute instance as your fully configured and managed development environment in the cloud for machine learning. They can also be used as a compute target for training and inferencing for development and testing purposes.
For production-grade model training, use an Azure Machine Learning compute cluster with multi-node scaling capabilities. For production-grade model deployment, use an Azure Kubernetes Service cluster.
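A minimal sketch of creating a compute instance with the Python SDK; the instance name and VM size are placeholder assumptions:

from azureml.core.compute import ComputeInstance, ComputeTarget

# placeholder name and VM size
instance_config = ComputeInstance.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                             ssh_public_access=False)
instance = ComputeTarget.create(ws, 'my-compute-instance', instance_config)
instance.wait_for_completion(show_output=True)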
What tools and environments are already preinstalled on the compute instance?
- Drivers: CUDA, cuDNN, NVIDIA, Blob FUSE
- Intel MPI library
- Azure CLI
- Azure Machine Learning samples
- Docker
- Nginx
- NCCL 2.0
- Protobuf
- R: RStudio Server Open Source Edition (preview), R kernel, Azure Machine Learning SDK for R
- Python: Anaconda Python, Jupyter and extensions, JupyterLab and extensions
- Azure Machine Learning SDK for Python from PyPI: includes most of the azureml extra packages. To see the full list, open a terminal window on your compute instance and run conda list -n azureml_py36 azureml*
- Other PyPI packages: jupytext, tensorboard, nbconvert, notebook, Pillow
- Conda packages: cython, numpy, ipykernel, scikit-learn, matplotlib, tqdm, joblib, nodejs, nb_conda_kernels
- Deep learning packages: PyTorch, TensorFlow, Keras, Horovod, MLFlow, pandas-ml, scrapbook
- ONNX packages: keras2onnx, onnx, onnxconverter-common, skl2onnx, onnxmltools
What does a compute target have?
- Has a job queue.
- Runs jobs securely in a virtual network environment, without requiring enterprises to open up the SSH port. The job executes in a containerized environment and packages your model dependencies in a Docker container.
- Can run multiple small jobs in parallel (preview). Two jobs per core can run in parallel while the rest of the jobs are queued.
- Supports single-node multi-GPU distributed training jobs
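For the production-grade training mentioned above, a compute cluster can be provisioned in a similar way. A minimal sketch; the cluster name, VM size, and autoscale bounds are placeholder assumptions:

from azureml.core.compute import AmlCompute, ComputeTarget

# placeholder name, VM size, and autoscale bounds
cluster_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                       min_nodes=0,
                                                       max_nodes=4)
cluster = ComputeTarget.create(ws, 'cpu-cluster', cluster_config)
cluster.wait_for_completion(show_output=True)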
What’s a script run configuration?
A ScriptRunConfig is used to configure the information necessary for submitting a training run as part of an experiment.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets#compute-targets-for-training
What does a ScriptRunConfig need?
- source_directory: The source directory that contains your training script
- script: The training script to run
- compute_target: The compute target to run on
- environment: The environment to use when running the script
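Putting those four pieces together, a minimal sketch of configuring and submitting a run; the source directory, script name, compute target name, curated environment name, and experiment name are placeholder assumptions:

from azureml.core import Experiment, Environment, ScriptRunConfig, Workspace

ws = Workspace.from_config()

# placeholder curated environment and compute target names
env = Environment.get(ws, name='AzureML-Minimal')
src = ScriptRunConfig(source_directory='./src',
                      script='train.py',
                      compute_target='cpu-cluster',
                      environment=env)

# submit the configured script as an experiment run
run = Experiment(ws, 'my-experiment').submit(src)
run.wait_for_completion(show_output=True)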