Set up an Azure Machine Learning Workspace Flashcards
What is an Azure ML workspace?
The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model.
What are the ways to create a workspace?
- Use the Azure portal for a point-and-click interface to walk you through each step.
- Use the Azure Machine Learning SDK for Python to create a workspace on the fly from Python scripts or Jupyter notebooks (see the sketch after this list).
- Use an Azure Resource Manager template or the Azure Machine Learning CLI when you need to automate or customize the creation with corporate security standards.
- If you work in Visual Studio Code, use the VS Code extension.
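For the SDK route mentioned above, a minimal sketch of creating a workspace from Python; the workspace name, subscription ID, resource group, and region are placeholders you would replace with your own:

from azureml.core import Workspace

# a minimal sketch: create a workspace with the SDK (placeholder values)
ws = Workspace.create(name='myworkspace',
                      subscription_id='<azure-subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2')

# write the config file so later scripts can reconnect with Workspace.from_config()
ws.write_config(path='.azureml')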
How to create a workspace?
Create a workspace
To create a workspace, you need an Azure subscription. If you don’t have an Azure subscription, create a free account before you begin.
- Sign in to the Azure portal by using the credentials for your Azure subscription.
- In the upper-left corner of the Azure portal, select + Create a resource.
- Use the search bar to find Machine Learning.
- Select Machine Learning.
- In the Machine Learning pane, select Create to begin.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace
What are the ways to create a workspace?
- Azure portal
- Azure CLI
- Azure REST API
- Azure Resource Manager template
How to register a blob container?
Use register_azure_blob_container():

import os
from azureml.core import Datastore

blob_datastore_name = 'azblobsdk'  # Name of the datastore to register to the workspace
container_name = os.getenv("BLOB_CONTAINER", "")  # Name of the Azure blob container
account_name = os.getenv("BLOB_ACCOUNTNAME", "")  # Storage account name
account_key = os.getenv("BLOB_ACCOUNT_KEY", "")  # Storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                         datastore_name=blob_datastore_name,
                                                         container_name=container_name,
                                                         account_name=account_name,
                                                         account_key=account_key)
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data#create-and-register-datastores
How to register an Azure file share?
# register an Azure file share; the variable setup mirrors the blob container example above
file_datastore = Datastore.register_azure_file_share(workspace=ws,
                                                     datastore_name=file_datastore_name,
                                                     file_share_name=file_share_name,
                                                     account_name=account_name,
                                                     account_key=account_key)
How to get a specific datastore registered in the current workspace?
# Get a named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')
How to get a list of all datastores registered in a given workspace?
# List all datastores registered in the current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)
How to get the default datastore?
datastore = ws.get_default_datastore()
How to change the default datastore?
ws.set_default_datastore(new_default_datastore)
Which methods allow you to access datastores during scoring?
Batch prediction.
What is the recommended dataset type for machine learning workflows?
We recommend FileDatasets for your machine learning workflows, since the source files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets
What is the FileDataset type?
A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed, and ready to use in training experiments, you can download or mount the files to your compute as a FileDataset object.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets
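A minimal sketch of creating a FileDataset; the datastore variable and the 'animals/**/*.jpg' path pattern are assumptions for illustration:

from azureml.core import Dataset

# a minimal sketch: reference files in a registered datastore as a FileDataset
# (the path pattern below is hypothetical)
datastore_paths = [(datastore, 'animals/**/*.jpg')]
animal_ds = Dataset.File.from_files(path=datastore_paths)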
What is the TabularDataset type?
A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without having to leave your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, .jsonl files, and from SQL query results.
How to create a TabularDataSet?
from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()

# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)
How to infer a column type?
from azureml.data.dataset_factory import DataType

titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path,
                                                  set_column_types={'Survived': DataType.to_bool()})
The parameter infer_column_types is only applicable for datasets created from delimited files.
How to create a dataset from a pandas dataframe?
To create a TabularDataset from an in-memory pandas dataframe, write the data to a local file, such as a CSV, and create your dataset from that file.
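A minimal sketch of that approach; the dataframe contents, the local file name, and the 'titanic/' target path are hypothetical, and datastore is assumed to be a registered blob datastore:

import pandas as pd
from azureml.core import Dataset

# hypothetical in-memory dataframe
df = pd.DataFrame({'Survived': [0, 1, 1], 'Age': [22, 38, 26]})

# write it to a local CSV file first
local_path = 'titanic_local.csv'
df.to_csv(local_path, index=False)

# upload the file to a datastore, then create a TabularDataset from it
datastore.upload_files(files=[local_path], target_path='titanic/', overwrite=True)
titanic_ds = Dataset.Tabular.from_delimited_files(path=[(datastore, 'titanic/titanic_local.csv')])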
How to share datasets with others across experiments in a workspace?
Use the register() method.
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')
How to set a new version for a certain dataset?
# create a new version of titanic_ds by registering under the same name with create_new_version=True
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='new titanic training data',
                                 create_new_version=True)
What is an Azure ML compute instance?
An Azure Machine Learning compute instance is a managed cloud-based workstation for data scientists.
Compute instances make it easy to get started with Azure Machine Learning development as well as provide management and enterprise readiness capabilities for IT administrators.
Use a compute instance as your fully configured and managed development environment in the cloud for machine learning. They can also be used as a compute target for training and inferencing for development and testing purposes.
For production grade model training, use an Azure Machine Learning compute cluster with multi-node scaling capabilities. For production grade model deployment, use Azure Kubernetes Service cluster.
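A minimal sketch of provisioning such a multi-node compute cluster with the SDK; the cluster name 'cpu-cluster' and the VM size are assumptions:

from azureml.core.compute import AmlCompute, ComputeTarget

# a minimal sketch: provision a multi-node training cluster (placeholder name and size)
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2',
                                                       min_nodes=0,
                                                       max_nodes=4)
cpu_cluster = ComputeTarget.create(ws, 'cpu-cluster', compute_config)
cpu_cluster.wait_for_completion(show_output=True)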
What tools and environments are already preinstalled on the compute instance?
- Drivers: CUDA, cuDNN, NVIDIA, Blob FUSE
- Intel MPI library
- Azure CLI
- Azure Machine Learning samples
- Docker
- Nginx
- NCCL 2.0
- Protobuf
- RStudio Server Open Source Edition (preview)
- R kernel
- Azure Machine Learning SDK for R
- Anaconda Python
- Jupyter and extensions
- JupyterLab and extensions
- Azure Machine Learning SDK for Python from PyPI (includes most of the azureml extra packages; to see the full list, open a terminal window on your compute instance and run conda list -n azureml_py36 azureml*)
- Other PyPI packages: jupytext, tensorboard, nbconvert, notebook, Pillow
- Conda packages: cython, numpy, ipykernel, scikit-learn, matplotlib, tqdm, joblib, nodejs, nb_conda_kernels
- Deep learning packages: PyTorch, TensorFlow, Keras, Horovod, MLFlow, pandas-ml, scrapbook
- ONNX packages: keras2onnx, onnx, onnxconverter-common, skl2onnx, onnxmltools
What does a compute target have?
- Has a job queue.
- Runs jobs securely in a virtual network environment, without requiring enterprises to open up the SSH port. The job executes in a containerized environment and packages your model dependencies in a Docker container.
- Can run multiple small jobs in parallel (preview). Two jobs per core can run in parallel while the rest of the jobs are queued.
- Supports single-node multi-GPU distributed training jobs.
What’s a script run configuration?
A ScriptRunConfig is used to configure the information necessary for submitting a training run as part of an experiment.
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets#compute-targets-for-training
What does a ScriptRunConfig need?
- source_directory: The source directory that contains your training script
- script: The training script to run
- compute_target: The compute target to run on
- environment: The environment to use when running the script
What is the code pattern to submit a training run?
Is it the same for all types of compute targets?
- Create an experiment to run
- Create an environment where the script will run
- Create a ScriptRunConfig, which specifies the compute target and environment
- Submit the run
- Wait for the run to complete
Yes, it is the same for all.
What is the code pattern to submit a training run?
Write some Python for all these steps:
# create an experiment
from azureml.core import Experiment

experiment_name = 'my_experiment'
experiment = Experiment(workspace=ws, name=experiment_name)

# select a compute target
# if no compute target is specified in the ScriptRunConfig, it defaults to local
compute_target = 'local'

# create an environment: either get a curated environment...
from azureml.core import Workspace, Environment

ws = Workspace.from_config()
myenv = Environment.get(workspace=ws, name="AzureML-Minimal")

# ...or define a user-managed environment
myenv = Environment("user-managed-env")
myenv.python.user_managed_dependencies = True
# You can choose a specific Python environment by pointing to a Python path
# myenv.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'

# create the script run configuration
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      compute_target=my_compute_target,
                      environment=myenv)

# set the compute target (my_compute_target is a compute target you have provisioned)
# skip this if you are running on your local computer
src.run_config.target = my_compute_target

# submit the experiment
run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)
What is an environment for?
Azure Machine Learning environments are an encapsulation of the environment where your machine learning training happens. They specify the Python packages, Docker image, environment variables, and software settings around your training and scoring scripts. They also specify runtimes (Python, Spark, or Docker).
You can either define your own environment, or use an Azure ML curated environment. Curated environments are predefined environments that are available in your workspace by default. These environments are backed by cached Docker images which reduce the run preparation cost. See Azure Machine Learning Curated Environments for the full list of available curated environments.
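As a minimal sketch of defining your own environment from package specifications (the environment name and package list here are just examples):

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# a minimal sketch: define a custom environment from conda/pip packages (example names)
myenv = Environment(name="my-training-env")
myenv.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['scikit-learn'],
    pip_packages=['azureml-defaults'])

# register the environment so it can be reused and versioned in the workspace
myenv.register(workspace=ws)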
What prerequisites do you need to install Docker for Windows?
- BIOS-enabled virtualization
- Windows 10 64-bit Professional
Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Azure Databricks
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Kafka, Event Hub, or IoT Hub.
Apache Spark
Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools.
Azure Data Factory
It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.
Azure Container Instances
This is good for development or testing, not for production workloads!
Use Azure Container Instances for data processing where source data is ingested, processed, and placed in a durable store such as Azure Blob storage. By processing the data with ACI rather than statically-provisioned virtual machines, you can achieve significant cost savings through per-second billing.
Microsoft Machine Learning for Apache Spark (MMLSpark)
MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets.
MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.
Salient features:
- Easily ingest images from HDFS into a Spark DataFrame (example 301)
- Pre-process image data using transforms from OpenCV (example 302)
- Featurize images using pre-trained deep neural nets using CNTK (example 301)
- Train DNN-based image classification models on N-Series GPU VMs on Azure
- Featurize free-form text data using convenient APIs on top of primitives in SparkML via a single transformer (example 201)
- Train classification and regression models easily via implicit featurization of data (example 101)
- Compute a rich set of evaluation metrics including per-instance metrics (example 102)
Azure HDInsight
Azure HDInsight is a cloud distribution of Hadoop components. Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more.
Azure Data Lake Analytics
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Easily develop and run massively parallel data transformation and processing programs in U-SQL, R, Python, and .NET. With no infrastructure to manage, you can process data on demand, scale instantly, and only pay per job.
What are the ways to move data to and from Azure Blob storage?
- Azure Storage Explorer
- AzCopy
- Python (Azure Storage SDK for Python; see the sketch after this list)
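A minimal sketch of the Python route using the azure-storage-blob package; the connection string, container name, and file names are placeholders:

from azure.storage.blob import BlobServiceClient

# a minimal sketch: upload and download a blob (placeholder connection string and names)
service = BlobServiceClient.from_connection_string('<storage-connection-string>')
blob = service.get_blob_client(container='mycontainer', blob='data/input.csv')

# upload a local file to blob storage
with open('input.csv', 'rb') as f:
    blob.upload_blob(f, overwrite=True)

# download it back to a local copy
with open('input_copy.csv', 'wb') as f:
    f.write(blob.download_blob().readall())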
TensorFlow
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.