Set up an Azure Machine Learning Workspace Flashcards

1
Q

What is an AZ ML workspace?

A

The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model.

2
Q

What are the ways to create a workspace?

A
  • Use the Azure portal for a point-and-click interface that walks you through each step.
  • Use the Azure Machine Learning SDK for Python to create a workspace on the fly from Python scripts or Jupyter notebooks (see the sketch after this list).
  • Use an Azure Resource Manager template or the Azure Machine Learning CLI when you need to automate or customize the creation with corporate security standards.
  • If you work in Visual Studio Code, use the VS Code extension.
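
A minimal sketch of the SDK route (the workspace name, resource group, and region below are placeholders):

from azureml.core import Workspace

# create the workspace (and, if needed, the resource group) in your subscription
ws = Workspace.create(name='myworkspace',
                      subscription_id='<subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2')

# write a config.json so later scripts can load it with Workspace.from_config()
ws.write_config()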
3
Q

How to create a workspace?

A

Create a workspace
To create a workspace, you need an Azure subscription. If you don’t have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.

Sign in to the Azure portal by using the credentials for your Azure subscription.

In the upper-left corner of Azure portal, select + Create a resource.

Use the search bar to find Machine Learning.

Select Machine Learning.

In the Machine Learning pane, select Create to begin.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace

4
Q

What are the ways to create a workspace?

A
  • AZ portal
  • AZ CLI
  • AZ REST API
  • AZ Resource Manager template
5
Q

How to register a blob container?

A

Use Datastore.register_azure_blob_container():

import os
from azureml.core import Datastore

blob_datastore_name = 'azblobsdk'                 # name of the datastore to register to the workspace
container_name = os.getenv("BLOB_CONTAINER", "")  # name of the Azure blob container
account_name = os.getenv("BLOB_ACCOUNTNAME", "")  # storage account name
account_key = os.getenv("BLOB_ACCOUNT_KEY", "")   # storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws,
                                                         datastore_name=blob_datastore_name,
                                                         container_name=container_name,
                                                         account_name=account_name,
                                                         account_key=account_key)

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data#create-and-register-datastores

6
Q

How to register an Azure file share?

A

# assumes ws, file_datastore_name, file_share_name, account_name, and account_key
# are defined as in the blob container example
from azureml.core import Datastore

file_datastore = Datastore.register_azure_file_share(workspace=ws,
                                                      datastore_name=file_datastore_name,
                                                      file_share_name=file_share_name,
                                                      account_name=account_name,
                                                      account_key=account_key)

7
Q

How to get a specific datastore registered in the current workspace?

A
# Get a named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='your datastore name')
8
Q

How to get a list of all datastores registered in a given workspace?

A
# List all datastores registered in the current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)
9
Q

How to get the default datastore?

A

datastore = ws.get_default_datastore()

10
Q

How to change the default datastore?

A

ws.set_default_datastore(new_default_datastore)

11
Q

Which methods allow you to access datastores during scoring?

A

Batch prediction

12
Q

What is the recommended dataset type for machine learning workflows?

A

We recommend FileDatasets for your machine learning workflows, since the source files can be in any format, which enables a wider range of machine learning scenarios, including deep learning.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets

13
Q

What is the FileDataset type?

A

A FileDataset references single or multiple files in your datastores or public URLs. If your data is already cleansed, and ready to use in training experiments, you can download or mount the files to your compute as a FileDataset object.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets
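
A minimal sketch of creating a FileDataset (the datastore path and the public URL below are illustrative):

from azureml.core import Dataset

# reference files in a registered datastore (datastore and the glob path are placeholders)
datastore_paths = [(datastore, 'animals/**/*.jpg')]
animal_ds = Dataset.File.from_files(path=datastore_paths)

# or reference files by public URL
web_paths = ['https://azureopendatastorage.blob.core.windows.net/mnist/train-images-idx3-ubyte.gz']
mnist_ds = Dataset.File.from_files(path=web_paths)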

14
Q

What is the TabularDataset type?

A

A TabularDataset represents data in a tabular format by parsing the provided file or list of files. This provides you with the ability to materialize the data into a pandas or Spark DataFrame so you can work with familiar data preparation and training libraries without having to leave your notebook. You can create a TabularDataset object from .csv, .tsv, .parquet, .jsonl files, and from SQL query results.

15
Q

How to create a TabularDataSet?

A

from azureml.core import Workspace, Datastore, Dataset

datastore_name = 'your datastore name'

# get existing workspace
workspace = Workspace.from_config()
# retrieve an existing datastore in the workspace by name
datastore = Datastore.get(workspace, datastore_name)

# create a TabularDataset from 3 file paths in datastore
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

weather_ds = Dataset.Tabular.from_delimited_files(path=datastore_paths)

16
Q

How to infer a column type?

A

from azureml.data.dataset_factory import DataType

# web_path points at the source .csv file(s), e.g. a public URL or a datastore path
titanic_ds = Dataset.Tabular.from_delimited_files(path=web_path,
                                                  set_column_types={'Survived': DataType.to_bool()})

The parameter infer_column_types is only applicable for datasets created from delimited files.

17
Q

How to create a dataset from a pandas dataframe?

A

To create a TabularDataset from an in-memory pandas dataframe, write the data to a local file, such as a .csv, and create your dataset from that file.
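
A minimal sketch of that workflow (the dataframe contents, file name, and target path are placeholders):

import pandas as pd
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# hypothetical in-memory data
df = pd.DataFrame({'age': [22, 38, 26], 'survived': [0, 1, 1]})
df.to_csv('local_data.csv', index=False)  # write the dataframe to a local file

# upload the local file to the datastore and build a TabularDataset from it
datastore.upload_files(files=['local_data.csv'], target_path='tmp/', overwrite=True)
ds = Dataset.Tabular.from_delimited_files(path=(datastore, 'tmp/local_data.csv'))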

18
Q

How to share datasets with others across experiments in a workspace?

A

Use the register() method.

titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')

19
Q

How to set a new version for a certain dataset?

A

# create a new version of titanic_ds
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='new titanic training data',
                                 create_new_version=True)
20
Q

What is an AZML compute instance?

A

An Azure Machine Learning compute instance is a managed cloud-based workstation for data scientists.

Compute instances make it easy to get started with Azure Machine Learning development as well as provide management and enterprise readiness capabilities for IT administrators.

Use a compute instance as your fully configured and managed development environment in the cloud for machine learning. They can also be used as a compute target for training and inferencing for development and testing purposes.

For production-grade model training, use an Azure Machine Learning compute cluster with multi-node scaling capabilities. For production-grade model deployment, use an Azure Kubernetes Service cluster.

21
Q

What tools and environments are already preinstalled on the compute instance?

A
  • Drivers: CUDA, cuDNN, NVIDIA, Blob FUSE
  • Intel MPI library
  • Azure CLI
  • Azure Machine Learning samples
  • Docker
  • Nginx
  • NCCL 2.0
  • Protobuf
  • RStudio Server Open Source Edition (preview)
  • R kernel
  • Azure Machine Learning SDK for R
  • Anaconda Python
  • Jupyter and extensions
  • JupyterLab and extensions
  • Azure Machine Learning SDK for Python from PyPI: includes most of the azureml extra packages. To see the full list, open a terminal window on your compute instance and run: conda list -n azureml_py36 azureml*
  • Other PyPI packages: jupytext, tensorboard, nbconvert, notebook, Pillow
  • Conda packages: cython, numpy, ipykernel, scikit-learn, matplotlib, tqdm, joblib, nodejs, nb_conda_kernels
  • Deep learning packages: PyTorch, TensorFlow, Keras, Horovod, MLFlow, pandas-ml, scrapbook
  • ONNX packages: keras2onnx, onnx, onnxconverter-common, skl2onnx, onnxmltools
22
Q

What does a compute target have?

A
  • Has a job queue.
  • Runs jobs securely in a virtual network environment, without requiring enterprises to open up an SSH port. The job executes in a containerized environment and packages your model dependencies in a Docker container.
  • Can run multiple small jobs in parallel (preview). Two jobs per core can run in parallel while the rest of the jobs are queued.
  • Supports single-node multi-GPU distributed training jobs.
23
Q

What’s a script run configuration?

A

A ScriptRunConfig is used to configure the information necessary for submitting a training run as part of an experiment.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets#compute-targets-for-training

24
Q

What does a ScriptRunConfig need?

A
  • source_directory: The source directory that contains your training script
  • script: The training script to run
  • compute_target: The compute target to run on
  • environment: The environment to use when running the script
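Putting the four pieces above together, a minimal sketch (the folder, script name, compute target name, and environment are placeholders):

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='./src',      # folder containing the training script
                      script='train.py',             # the training script to run
                      compute_target='cpu-cluster',  # name of an existing compute target
                      environment=myenv)             # an azureml.core Environment defined earlier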
25
Q

What is the code pattern to submit a training run?

Is it the same for all types of compute targets?

A

Create an experiment to run
Create an environment where the script will run
Create a ScriptRunConfig, which specifies the compute target and environment
Submit the run
Wait for the run to complete

Yes, it is the same for all.

26
Q

What is the code pattern to submit a training run?

Write some Python for all these steps:

A
# get the workspace and create an experiment
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment_name = 'my_experiment'
experiment = Experiment(workspace=ws, name=experiment_name)

# select a compute target
# if no compute target is specified in the ScriptRunConfig, it defaults to 'local'
compute_target = 'local'

# create an environment, e.g. start from a curated one
from azureml.core import Environment

myenv = Environment.get(workspace=ws, name="AzureML-Minimal")

# or define a user-managed environment instead
myenv = Environment("user-managed-env")
myenv.python.user_managed_dependencies = True
# You can choose a specific Python environment by pointing to a Python path
# myenv.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'

# create the script run configuration
# (project_folder and my_compute_target are placeholders defined elsewhere)
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      compute_target=my_compute_target,
                      environment=myenv)

# set the compute target on the run configuration
# (skip this if you are running on your local computer)
src.run_config.target = my_compute_target

# submit the experiment
run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)
27
Q

What is an environment for?

A

Azure Machine Learning environments are an encapsulation of the environment where your machine learning training happens. They specify the Python packages, Docker image, environment variables, and software settings around your training and scoring scripts. They also specify runtimes (Python, Spark, or Docker).

You can either define your own environment, or use an Azure ML curated environment. Curated environments are predefined environments that are available in your workspace by default. These environments are backed by cached Docker images which reduce the run preparation cost. See Azure Machine Learning Curated Environments for the full list of available curated environments.
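
A minimal sketch of both options (the environment names and conda file path are placeholders):

from azureml.core import Environment

# use a curated environment that is available in the workspace by default
curated_env = Environment.get(workspace=ws, name="AzureML-Minimal")

# or define your own environment from a conda specification file
custom_env = Environment.from_conda_specification(name="my-custom-env",
                                                  file_path="./environment.yml")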

28
Q

What prerequisites do you need to install Docker for Windows?

A

BIOS-enabled virtualization

Windows 10 64-bit Professional

29
Q

Apache Hive

A

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

30
Q

Azure Databricks

A

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. … For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Kafka, Event Hub, or IoT Hub.

31
Q

Apache Spark

A

Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools.

32
Q

Azure Data Factory

A

It is the cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.

33
Q

Azure Container Instances

A

This is good for development or testing, not for production workloads!

Use Azure Container Instances for data processing where source data is ingested, processed, and placed in a durable store such as Azure Blob storage. By processing the data with ACI rather than statically-provisioned virtual machines, you can achieve significant cost savings through per-second billing.

34
Q

Microsoft Machine Learning for Apache Spark (MMLSpark)

A

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets.

MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

Salient features:
  • Easily ingest images from HDFS into a Spark DataFrame (example 301)
  • Pre-process image data using transforms from OpenCV (example 302)
  • Featurize images using pre-trained deep neural nets using CNTK (example 301)
  • Train DNN-based image classification models on N-Series GPU VMs on Azure
  • Featurize free-form text data using convenient APIs on top of primitives in SparkML via a single transformer (example 201)
  • Train classification and regression models easily via implicit featurization of data (example 101)
  • Compute a rich set of evaluation metrics including per-instance metrics (example 102)

35
Q

Azure HDInsight

A

Azure HDInsight is a cloud distribution of Hadoop components. Azure HDInsight makes it easy, fast, and cost-effective to process massive amounts of data. You can use the most popular open-source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, R, and more.

36
Q

Azure Data Lake Analytics

A

Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data. Easily develop and run massively parallel data transformation and processing programs in U-SQL, R, Python, and .NET. … With no infrastructure to manage, you can process data on demand, scale instantly, and only pay per job.

37
Q

What are the ways to move data to and from Azure Blob storage?

A

  • Azure Storage Explorer
  • AzCopy
  • Python (see the sketch below)
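
A minimal sketch of the Python route using the azure-storage-blob package (the connection string, container, and file names are placeholders):

from azure.storage.blob import BlobServiceClient

# connect to the storage account
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="mycontainer", blob="data/train.csv")

# upload a local file to blob storage
with open("train.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# download it back
with open("train_copy.csv", "wb") as f:
    f.write(blob.download_blob().readall())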

38
Q

TensorFlow

A

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.