Explore data and train models (35–40%) Flashcards
What is automatically created when a workspace is provisioned?
Azure Storage Account, Azure Key Vault, Application Insights, Azure Container Registry
What are the ways to create a workspace
The user interface in the Azure portal, an Azure Resource Manager (ARM) template, the Azure CLI, the Azure ML Python SDK
The Workspace class has what params?
name, display_name, location, description
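A minimal creation sketch using those params, assuming the v2 Python SDK (azure-ai-ml) and an MLClient already scoped to your subscription and resource group; all the values are placeholders:
from azure.ai.ml.entities import Workspace

ws = Workspace(
    name="aml-workspace",  # placeholder name
    display_name="My AML workspace",
    location="eastus",
    description="Workspace for training experiments",
)
ml_client.workspaces.begin_create(ws)  # ml_client is an authenticated MLClient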
How to give workspace access to others?
Use role-based access control (RBAC)
What are the three general built-in roles?
Owner, contributor, reader
Owner vs contributor
A Contributor can do everything an Owner can except grant access to others
What are the two built-in roles in AML?
AML Data Scientist: can perform all workspace actions except altering or deleting computes and editing workspace settings; AML Compute Operator: can create, change, and manage access to the compute resources
What are the 4 different computes in AML workspace?
Compute Instance, Compute Cluster, Inference Cluster, Attached Compute
Compute Instance details
Managed by workspace, good for small work
Compute cluster details
Workspace-managed, on-demand clusters of CPU or GPU nodes that scale automatically between a minimum and maximum node count
Inference Cluster details
Azure Kubernetes Service cluster for deployed ML models in production
Attached compute details
Lets you attach compute that lives outside the workspace, such as Azure Databricks clusters or Synapse Spark pools
What are AML assets
Models, environments, data, and components
Why do you want to register your model in the workspace
It will be available in your workspace rather than only on your local computer, and it gets a version number
What are environments for
Environments capture everything needed to execute the code, such as packages and environment variables. An environment needs a name and a version to be created
What do you need to create a data asset
Name, version, and path to the asset
What’s the point of components
Components let you reuse and share code. To create one, you need a name, version, code, and environment
What are the 4 options for model training in AML
Designer, Automated ML, Jupyter notebook, run a script as a job
What’s good about designer
Low code drag and drop components that are easy to manage and visualize
What’s good about Automated ML
Automated ML iterates through algorithms and hyperparameters to find the best selection for your use case
What are the different types of jobs?
Command, sweep, pipeline
What is a command job
Job that executes a single script
What is a sweep job
Command job with hyperparameter tuning
What is a pipeline job
A job made up of multiple steps
ML studio: Author tab options?
Notebooks, Automated ML, Designer
ML studio: Assets tab options?
Data, Jobs, Components, Pipelines, Environments, Models, Endpoints
ML studio: Manage tab options?
Compute, Linked Services, Data Labeling
What do you need to authenticate into a workspace?
subscription_id, resource_group, workspace_name
how is MLClient() used after authenticating?
You call methods on the MLClient whenever you interact with the workspace, such as creating or updating assets or resources
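A minimal connection sketch with the v2 SDK; DefaultAzureCredential and the three placeholder IDs are assumptions (any supported credential works):
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    credential=DefaultAzureCredential(),  # reuses e.g. your az CLI login
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace_name>",
)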
What can you do on the azure cli?
Basically, manage anything. It is also good for automating tasks. See: https://learn.microsoft.com/en-us/cli/azure/ml?view=azure-cli-latest
What are the params of a command job?
code: path to the folder holding the training script, command: the command that runs the script, environment: the environment to run the script in, compute: the compute target to run it on, display_name: name of the job, experiment_name: name of the experiment the job belongs to
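A hedged sketch of those params with the command() builder from the v2 SDK; the paths, names, curated environment label, and compute target are all placeholder assumptions:
from azure.ai.ml import command

job = command(
    code="./src",  # folder containing train.py
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # substitute an environment from your workspace
    compute="aml-cluster",
    display_name="train-model",
    experiment_name="training-experiments",
)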
How to create a job with MLClient()
ml_client.jobs.create_or_update(job), called on the already-authenticated MLClient instance (not a fresh MLClient() call)
What is a URI?
Uniform Resource Identifier
What are the common URI protocols?
http(s): public/private Azure Blob Storage or a public web location, abfs(s): Azure Data Lake Storage Gen2, azureml: a datastore in Azure ML
What are the two authentication methods for datastores?
Credential-based: use service principal, shared access signature or account key, Identity-based: use Azure Active Directory identity or managed identity
what are the params of AzureBlobDatastore?
name, description, account_name, container_name, credentials
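A hedged registration sketch using account-key credentials; every name and the key value are placeholders:
from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration

store = AzureBlobDatastore(
    name="blob_training_data",
    description="Blob storage for training data",
    account_name="mystorageaccount",
    container_name="training-data",
    credentials=AccountKeyConfiguration(account_key="<account-key>"),
)
ml_client.create_or_update(store)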
What are the benefits of using data assets?
Share and reuse data with other members, seamlessly access data during model training, version the metadata of the data asset
What are the 3 main types of data assets?
URI file: points to a specific file, URI folder: points to a folder, MLTable: points to a folder or file and includes a schema for reading it as tabular data
When creating a URI file data asset, what are the supported paths?
local: ./<path>, Azure Blob Storage: wasbs://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>, Azure Data Lake Storage Gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>, Datastore: azureml://datastores/<datastore_name>/paths/<folder>/<file>
What are the params of the Data() class for making a data asset?
path to the item, type of item: uses AssetTypes class, description, name, version
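A hedged sketch registering a URI file data asset; the path and names are placeholders:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="./data/sample.csv",  # local path; a datastore or blob URI works too
    type=AssetTypes.URI_FILE,
    description="Sample file data asset",
    name="sample-data",
    version="1",
)
ml_client.data.create_or_update(my_data)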
How to use argparse for a file data asset?
Call argparse.ArgumentParser(), add an argument with add_argument() specifying its type, then call parse_args(). The parsed value (e.g. args.input_data) is a path you can hand to whatever function reads the file
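Inside the training script that would look roughly like this; the --input_data argument name and the CSV format are assumptions:
import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()

# the mounted/downloaded asset behaves like a local file path
df = pd.read_csv(args.input_data)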
How to create a URI folder data asset?
Data(path, type=AssetTypes.URI_FOLDER, description, name, version), then pass the resulting Data object to ml_client.data.create_or_update()
How to use argparse for a folder data asset?
parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
data_path = parser.parse_args().input_data
all_files = glob.glob(data_path + "/*.csv")
This makes a list of the CSV files in the folder (remember to import glob)
How to create a MLTable data asset?
You need a schema definition so you don’t have to redefine the schema every time you read the data; it lives in a yml file stored alongside the data. Then do Data(path, type=AssetTypes.MLTABLE, description, name, version)
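A minimal sketch of that definition file (a yml file literally named MLTable, next to the data); the path pattern and delimiter settings are assumptions:
paths:
  - pattern: ./*.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers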
How to use argparse for a mltable asset?
parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()
tbl = mltable.load(args.input_data)  # requires import mltable
df = tbl.to_pandas_dataframe()
how to list datastores?
stores = ml_client.datastores.list()
for ds in stores:
    print(ds.name)
How to make a datastore?
from azure.ai.ml.entities import <datastoreType>
store = <datastoreType>(name, description, account_name, container_name, credentials)
ml_client.create_or_update(store)
What are the three main AssetTypes?
AssetTypes.URI_FILE, AssetTypes.URI_FOLDER, AssetTypes.MLTABLE
When using ml_client to create or update a data asset, how should the command look?
ml_client.data.create_or_update(<data_asset>)
What do you need to create a compute instance?
A unique name and a size
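A hedged creation sketch; the name (which must be unique within the region) and size are placeholders:
from azure.ai.ml.entities import ComputeInstance

ci = ComputeInstance(name="ci-example-123", size="STANDARD_DS3_v2")
ml_client.begin_create_or_update(ci).result()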
What should you do if you keep forgetting to turn off your compute?
Schedule it to shut off at the end of the day
What are the params for a compute cluster
AmlCompute(name, type, size, location, min_instances, max_instances, idle_time_before_scale_down, tier)
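A hedged sketch with placeholder values throughout:
from azure.ai.ml.entities import AmlCompute

cluster = AmlCompute(
    name="aml-cluster",
    type="amlcompute",
    size="STANDARD_DS11_v2",
    location="westus2",
    min_instances=0,  # scales to zero when idle
    max_instances=2,
    idle_time_before_scale_down=120,  # seconds
    tier="low_priority",
)
ml_client.begin_create_or_update(cluster)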
what does the tier param mean for a compute cluster?
It specifies whether you have priority on the cluster’s nodes. Dedicated nodes are reserved for you; low-priority nodes are cheaper, but you may not get your cluster when demand is high. It’s like need vs want when rolling for loot in an MMO.
What are the three main scenarios where you want a compute cluster?
Running a pipeline job from designer, running an automated ML job, running a script as a job.
After you have created a compute cluster, what three things can you change about it?
the minimum number of nodes, the maximum number of nodes, and the idle time before scaling down
How to view all environments?
for env in ml_client.environments.list():
    print(env.name)
How to view environment details?
env = ml_client.environments.get(name="<environment_name>", version="<version_number>")
print(env.description, env.tags)
What to do when you have a docker image you want to use, but need a few more packages?
Add a conda specification file which will add more dependencies that you need
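A hedged sketch of exactly that, a base Docker image plus a conda specification file; the image tag and file path are assumptions:
from azure.ai.ml.entities import Environment

env = Environment(
    name="docker-image-plus-conda",
    version="1",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./conda-env.yml",  # lists the extra packages you need
    description="Docker image plus conda spec",
)
ml_client.environments.create_or_update(env)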
What is scaling and normalization?
Operations on the data that put all columns on the same scale, so no single column has disproportionate influence on model training
How to configure an automl job?
automl.<model_type>(compute,experiment_name,training_data,target_column_name,primary_metric,n_cross_validations,enable_model_explainability)</model_type>
What sort of data asset does automl need as an input?
Automl needs an MLTable as input
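A hedged classification example tying the last two cards together; the data asset name, target column, compute, and metric are placeholders:
from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

my_training_data_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:diabetes-training:1",  # a registered MLTable data asset
)

classification_job = automl.classification(
    compute="aml-cluster",
    experiment_name="auto-ml-class-dev",
    training_data=my_training_data_input,
    target_column_name="Diabetic",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
)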
how to look up classification primary metrics for automl?
from azure.ai.ml.automl import ClassificationPrimaryMetrics
list(ClassificationPrimaryMetrics)
What sort of limits can you set on an automl job?
timeout_minutes: max time for the whole experiment, trial_timeout_minutes: max time for one trial, max_trials: max number of models to be trained, enable_early_termination: whether to end the experiment if the score isn’t improving in the short term, max_concurrent_trials: limits how many trials run in parallel on the cluster
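Continuing the hedged classification_job sketch from above, limits are set on the job object; the values here are arbitrary:
classification_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=20,
    max_trials=5,
    enable_early_termination=True,
)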
What are the data guardrails for classification models in automl?
Class balancing detection, missing feature values imputation, high cardinality feature detection
What do you need for a training pipeline?
You need scripts to prepare the data and train the model, plus yml files that define each script as a component; then you build the pipeline from those components and run it
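A hedged sketch of wiring two components into a pipeline; the yml file names and the input/output names are assumptions that must match what the component files declare:
from azure.ai.ml import load_component
from azure.ai.ml.dsl import pipeline

# components defined by their yml files
prep_data = load_component(source="./prep-data.yml")
train_model = load_component(source="./train-model.yml")

@pipeline()
def training_pipeline(pipeline_job_input):
    clean = prep_data(input_data=pipeline_job_input)
    train = train_model(training_data=clean.outputs.output_data)
    return {"trained_model": train.outputs.model_output}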
What are discrete hyperparameters?
Hyperparameters that have a finite set of values
What are continuous hyperparameters?
Hyperparameters that can use any values along a scale, resulting in an infinite number of possibilities
How do you set up discrete hyperparameter with Choice()?
You can pass Choice() a Python list, a range, or an arbitrary series of comma-separated values
What are the discrete distributions that are available?
QUniform, QLogUniform, QNormal, QLogNormal
What are the continuous distributions that are available?
Uniform, LogUniform, Normal, LogNormal
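A hedged sketch combining one discrete and one continuous expression from the lists above; the hyperparameter names and ranges are placeholders:
from azure.ai.ml.sweep import Choice, Uniform

# passed later as overrides when calling the command job
batch_size = Choice(values=[16, 32, 64])                 # discrete
learning_rate = Uniform(min_value=0.01, max_value=0.1)   # continuous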
What are the three main types of sampling?
Grid, Random, and Bayesian
What is grid sampling?
Tries every possible combination
What is random sampling?
Randomly chooses values from the search space
What is Bayesian sampling?
Chooses new values based on previous results
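Putting sampling into practice, a hedged sketch of a grid-sampled sweep, assuming job is a command job whose script accepts --reg_rate and logs training_accuracy_score; those names, the compute, and the limits are all placeholders:
from azure.ai.ml.sweep import Choice

command_job_for_sweep = job(
    reg_rate=Choice(values=[0.01, 0.1, 1.0]),
)
sweep_job = command_job_for_sweep.sweep(
    compute="aml-cluster",
    sampling_algorithm="grid",
    primary_metric="training_accuracy_score",
    goal="Maximize",
)
sweep_job.set_limits(max_total_trials=4, max_concurrent_trials=2, timeout=7200)
returned_sweep_job = ml_client.create_or_update(sweep_job)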
What is Sobol Sampling?
Random but with a seed so you can reproduce results
What are some limitations of Bayesian sampling?
You can only use Choice, Uniform, and QUniform parameter expressions, and you can’t use an early-termination policy.
What is an early termination policy for?
So you don’t waste compute and time on tuning trials that are no longer improving
What are the two main parameters of an early termination policy?
evaluation_interval and delay_evaluation
What is evaluation_interval?
How often a termination check is performed, in intervals; each time the script logs the primary metric counts as one interval
What is delay_evaluation?
The number of intervals to wait before the first termination check, so every trial gets a minimum chance to run
What are the three options for early termination policies?
Bandit policy, Median stopping policy, and truncation selection policy
What is bandit policy?
You specify a slack amount, and a trial is stopped when its best metric falls short of the best performing run’s metric by more than the slack amount. You can specify a slack factor instead, which makes the allowed gap a ratio rather than a flat number.
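A hedged sketch attaching a bandit policy to the sweep_job from the earlier sketch; the values are arbitrary:
from azure.ai.ml.sweep import BanditPolicy

sweep_job.early_termination = BanditPolicy(
    slack_amount=0.2,        # absolute gap allowed vs. the best run
    evaluation_interval=1,   # check every interval
    delay_evaluation=5,      # but not before 5 intervals have passed
)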
What is median stopping policy?
A trial is stopped when its best primary metric is worse than the median of the running averages across all trials so far
What is truncation selection policy?
You set a truncation percentage; at each check, trials whose performance falls within that worst percentage are cancelled. EX: with 20%, a trial is stopped if its performance is in the worst 20% of the trials thus far