Explore data and train models (35–40%) Flashcards

(84 cards)

1
Q

What is automatically created when a workspace is provisioned?

A

Azure Storage account, Azure Key Vault, Application Insights, Azure Container Registry

2
Q

What are the ways to create a workspace?

A

The user interface in the Azure portal, an Azure Resource Manager (ARM) template, the Azure CLI, or the Azure ML Python SDK

3
Q

What params does the Workspace class have?

A

name, display_name, location, description

4
Q

How to give workspace access to others?

A

Use role-based access control (RBAC)

5
Q

What are the three general built-in roles?

A

Owner, contributor, reader

6
Q

Owner vs contributor

A

A contributor can do everything an owner can, except give others access

7
Q

What are the two built-in roles in AML?

A

AML Data Scientist: can perform all actions in the workspace except altering or deleting computes and editing workspace settings. AML Compute Operator: can create, change, and manage access to the compute resources

8
Q

What are the 4 different computes in AML workspace?

A

Compute Instance, Compute Cluster, Inference Cluster, Attached Compute

9
Q

Compute Instance details

A

A virtual machine managed by the workspace; good for small development workloads

10
Q

Compute cluster details

A

Workspace-managed, on-demand clusters of CPU or GPU nodes that scale automatically

11
Q

Inference Cluster details

A

An Azure Kubernetes Service (AKS) cluster for deploying ML models to production

12
Q

Attached compute details

A

Allows you to attach compute you manage yourself, such as Azure Databricks or Synapse Spark pools

13
Q

What are AML assets

A

Models, environments, data, and components

14
Q

Why do you want to register your model in the workspace

A

It will be available in your workspace instead of only on your local computer, and it gets a version number

15
Q

What are environments for

A

Environments provide the components needed to execute the code, such as packages or environment variables. An environment needs a name and a version to be created

16
Q

What do you need to create a data asset

A

Name, version, and path to the asset

17
Q

What’s the point of components

A

They allow you to reuse and share code. To create one, you need a name, version, code, and environment

18
Q

What are the 4 options for model training in AML

A

Designer, Automated ML, Jupyter notebook, run a script as a job

19
Q

What’s good about designer

A

Low code drag and drop components that are easy to manage and visualize

20
Q

What’s good about Automated ML?

A

Automated ML will iterate through hyperparameters and algorithms to find the best selection for your use case

21
Q

What are the different types of jobs?

A

Command, sweep, pipeline

22
Q

What is a command job

A

Job that executes a single script

23
Q

What is a sweep job

A

Command job with hyperparameter tuning

24
Q

What is a pipeline job

A

A job composed of multiple steps

25
ML studio: Author tab options?
Notebooks, Automated ML, Designer
26
ML studio: Assets tab options?
Data, Jobs, Components, Pipelines, Environments, Models, Endpoints
27
ML studio: Manage tab options?
Compute, Linked Services, Data Labeling
28
What do you need to authenticate into a workspace?
subscription_id, resource_group, workspace_name
29
how is MLClient() used after authenticating?
You call methods on the MLClient object whenever you interact with the workspace, such as when creating or updating assets or resources
30
What can you do on the azure cli?
Basically, manage anything. It is also good for automating tasks. see: https://learn.microsoft.com/en-us/cli/azure/ml?view=azure-cli-latest
31
What are the params of a command job?
code: path to the folder holding the training script, command: the command that runs the script, environment: environment for the script, compute: compute for the script, display_name, experiment_name
32
How to create a job with MLClient()
ml_client.create_or_update(job)
33
What is a URI?
Uniform Resource Identifier
34
What are the common URI protocols?
http(s): public/private Azure Blob Storage or a public web location, abfs(s): Azure Data Lake Storage Gen2, azureml: a datastore in Azure ML
35
What are the two authentication methods for datastores?
Credential-based: use a service principal, shared access signature (SAS), or account key. Identity-based: use your Azure Active Directory identity or a managed identity
36
what are the params of AzureBlobDatastore?
name, description, account_name, container_name, credentials
37
What are the benefits of using data assets?
Share and reuse data with other members, seamlessly access data during model training, version the metadata of the data asset
38
What are the 3 main types of data assets?
URI file: points to a specific file, URI folder: point to a folder, MLTable: point to a folder or file and includes a schema to read as tabular data
39
When creating a URI file data asset, what are the supported paths?
local: ./<path>, Azure Blob Storage: wasbs://<account_name>.blob.core.windows.net/<container_name>/<path>, Azure Data Lake Storage Gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>, Datastore: azureml://datastores/<datastore_name>/paths/<path>
40
What are the params of the Data() class for making a data asset?
path to the item, type of item: uses AssetTypes class, description, name, version
41
How to use argparse for a file data asset?
Call argparse.ArgumentParser(), add an argument with add_argument() and specify its type, then call parse_args(). Treat the parsed value (e.g. args.input_data) like a normal path, using the appropriate function to read the file
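The argparse steps on this card can be sketched in plain Python. A minimal, hypothetical example — the argument name --input_data and the file name are illustrative; in a real job, parse_args() reads the path that Azure ML passes on the command line:

```python
import argparse

# Build a parser with one typed argument for the data-asset path.
parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str, help="path to the input file")

# Normally you'd call parser.parse_args() with no arguments so the path
# comes from the job's command line; an explicit list is used here only
# so the sketch runs standalone.
args = parser.parse_args(["--input_data", "diabetes.csv"])

# args.input_data now holds the path; open it with whatever function
# fits the file format (csv.reader, pandas.read_csv, ...).
print(args.input_data)
```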
42
How to create a URI folder data asset?
Data(path,type=AssetTypes.URI_FOLDER,description,name,version), then ml_client.data.create_or_update(Data())
43
How to use argparse for a folder data asset?
parser = argparse.ArgumentParser(); parser.add_argument('--input_data', type=str); args = parser.parse_args(); all_files = glob.glob(args.input_data + "/*.csv") — this makes a list of the CSV files in the folder
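A runnable sketch of the folder variant, using a temporary directory to stand in for the mounted URI folder (the folder contents and argument name are made up for the example):

```python
import argparse
import glob
import os
import tempfile

# Hypothetical stand-in for the mounted URI folder data asset.
data_dir = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    with open(os.path.join(data_dir, name), "w") as f:
        f.write("col1,col2\n1,2\n")

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
# Normally parse_args() reads the real command line; a list is passed
# here only so the sketch runs standalone.
args = parser.parse_args(["--input_data", data_dir])

# Collect every CSV inside the folder, as the card describes.
all_files = glob.glob(os.path.join(args.input_data, "*.csv"))
print(sorted(os.path.basename(p) for p in all_files))
```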
44
How to create a MLTable data asset?
You need a schema definition (an MLTable yml file) so you don't have to redefine the schema every time you read the data. Then do Data(path, type=AssetTypes.MLTABLE, description, name, version)
45
How to use argparse for a mltable asset?
parser = argparse.ArgumentParser(); parser.add_argument("--input_data", type=str); args = parser.parse_args(); tbl = mltable.load(args.input_data); df = tbl.to_pandas_dataframe()
46
how to list datastores?
stores = ml_client.datastores.list()
for ds_name in stores:
    print(ds_name.name)
47
How to make a datastore?
from azure.ai.ml.entities import AzureBlobDatastore
store = AzureBlobDatastore(name, description, account_name, container_name, credentials)
ml_client.create_or_update(store)
48
What are the three main AssetTypes?
AssetTypes.URI_FILE, AssetTypes.URI_FOLDER, AssetTypes.MLTABLE
49
When using ml_client to create or update a data asset, how should the command look?
ml_client.data.create_or_update()
50
What do you need to create a compute instance?
A unique name and a size
51
What should you do if you keep forgetting to turn off your compute?
Schedule it to shut off at the end of the day
52
What are the params for a compute cluster
AmlCompute(name, type, size, location, min_instances, max_instances, idle_time_before_scale_down, tier)
53
what does the tier param mean for a compute cluster?
It specifies whether you have priority on the compute cluster. Low priority may be cheap, but you also may not get your cluster. It's like need vs want when rolling for loot in an mmo.
54
What are the three main scenarios where you want a compute cluster?
Running a pipeline job from designer, running an automated ML job, running a script as a job.
55
After you have created a compute cluster, what three things can you change about it?
the minimum number of nodes, the maximum number of nodes, and the idle time before scaling down
56
How to view all environments?
for env in ml_client.environments.list():
    print(env.name)
57
How to view environment details?
env = ml_client.environments.get(name="<environment_name>", version="<version>")
print(env)
58
What to do when you have a docker image you want to use, but need a few more packages?
Add a conda specification file which will add more dependencies that you need
59
What is scaling and normalization?
actions on data that put all columns on the same scale so no one column has unequal influence on the model training
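A minimal stdlib sketch of min-max normalization, one common way to put columns on the same scale (real pipelines would typically use something like scikit-learn's MinMaxScaler; the columns below are made up):

```python
# Min-max normalization: rescale each column to the [0, 1] range so no
# column dominates training purely because of its units.
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

ages = [20, 30, 40, 60]                   # small numeric range
incomes = [20000, 50000, 80000, 140000]   # much larger numeric range

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
# Both columns now span 0.0 to 1.0, so neither has outsized influence.
```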
60
How to configure an automl job?
For classification, e.g.: automl.classification(compute, experiment_name, training_data, target_column_name, primary_metric, n_cross_validations, enable_model_explainability)
61
What sort of data asset does automl need as an input?
Automl needs an MLTable as input
62
how to look up classification primary metrics for automl?
from azure.ai.ml.automl import ClassificationPrimaryMetrics
list(ClassificationPrimaryMetrics)
63
What sort of limits can you set on an automl job?
timeout_minutes: max time for the complete experiment, trial_timeout_minutes: max time for one trial, max_trials: max number of models to be trained, enable_early_termination: whether to end if the score isn't improving in the short term, max_concurrent_trials: limits the number of trials run at once on the compute cluster
64
What are the data guardrails for classification models in automl?
Class balancing detection, missing feature values imputation, high cardinality feature detection
65
What do you need for a training pipeline?
You need the scripts to prepare the data and train the model, plus the yml files that define those scripts as components so they can run; then you build the pipeline from the components and run it
66
What are discrete hyperparameters?
Hyperparameters that have a finite set of values
67
What are continuous hyperparameters?
Hyperparameters that can use any values along a scale, resulting in an infinite number of possibilities
68
How do you set up discrete hyperparameter with Choice()?
You can use a python list, a range, or an arbitrary list of comma-separated values
69
What are the discrete distributions that are available?
QUniform, QLogUniform, QNormal, QLogNormal
70
What are the continuous distributions that are available?
Uniform, LogUniform, Normal, LogNormal
71
What are the three main types of sampling?
Grid, Random, and Bayesian
72
What is grid sampling?
Tries every possible combination
73
What is random sampling?
Randomly chooses values from the search space
74
What is Bayesian sampling?
Chooses new values based on previous results
75
What is Sobol Sampling?
Random but with a seed so you can reproduce results
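The difference between grid and random sampling can be sketched with the stdlib. The search space and hyperparameter names below are made up, and Azure ML's own Choice/sweep classes are not used:

```python
import itertools
import random

# A toy discrete search space; names are illustrative, not the SDK's.
search_space = {
    "learning_rate": [0.01, 0.1, 1.0],
    "batch_size": [16, 32],
}

# Grid sampling: try every possible combination (3 x 2 = 6 trials).
grid = [dict(zip(search_space, combo))
        for combo in itertools.product(*search_space.values())]

# Random sampling: independently pick one value per hyperparameter.
# Fixing a seed (as Sobol sampling does) makes the draw reproducible.
rng = random.Random(0)
random_trial = {name: rng.choice(values) for name, values in search_space.items()}

print(len(grid), random_trial)
```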
76
What are some limitations of Bayesian sampling?
You can only use choice, uniform, and quniform parameter expressions, and you can't use an early-termination policy.
77
What is an early termination policy for?
So you don't waste compute and time endlessly hyperparameterizing
78
What are the two main parameters of an early termination policy?
evaluation_interval and delay_evaluation
79
What is evaluation_interval?
Basically, how often (in intervals) a termination check is performed
80
What is delay_evaluation?
Basically delay termination checks for a minimum number of intervals
81
What are the three options for early termination policies?
Bandit policy, Median stopping policy, and truncation selection policy
82
What is bandit policy?
You specify a slack amount; a trial is stopped if its performance falls below the best performing run by more than the slack amount. You can also specify a slack factor, which expresses the allowed slack as a ratio instead of a flat number.
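A plain-Python sketch of the bandit check, assuming a metric where higher is better. The function name is made up, and the slack-factor arithmetic follows the common best/(1 + slack_factor) formulation, so treat that detail as an assumption rather than the SDK's exact rule:

```python
# Decide whether to stop a trial under a bandit-style policy: a trial is
# cancelled when its metric falls outside the slack allowed around the
# best run so far.
def bandit_should_stop(trial_metric, best_metric,
                       slack_amount=None, slack_factor=None):
    if slack_amount is not None:
        # Flat slack: allowed floor is best minus a fixed amount.
        allowed = best_metric - slack_amount
    else:
        # Ratio slack (assumed formulation): floor is best / (1 + factor).
        allowed = best_metric / (1 + slack_factor)
    return trial_metric < allowed

# Best run so far scored 0.9; slack amount of 0.1 gives a floor of 0.8.
stops_flat = bandit_should_stop(0.75, 0.9, slack_amount=0.1)  # below the floor
keeps_flat = bandit_should_stop(0.85, 0.9, slack_amount=0.1)  # within slack
```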
83
What is median stopping policy?
It stops a trial when its score is worse than the median of the running averages across all trials so far
84
What is truncation selection policy?
You set a truncation percentage. If, at the time of the check, a trial is within the worst X% of trials, that trial is cancelled. EX: if the percentage is 20%, a trial stops if its performance is within the worst 20% of the models thus far