Explore data and train models (35–40%) Flashcards

1
Q

What is automatically created when a workspace is provisioned?

A

Azure Storage account, Azure Key Vault, Application Insights, Azure Container Registry

2
Q

What are the ways to create a workspace?

A

The user interface in the Azure portal, an Azure Resource Manager (ARM) template, the Azure CLI, or the Azure ML Python SDK

3
Q

The Workspace class has what parameters?

A

name, display_name, location, description
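
A minimal sketch of creating a workspace with the Python SDK v2; the workspace name, region, and the already-authenticated ml_client are assumptions for illustration:

from azure.ai.ml.entities import Workspace

ws = Workspace(
    name="mlw-example",                      # assumed workspace name
    display_name="Example workspace",
    location="eastus",                       # assumed region
    description="Workspace created from the Python SDK",
)
ml_client.workspaces.begin_create(ws)        # ml_client is an authenticated MLClient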

4
Q

How to give workspace access to others?

A

Use role-based access control (RBAC)

5
Q

What are the three general built-in roles?

A

Owner, Contributor, Reader

6
Q

Owner vs. Contributor?

A

A Contributor can do everything an Owner can except grant access to others (manage role assignments)

7
Q

What are the two built-in roles in AML?

A

AML Data Scientist: can perform all actions in the workspace except altering or deleting compute resources and editing workspace settings. AML Compute Operator: can create, change, and manage access to the compute resources

8
Q

What are the 4 different computes in AML workspace?

A

Compute Instance, Compute Cluster, Inference Cluster, Attached Compute

9
Q

Compute Instance details

A

A virtual machine managed by the workspace; good for development and small workloads

10
Q

Compute cluster details

A

Workspace-managed, on-demand clusters of CPU or GPU nodes that scale automatically

11
Q

Inference Cluster details

A

Azure Kubernetes Service cluster for deployed ML models in production

12
Q

Attached compute details

A

Lets you attach compute managed outside the workspace, such as Databricks clusters or Spark pools

13
Q

What are AML assets

A

Models, environments, data, and components

14
Q

Why do you want to register your model in the workspace

A

It will be available in your workspace as opposed to only your local computer, and it gets a version number

15
Q

What are environments for

A

Environments provide the components needed to execute the code, such as packages or environment variables. Environments need a name and a version to be created

16
Q

What do you need to create a data asset

A

Name, version, and path to the asset

17
Q

What’s the point of components

A

They let you reuse and share code. To create one, you need a name, version, code, and environment

18
Q

What are the 4 options for model training in AML

A

Designer, Automated ML, Jupyter notebook, run a script as a job

19
Q

What’s good about designer

A

Low-code, drag-and-drop components that are easy to manage and visualize

20
Q

What’s good about Automated ML?

A

Automated ML iterates through algorithms and hyperparameters to find the best selection for your use case

21
Q

What are the different types of jobs?

A

Command, sweep, pipeline

22
Q

What is a command job

A

Job that executes a single script

23
Q

What is a sweep job

A

Command job with hyperparameter tuning

24
Q

What is a pipeline job

A

A job composed of multiple steps

25
Q

ML studio: Author tab options?

A

Notebooks, Automated ML, Designer

26
Q

ML studio: Assets tab options?

A

Data, Jobs, Components, Pipelines, Environments, Models, Endpoints

27
Q

ML studio: Manage tab options?

A

Compute, Linked Services, Data Labeling

28
Q

What do you need to authenticate into a workspace?

A

subscription_id, resource_group, workspace_name
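
A minimal sketch of connecting with DefaultAzureCredential; the subscription, resource group, and workspace values are placeholders:

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription_id>",
    resource_group_name="<resource_group>",
    workspace_name="<workspace_name>",
)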

29
Q

How is MLClient() used after authenticating?

A

You call MLClient methods whenever you interact with the workspace, for example to create or update assets or resources

30
Q

What can you do on the azure cli?

A

Basically, manage anything. It is also good for automating tasks. See: https://learn.microsoft.com/en-us/cli/azure/ml?view=azure-cli-latest

31
Q

What are the params of a command job?

A

code: path to the folder containing the training script, command: the command that runs the script, environment: the environment to run the script in, compute: the compute target to run on, display_name: name of the job, experiment_name: name of the experiment the job belongs to
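
A minimal sketch of a command job; the script folder, environment, and compute names are assumptions:

from azure.ai.ml import command

job = command(
    code="./src",                                                   # assumed folder containing train.py
    command="python train.py",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated environment
    compute="aml-cluster",                                          # assumed compute cluster name
    display_name="train-model",
    experiment_name="training-experiment",
)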

32
Q

How to create a job with MLClient()

A

ml_client.create_or_update(job), where ml_client is an authenticated MLClient instance

33
Q

What is a URI?

A

Uniform Resource Identifier

34
Q

What are the common URI protocols?

A

http(s): public/private Azure Blob Storage or a public web location, abfs(s): Azure Data Lake Storage Gen2, azureml: a datastore in Azure ML

35
Q

What are the two authentication methods for datastores?

A

Credential-based: uses a service principal, shared access signature (SAS), or account key. Identity-based: uses your Azure Active Directory identity or a managed identity

36
Q

what are the params of AzureBlobDatastore?

A

name, description, account_name, container_name, credentials

37
Q

What are the benefits of using data assets?

A

Share and reuse data with other members, seamlessly access data during model training, version the metadata of the data asset

38
Q

What are the 3 main types of data assets?

A

URI file: points to a specific file. URI folder: points to a folder. MLTable: points to a folder or file and includes a schema to read it as tabular data

39
Q

When creating a URI file data asset, what are the supported paths?

A

local: ./<path>
Azure Blob Storage: wasbs://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>
Azure Data Lake Storage Gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>
Datastore: azureml://datastores/<datastore_name>/paths/<folder>/<file>

40
Q

What are the params of the Data() class for making a data asset?

A

path to the item, type of item: uses AssetTypes class, description, name, version
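
A minimal sketch of creating a URI file data asset; the local path and asset name are assumptions:

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="./data/sample.csv",            # assumed local path
    type=AssetTypes.URI_FILE,
    description="Example data asset",
    name="sample-data",                  # assumed asset name
    version="1",
)
ml_client.data.create_or_update(my_data)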

41
Q

How to use argparse for a file data asset?

A

Create an argparse.ArgumentParser(), add an argument for the input path and specify its type, then call parse_args(). The parsed value (e.g. args.input_data) is a path you can read with whatever function suits the file
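
A minimal sketch of such a training-script snippet; the argument name and the use of pandas are assumptions:

import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)    # path the data asset is mounted/downloaded to
args = parser.parse_args()

df = pd.read_csv(args.input_data)                # read the file like any local file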

42
Q

How to create a URI folder data asset?

A

Data(path, type=AssetTypes.URI_FOLDER, description, name, version), then ml_client.data.create_or_update(<the data asset>)

43
Q

How to use argparse for a folder data asset?

A

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()
all_files = glob.glob(args.input_data + "/*.csv")
This builds a list of the CSV files in the folder

44
Q

How to create a MLTable data asset?

A

You need a schema definition so you don’t have to redefine the schema every time you read the data; it’s a YAML (MLTable) file stored with the data. Then do Data(path, type=AssetTypes.MLTABLE, description, name, version)

45
Q

How to use argparse for a mltable asset?

A

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()
tbl = mltable.load(args.input_data)
df = tbl.to_pandas_dataframe()

46
Q

how to list datastores?

A

stores = ml_client.datastores.list()
for ds in stores:
    print(ds.name)

47
Q

How to make a datastore?

A

from azure.ai.ml.entities import <datastore_type>
store = <datastore_type>(name, description, account_name, container_name, credentials)
ml_client.create_or_update(store)
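
A minimal sketch using AzureBlobDatastore with an account key; the account, container, and key values are placeholders:

from azure.ai.ml.entities import AzureBlobDatastore, AccountKeyConfiguration

store = AzureBlobDatastore(
    name="blob_example",
    description="Datastore pointing to a blob container",
    account_name="<storage_account_name>",
    container_name="<container_name>",
    credentials=AccountKeyConfiguration(account_key="<account_key>"),
)
ml_client.create_or_update(store)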

48
Q

What are the three main AssetTypes?

A

AssetTypes.URI_FILE, AssetTypes.URI_FOLDER, AssetTypes.MLTABLE

49
Q

When using ml_client to create or update a data asset, how should the command look?

A

ml_client.data.create_or_update(<data_asset>)

50
Q

What do you need to create a compute instance?

A

A unique name and a size
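
A minimal sketch; the instance name and VM size are assumptions:

from azure.ai.ml.entities import ComputeInstance

ci = ComputeInstance(
    name="ci-example-001",        # unique name
    size="STANDARD_DS3_v2",       # assumed VM size
)
ml_client.begin_create_or_update(ci)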

51
Q

What should you do if you keep forgetting to turn off your compute?

A

Schedule it to shut off at the end of the day

52
Q

What are the params for a compute cluster

A

AmlCompute(name, type, size, location, min_instances, max_instances, idle_time_before_scale_down, tier)
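
A minimal sketch; the cluster name, size, location, and limits are assumptions:

from azure.ai.ml.entities import AmlCompute

cluster = AmlCompute(
    name="aml-cluster",
    type="amlcompute",
    size="STANDARD_DS11_V2",           # assumed VM size
    location="westus2",                # assumed region
    min_instances=0,
    max_instances=2,
    idle_time_before_scale_down=120,   # seconds
    tier="low_priority",
)
ml_client.begin_create_or_update(cluster)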

53
Q

what does the tier param mean for a compute cluster?

A

It specifies whether you have priority on the compute cluster. Low priority may be cheaper, but you also may not get your cluster. It’s like need vs. want when rolling for loot in an MMO.

54
Q

What are the three main scenarios where you want a compute cluster?

A

Running a pipeline job from designer, running an automated ML job, running a script as a job.

55
Q

After you have created a compute cluster, what three things can you change about it?

A

the minimum number of nodes, the maximum number of nodes, and the idle time before scaling down

56
Q

How to view all environments?

A

for env in ml_client.environments.list():
    print(env.name)

57
Q

How to view environment details?

A

env = ml_client.environments.get(name="<environment_name>", version="<version_number>")
print(env.description, env.tags)

58
Q

What to do when you have a docker image you want to use, but need a few more packages?

A

Add a conda specification file that lists the extra dependencies you need, and build the environment from the Docker image plus the conda file
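
A minimal sketch combining a base Docker image with a conda specification file; the image and file path are assumptions:

from azure.ai.ml.entities import Environment

env = Environment(
    name="docker-image-plus-conda-example",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # assumed base image
    conda_file="./conda-env.yml",                                 # conda spec with the extra packages
    description="Environment created from a Docker image plus a conda spec",
)
ml_client.environments.create_or_update(env)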

59
Q

What is scaling and normalization?

A

Actions on data that put all columns on the same scale so that no single column has an outsized influence on model training

60
Q

How to configure an automl job?

A

automl.<model_type>(compute, experiment_name, training_data, target_column_name, primary_metric, n_cross_validations, enable_model_explainability)
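
A minimal sketch of a classification AutoML job; the compute name, MLTable asset, and target column are assumptions:

from azure.ai.ml import automl, Input
from azure.ai.ml.constants import AssetTypes

classification_job = automl.classification(
    compute="aml-cluster",                                                       # assumed compute cluster
    experiment_name="automl-classification",
    training_data=Input(type=AssetTypes.MLTABLE, path="azureml:training-data:1"),  # assumed MLTable asset
    target_column_name="target",                                                 # assumed label column
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
)
returned_job = ml_client.jobs.create_or_update(classification_job)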

61
Q

What sort of data asset does automl need as an input?

A

Automl needs an MLTable as input

62
Q

how to look up classification primary metrics for automl?

A

from azure.ai.ml.automl import ClassificationPrimaryMetrics
list(ClassificationPrimaryMetrics)

63
Q

What sort of limits can you set on an automl job?

A

timeout_minutes: max time for the complete experiment, trial_timeout_minutes: max time for one trial, max_trials: max number of models to be trained, enable_early_termination: whether to end if the score isn’t improving in the short term, max_concurrent_trials: limits how many trials run in parallel on the compute cluster
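
A minimal sketch of setting those limits on the classification job above; the values are arbitrary:

classification_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=20,
    max_trials=5,
    max_concurrent_trials=2,
    enable_early_termination=True,
)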

64
Q

What are the data guardrails for classification models in automl?

A

Class balancing detection, missing feature values imputation, high cardinality feature detection

65
Q

What do you need for a training pipeline?

A

You need the scripts that prepare the data and train the model, the YAML files that define those scripts as components so they can run, and then you build the pipeline from the components and run it

66
Q

What are discrete hyperparameters?

A

Hyperparameters that have a finite set of values

67
Q

What are continuous hyperparameters?

A

Hyperparameters that can use any values along a scale, resulting in an infinite number of possibilities

68
Q

How do you set up a discrete hyperparameter with Choice()?

A

You can use a python list, a range, or an arbitrary list of comma-separated values
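
A minimal sketch of search-space expressions mixing discrete and continuous values; the hyperparameter names are assumptions:

from azure.ai.ml.sweep import Choice, Normal

batch_size = Choice(values=[16, 32, 64])              # discrete: explicit list
epochs = Choice(values=list(range(10, 50, 10)))       # discrete: values generated from a range
learning_rate = Normal(mu=10, sigma=3)                # continuous distribution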

69
Q

What are the discrete distributions that are available?

A

QUniform, QLogUniform, QNormal, QLogNormal

70
Q

What are the continuous distributions that are available?

A

Uniform, LogUniform, Normal, LogNormal

71
Q

What are the three main types of sampling?

A

Grid, Random, and Bayesian
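
A minimal sketch of turning a command job into a sweep job with random sampling; the command job (command_job_for_sweep), metric name, and compute name are assumptions:

# command_job_for_sweep is a command job whose inputs include the search-space expressions
sweep_job = command_job_for_sweep.sweep(
    compute="aml-cluster",
    sampling_algorithm="random",       # or "grid" / "bayesian"
    primary_metric="Accuracy",         # must match a metric logged by the training script
    goal="Maximize",
)
returned_sweep_job = ml_client.create_or_update(sweep_job)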

72
Q

What is grid sampling?

A

Tries every possible combination

73
Q

What is random sampling?

A

Randomly chooses values from the search space

74
Q

What is Bayesian sampling?

A

Chooses new values based on previous results

75
Q

What is Sobol Sampling?

A

Random but with a seed so you can reproduce results

76
Q

What are some limitations of Bayesian sampling?

A

You can only use choice, uniform, and quniform parameter expressions, and you can’t use an early-termination policy.

77
Q

What is an early termination policy for?

A

So you don’t waste compute and time endlessly hyperparameterizing

78
Q

What are the two main parameters of an early termination policy?

A

evaluation_interval and delay_evaluation

79
Q

What is evaluation_interval?

A

basically how often you want to perform a termination check

80
Q

What is delay_evaluation?

A

Basically delay termination checks for a minimum number of intervals

81
Q

What are the three options for early termination policies?

A

Bandit policy, Median stopping policy, and truncation selection policy

82
Q

What is bandit policy?

A

You specify a slack amount, and a trial is terminated when its performance is worse than the best-performing run so far by more than that slack amount. You can also specify a slack factor, which works as a ratio instead of a flat number.
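
A minimal sketch of attaching a bandit policy to the sweep job above; the values are arbitrary:

from azure.ai.ml.sweep import BanditPolicy

sweep_job.early_termination = BanditPolicy(
    slack_amount=0.2,          # allowed gap from the best run so far
    delay_evaluation=5,        # skip the first 5 intervals
    evaluation_interval=1,     # check after every interval
)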

83
Q

What is median stopping policy?

A

It terminates a trial when its primary metric is worse than the median of the running averages across all trials so far

84
Q

What is truncation selection policy?

A

You set a truncation percentage. At each check, a trial is cancelled if its performance falls within the worst-performing X% of trials so far. EX: if the percentage is 20%, a trial is terminated if it is among the worst 20% of the models thus far.