AWS ML Associate Flashcards by David Blocher

Performance metric: Measure the imbalance of positive outcomes between different facet values.

Difference in proportions of labels (DPL)
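
As a rough illustration of the metric, DPL can be computed as the gap between the positive-label rates of two facets (q_a - q_d). This is a minimal sketch: the function name and the toy labels are hypothetical, and it does not use the SageMaker Clarify API.

```python
def difference_in_proportions_of_labels(labels, facets, advantaged):
    """Return q_a - q_d, the gap in positive-label rates between two facets."""
    # Split binary labels (1 = positive outcome) by facet membership.
    a = [y for y, f in zip(labels, facets) if f == advantaged]
    d = [y for y, f in zip(labels, facets) if f != advantaged]
    if not a or not d:
        raise ValueError("both facets need at least one sample")
    return sum(a) / len(a) - sum(d) / len(d)


# Toy data (made up): facet "a" gets 3 of 4 positives, facet "d" gets 1 of 4.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
facets = ["a", "a", "a", "a", "d", "d", "d", "d"]
dpl = difference_in_proportions_of_labels(labels, facets, advantaged="a")
print(dpl)  # 0.5
```

A value near 0 suggests balanced positive outcomes; values toward +1 or -1 suggest one facet receives positive labels far more often.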

Performance metric: Identify the difference in the predicted outcome as an input feature changes.

Partial dependence plots (PDPs)
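
The idea behind a PDP can be sketched by hand: force one feature to each value on a grid, average the model's predictions over the dataset, and plot the averages. The toy model and data below are hypothetical and stand in for a real trained model.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the dataset is not mutated
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    # Hypothetical model used only for illustration: 2 * x0 + x1.
    return 2.0 * row[0] + row[1]


rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
pd_curve = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(pd_curve)  # [3.0, 5.0, 7.0]
```

Plotting `pd_curve` against the grid gives the partial dependence plot for feature 0.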

Performance metric: Quantify the contribution of each feature in a prediction.

Shapley values
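
For intuition, exact Shapley values can be computed by averaging each feature's marginal contribution over every ordering of the features. This brute-force sketch only scales to a handful of features, and the payoff function is a made-up example; tools such as SageMaker Clarify rely on approximations instead.

```python
from itertools import permutations


def shapley_values(value_fn, n_features):
    """Average marginal contribution of each feature over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        coalition = set()
        previous = value_fn(frozenset(coalition))
        for feature in order:
            coalition.add(feature)
            current = value_fn(frozenset(coalition))
            totals[feature] += current - previous
            previous = current
    return [t / len(orderings) for t in totals]


def toy_value(coalition):
    # Hypothetical payoff: feature 0 adds 10, feature 1 adds 20,
    # and having both adds an interaction bonus of 6.
    value = 0.0
    if 0 in coalition:
        value += 10
    if 1 in coalition:
        value += 20
    if 0 in coalition and 1 in coalition:
        value += 6
    return value


phi = shapley_values(toy_value, 2)
print(phi)  # [13.0, 23.0]
```

The contributions sum to the full-coalition payoff (36), and the interaction bonus is split evenly between the two features.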

What should you use for data processing if it involves TensorFlow or PyTorch?

SageMaker

What is the simplest way to prevent internet and data access to inference containers?

SageMaker network isolation mode

Create a baseline to monitor feature attribution drift in a SageMaker model. For instance, you want it to weigh personal income over credit history for loan approval. How do you do this?

Create a SHAP baseline using the ‘ModelExplainabilityMonitor’ class. This generates a feature attribution baseline, and the monitor raises an alert when the observed feature attributions drift away from it.

Tool used to detect bias and provide explainability for datasets and models

SageMaker Clarify

Used to visualize and analyze intermediate tensors, for example to identify specific poor classifications in a CNN and make adjustments that improve model performance.

SageMaker with TensorBoard

How do you strip PII from text-based user interactions?

Amazon Comprehend

RNN training: Exploding gradients causing a convergence issue. What feature can help address this issue?

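SageMaker Debugger. Its built-in rules (such as exploding tensor and vanishing gradient) flag the problem during training. SageMaker Training Compiler is a different feature: it optimizes DL models to accelerate training by using ML GPU instances more efficiently, and it does not address convergence.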
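
Whatever tooling is used to spot the problem, a common mitigation for exploding gradients in RNNs is gradient-norm clipping. The sketch below shows the arithmetic on a plain list of gradient values; the function name is hypothetical, and deep learning frameworks ship their own clipping utilities.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients), norm   # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients], norm


# Toy gradient with norm 50, clipped down to norm 5.
clipped, original_norm = clip_by_global_norm([30.0, 40.0], max_norm=5.0)
print(original_norm, clipped)
```

Clipping keeps the gradient direction and only shrinks its magnitude, which is why it tends to stabilize training without changing what the update is trying to do.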

What instance types are supported by AWS Neuron SDKs for real-time inference on streaming video?

Inferentia instances (Inf2 family)

What is used to centralize and standardize model documentation?

SageMaker Model Cards

SageMaker Serverless Inference: What is the biggest consideration when deciding whether to use provisioned concurrency?

Low latency (avoiding cold starts)

(CloudWatch) What feature in the Logs Insights page is helpful in finding infrastructure monitoring through-lines in your query results?

The Patterns tab

What is the primary purpose of Capacity Blocks for machine learning (ML)?

Reserve GPU instances for short-duration machine learning workloads on a future date.

When using an embedded question to query a vector database for RAG, what should be returned?

The full text - not embeddings - of the nearest neighbor documents to enhance the query
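
A minimal sketch of that retrieval step, assuming a tiny in-memory store with made-up three-dimensional embeddings: the query embedding is used only for ranking, and what comes back is the text of the nearest documents, ready to be placed in the prompt. A real system would use an embedding model and a vector database instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve(query_embedding, store, k=2):
    """Rank documents by similarity and return their text, not their embeddings."""
    ranked = sorted(
        store,
        key=lambda item: cosine(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


# Hypothetical documents and embeddings, for illustration only.
store = [
    {"text": "Refunds are issued within 5 days.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on Sundays.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Refund requests need an order ID.", "embedding": [0.8, 0.3, 0.1]},
]
context = retrieve([1.0, 0.2, 0.0], store, k=2)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: How do refunds work?"
print(prompt)
```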

How can you use SageMaker Model Monitor to re-train your model?

Enable Data Capture, and use that data to retrain the model.

Exploratory data visualization for relationship analysis that can reveal hidden patterns, such as an increase in specific item purchases or periods of frequent transactions

Heat Map

Exploratory data visualization that helps with distribution analysis by binning ranges of data

Histogram
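
The binning behind a histogram can be sketched in a few lines. The bin edges and transaction amounts are made-up values; plotting libraries do the same counting before drawing the bars.

```python
def histogram(values, bin_edges):
    """Count values per bin; bins are [lo, hi) except the last, which is [lo, hi]."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


amounts = [3, 7, 12, 18, 22, 25, 29, 30]
counts = histogram(amounts, [0, 10, 20, 30])
print(counts)  # [2, 2, 4]
```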

Which storage service should be used when you need concurrent access from multiple Amazon EC2 instances to a Windows file server for distributed training of a model?

Amazon FSx for Windows File Server

Best way to improve Kinesis ingestion performance

Batching
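
One way to batch on the producer side is to group records before sending them, since the Kinesis PutRecords API accepts up to 500 records per call. The sketch below only shows the grouping; the helper name and record contents are hypothetical, and the actual send (for example with boto3's `put_records`) is left as a comment.

```python
def batch_records(records, max_batch_size=500):
    """Yield consecutive slices of at most max_batch_size records."""
    for start in range(0, len(records), max_batch_size):
        yield records[start:start + max_batch_size]


# Hypothetical events shaped like PutRecords entries.
records = [
    {"Data": f"event-{i}".encode(), "PartitionKey": str(i % 4)}
    for i in range(1200)
]
batches = list(batch_records(records))
print([len(b) for b in batches])  # [500, 500, 200]

# Each batch would then go out in a single request, e.g.:
#   kinesis_client.put_records(StreamName="my-stream", Records=batch)
# where kinesis_client and "my-stream" are placeholders.
```

Sending three requests instead of 1,200 cuts per-request overhead, which is where most of the ingestion gain comes from.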

What type of variables should be on the x-axis of a bar chart?

Categorical variables

Data format for ingesting streamed, unstructured data

JSON lines
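
JSON Lines stores one self-contained JSON object per line, so a consumer can parse records as they arrive and each record can carry different fields. A small sketch with made-up events:

```python
import json

# Two hypothetical streamed events; note the second has an extra field.
stream = (
    '{"user": "a1", "event": "click"}\n'
    '{"user": "b2", "event": "purchase", "items": ["book"]}\n'
)

# Each non-empty line is parsed independently of the others.
records = [json.loads(line) for line in stream.splitlines() if line.strip()]
print(len(records), records[1]["items"])
```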

When to choose FSx for Lustre over EFS?

Only when very high performance, high data volumes, and extremely low latency are required. EFS is the go-to for distributed training, and S3 is the go-to for general storage.

Which text feature engineering technique would categorize customer feedback as positive or negative?

N-gram
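
N-grams help with sentiment because they keep short word sequences together, so a phrase like "not good" survives as one feature instead of two unrelated words. A minimal sketch with a hypothetical helper and a simple whitespace tokenizer:

```python
def ngrams(text, n):
    """Return the list of n-word sequences in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("The service was not good", 2)
print(bigrams)  # ['the service', 'service was', 'was not', 'not good']
```

A classifier trained on these features can then learn that "not good" signals negative feedback.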

How can you reduce the dimensionality of the data while retaining most of the variation?

Principal Component Analysis (PCA)
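
To see what "retaining most of the variation" means, this sketch computes the share of total variance captured by the first principal component of a small 2-D dataset, using the closed-form eigenvalues of the 2x2 covariance matrix. The points are made up, and a real workflow would use a library implementation or the SageMaker PCA algorithm.

```python
import math


def first_component_variance_ratio(points):
    """Fraction of total variance explained by the first principal component (2-D only)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]] from trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    gap = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    largest = (trace + gap) / 2.0
    return largest / trace


# Two strongly correlated features: one component carries almost all the variance.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
ratio = first_component_variance_ratio(points)
print(round(ratio, 3))
```

When the ratio is close to 1, the two features can be replaced by a single component with little loss of information.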

How can you save time by storing curated features that can be accessed to train new models?

SageMaker Feature Store

When to use EMR over Glue?

Real-time or streaming workloads. Glue is primarily for batch ETL.

What ETL solution allows you to do anomaly detection on real-time data?

Apache Spark on EMR

How do you address class imbalances in text-based datasets?

text-based data augmentation, like synonym replacement, or text paraphrasing
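
A minimal sketch of synonym replacement for a minority class, assuming a tiny hand-written synonym table (hypothetical; real pipelines typically draw on a thesaurus or a paraphrasing model):

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"bad": ["poor", "terrible"], "slow": ["sluggish"]}


def augment(sentence, rng):
    """Replace each word that has known synonyms with a randomly chosen one."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        out.append(rng.choice(options) if options else word)
    return " ".join(out)


rng = random.Random(0)  # seeded so runs are repeatable
minority = ["delivery was slow", "bad packaging"]
augmented = minority + [augment(s, rng) for s in minority]
print(augmented)
```

The minority class doubles in size while each new sample keeps the original label and meaning.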

How can you use AWS Glue Data Quality to assess data before training an ML model?

data validation rules

What does a Class Imbalance metric of .9 mean?

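It means that the advantaged group (facet a) is heavily overrepresented in the data. CI ranges from -1 to +1, and values near +1 mean facet a has far more samples than the disadvantaged group (facet d).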
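
The metric itself is a simple ratio of sample counts, CI = (n_a - n_d) / (n_a + n_d). The counts below are made up to reproduce a value of 0.9.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to +1."""
    return (n_a - n_d) / (n_a + n_d)


# 950 samples in facet a versus 50 in facet d.
ci = class_imbalance(950, 50)
print(ci)  # 0.9
```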

Which data formats do most Amazon SageMaker algorithms support for training?

CSV and RecordIO-protobuf

What is used to compare the distribution of labels in your data to the expected proportions?

Difference in Proportion of Labels (DPL)

Which built-in algorithm is used for text classification and Word2Vec?

BlazingText

Which built-in algorithm is a great choice for a supervised text translation model?

Sequence-to-Sequence algorithm

Which deep learning frameworks are supported in Amazon SageMaker reinforcement learning (RL)? (2)

TensorFlow and Apache MXNet

Difference between Lex v2 and Kendra?

Lex is a chat bot, Kendra is natural language search.

What approach is referred to as script mode when using Amazon SageMaker?

Using a prebuilt framework container and its dependencies, while providing your own custom training script.

Training: What is the most appropriate data ingestion mode for a large data set of historical data?

Pipe mode. File mode downloads the entire dataset to the training instance before training starts, so it is less performant than streaming with Pipe mode. Fast File mode also streams from S3 and works best when the data is read sequentially.

The model has billions of parameters, and training it on a single GPU would be infeasible due to memory constraints. How do you fix this?

Model parallelism

Main Use-Cases for Trainium instances?

large language models and natural language processing (NLP) training

What is it called when you use multiple models of different types and aggregate their predictions in a heterogeneous model group?

Stacking

Which hyperparameter tuning method is best for finding optimum hyperparameter values with limited compute resources?

Hyperband

CNN is failing to generalize well, although it performs well on training data. What method can be used to help it adapt to unforeseen patterns?

dropout
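
A sketch of how dropout works, using the common "inverted dropout" formulation on a plain list of activations: during training each unit is zeroed with probability `rate` and the survivors are scaled up so the expected activation stays the same, while at inference the layer is a no-op. The function and values are illustrative; frameworks provide their own dropout layers.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)          # inference: pass through unchanged
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)  # seeded so runs are repeatable
hidden = [1.0] * 10
train_out = dropout(hidden, 0.5, rng)
eval_out = dropout(hidden, 0.5, rng, training=False)
print(train_out)
print(eval_out)
```

Because a different random subset of units is dropped on every step, the network cannot rely on any single unit, which is what helps it generalize beyond the training data.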

(45 cards)