AWS ML Flashcards

1
Q

What configuration must be set to allow for NVIDIA GPU training?

A

containers must be ‘nvidia-docker’ compatible

2
Q

What is K-NN?

A

K-Nearest Neighbors

3
Q

What does it likely mean if a K-NN algorithm is producing low accuracy and precision, despite hyperparameter adjustments?

A

The numerical range differences of the variables are too high. Normalizing these numeric values can keep high magnitudes from dominating the model.
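
A minimal sketch of that normalization step, assuming min-max scaling (the card does not name a specific method, and standardization is another common choice). The feature names and values are made up for illustration.

```python
def min_max_scale(rows):
    """Rescale each column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread, so map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical rows: (age in years, income in dollars).
raw = [[25, 40_000], [35, 90_000], [45, 65_000]]
scaled = min_max_scale(raw)
```

After scaling, both columns span 0 to 1, so income no longer dominates the distance calculation that K-NN relies on.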

4
Q

What does it mean when, as epochs increase, the training error rate keeps decreasing but the model generalizes poorly?

A

Your model is overfitting

5
Q

What are L1 and L2 regularization used for?

A

Preventing overfitting in training (linear models)
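
A rough sketch of how the two penalties enter a loss function, assuming a flat list of weights and hypothetical strength parameters `lam_l1` and `lam_l2` (names chosen here, not taken from any library). L1 adds the sum of absolute weights, L2 adds the sum of squared weights.

```python
def l1_penalty(weights, lam):
    """L1 (lasso-style) penalty: lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (ridge-style) penalty: lam * sum of w^2."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam_l1=0.0, lam_l2=0.0):
    """Training loss plus both penalties; larger weights cost more."""
    return data_loss + l1_penalty(weights, lam_l1) + l2_penalty(weights, lam_l2)


weights = [1.0, -2.0, 0.0]
total = regularized_loss(1.0, weights, lam_l1=0.5, lam_l2=0.5)
```

Because large weights raise the total loss, the optimizer is pushed toward simpler models, which is how both penalties work against overfitting.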

6
Q

Pulling data from Neptune, what algorithm can predict user preferences based on the patterns observed in other users?

A

Collaborative Filtering -

This uses (user, item, rating) tuples to leverage other users' experiences. This is superior to content-based filtering.

7
Q

What data format can help maximize the efficiency of queries?

A

Apache Parquet

8
Q

Which AWS service should be used to migrate data from an on-prem MySQL DB to S3?

A

Database Migration Service

9
Q

What algorithm should be used to predict the sales of certain products based on time series data?

A

SageMaker DeepAR

10
Q

If you’re using Comprehend to analyze sentiments about products, and you wish to add a specific category for entity names, because they’re all currently labeled “Commercial Item”, what can you use?

A

A “Custom Entity Recognition model”

11
Q

What can be used for determining themes/topics from a collection of documents?

A

Topic Modeling

12
Q

In what situation would you want to increase the dropout rate at the hidden layer?

A

When you are experiencing overfitting in a Neural Network model.
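
For intuition, here is a small sketch of what a dropout layer does during training, assuming the common "inverted dropout" variant (the card does not specify one). The function name and the use of a seeded `random.Random` are illustrative choices.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Dividing by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, rate=0.5, rng=rng)
```

A higher rate silences more units per step, so the network cannot lean on any single unit, which reduces overfitting. At inference time dropout is switched off.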

13
Q

In Polly, how do you introduce proprietary pronunciations for your application?

A

pronunciation lexicons

14
Q

What metric is used to optimize for true positives in a classification model?

A

Area under the ROC curve (AUC)

15
Q

What is the best metric for assessing the accuracy of a regression model?

A

The Root Mean Squared Error (RMSE)
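
As a worked illustration, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below uses made-up numbers and a hypothetical helper name.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Every prediction is off by exactly 2, so the RMSE is 2.0.
error = rmse([10, 20, 30, 40], [12, 18, 32, 38])
```

Squaring before averaging means large misses count more than small ones, and the result is in the same units as the target variable.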

16
Q

What service can provide human review of low-confidence AI/ML predictions?

A

Amazon Augmented AI

16
Q

What is the major drawback of the Kinesis Producer Library? (2)

A

It can only write to Kinesis Data Streams, not read from them.

You should use Kinesis Client Library if reads are needed in your custom processing solution.

17
Q

What storage solution can significantly speed up training steps for models with large datasets in S3?

A

FSx for Lustre

18
Q

What algorithm should you use for a supervised classification task (i.e. desired classifications are provided in the training data)?

A

XGBoost with the “objective” hyperparameter set to “multi:softmax”

19
Q

What is the simplest way to use DynamoDB data in a SageMaker Jupyter instance?

A

Use Data Pipeline to export the data to the appropriate S3 location.

20
Q

What strategy can help improve the generalizability of a Natural Language Processing (NLP) model?

A

download a pre-trained word embedding

21
Q

What kind of algorithm would you use to categorize text documents into undefined categories (i.e. an unsupervised categorization)?

A

A Latent Dirichlet Allocation (LDA) algorithm

22
Q

What Redshift feature can help with the direct streaming of data to Redshift? What AWS service is it designed to be compatible with?

A

Redshift Streaming ingestion

Amazon Kinesis Data Streams (NOT Firehose)

23
Q

What is the main difference between linear regression and logistic regression?

A

Linear regression predicts a continuous range of values, while logistic regression predicts a binary (yes/no) outcome.

24
Q

What is the easiest way to improve the generalizability of a binary classification model?

A

See how adjusting the “score” threshold affects the model performance

25
Q

What is the simplest and cheapest AWS-native way to achieve a recommender system from data held in a Redshift cluster?

A

in-database local inference using Redshift ML

26
Q

What is a visualization technique to determine whether a regression model is over- or under-estimating compared to true values?

A

Residual plot.

Positive residual - underestimation
Negative residual - overestimation
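
A small sketch of the sign convention, assuming residual = actual minus predicted (the usual definition, and the one consistent with this card). The numbers are made up.

```python
def residuals(actual, predicted):
    """Residual for each point: actual value minus predicted value."""
    return [a - p for a, p in zip(actual, predicted)]


def describe(residual):
    """Label a residual using the sign convention above."""
    if residual > 0:
        return "underestimate"
    if residual < 0:
        return "overestimate"
    return "exact"


actual = [10, 20, 30]
predicted = [8, 25, 30]
res = residuals(actual, predicted)
labels = [describe(r) for r in res]
```

Plotting `res` against `predicted` gives the residual plot; points mostly above zero suggest the model is underestimating, points mostly below zero suggest it is overestimating.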

27
Q

What are the four most common data formats used in SageMaker built-in algorithms?

A

CSV
recordIO-protobuf
image files (jpg, png)
text (for BlazingText)

28
Q

Between Firehose and Glue, which service can convert CSV to Apache Parquet with the least overhead?

A

Glue.

Firehose can natively convert JSON files to Parquet, but not CSV files.

29
Q

If you have already applied Principal Component Analysis, what is the best way to reduce the dimensionality of your data?

A

t-distributed stochastic neighbor embedding (t-SNE)

30
Q

When there are missing values in a column of data, what is the best way to treat the data to produce the best representation possible?

A

Multiple Imputation by Chained Equations (MICE)

31
Q

What two steps are needed to run a SageMaker TensorFlow ML project locally?

A
  1. pull the docker container
  2. install the sagemaker SDK for local development “pip install -U sagemaker”
32
Q

How can you automatically reinstall external libraries when restarting a stopped SageMaker instance?

A

Use a lifecycle configuration script to bootstrap the package installation

33
Q

What can you use to create a forecast model that takes into account seasonality, linked to similar datasets, and weather conditions with the LEAST amount of development?

A

Amazon Forecast’s DeepAR+ algorithm with the built-in Weather Index featurization

34
Q

How can you greatly reduce the cost of long-running training times on a SageMaker instance?

A

Managed Spot Training with a checkpoint configuration

(allows training to stop and restart from a known checkpoint)

35
Q

What kind of algorithm specializes in generating summaries (audio, text, translation)?

A

seq2seq with encoder-decoder architecture

36
Q

Big difference between Object2vec and BlazingText classification when trying to label objects containing text?

A

Object2vec can use sentence embedding to untangle complex relationships between groups of objects, clustering them into nearest neighbor groups downstream before classifying them.

BlazingText text classification simply places a single object into one of several predefined categories.

37
Q

Algorithm for detecting multiple objects in a provided image

A

Object Detection

38
Q

What protocol does Kinesis Video Streams use to ingest video?

A

Real-Time Streaming Protocol (RTSP)

39
Q

What approach should you use when optimizing the classification threshold score?

A

Evaluate model using Receiver Operating Characteristic Curve (ROC)
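
To make the idea concrete, the sketch below computes the points of a ROC curve by hand: for each candidate threshold it reports the false positive rate and true positive rate. The labels, scores, and thresholds are made up, and the helper name is hypothetical.

```python
def roc_points(labels, scores, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # A score at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


labels = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.2]
points = roc_points(labels, scores, thresholds=[0.5, 0.8])
```

Raising the threshold from 0.5 to 0.8 removes the false positive here but also misses one true positive, which is the trade-off the ROC curve visualizes.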

40
Q

T/F - increasing the learning rate can help a model with generalization (reduce overfitting) in production

A

False - The learning rate will only impact generalization on training data

41
Q

When a model is performing well in training, but not generalizing well in production, what 2 methods can be used to reduce model flexibility?

A
  1. Feature selection (use fewer features)
  2. Increase the amount of regularization used
42
Q

You want to reduce the size of a deep learning model. What is it called when you remove low-weight parameters while keeping the same architecture?

A

Model Pruning
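
A toy sketch of one common form of pruning, magnitude pruning, assuming a flat list of weights (real frameworks prune tensors, often layer by layer or filter by filter). The function name and `fraction` parameter are illustrative.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.9, -0.05, 0.4, 0.01]
pruned = prune_by_magnitude(weights, fraction=0.5)
```

The list keeps its length and positions, so the architecture is unchanged; only the least influential parameters are zeroed.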

43
Q

Formula for precision

A

TP/(TP+FP) - calculates the quality of all things labeled positive

44
Q

Formula for recall

A

TP/(TP+FN) - calculates percentage of all positive cases captured.
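
A quick worked example of this formula alongside the precision formula TP/(TP+FP), using made-up confusion-matrix counts. Returning 0.0 when the denominator is zero is a convention chosen for this sketch.

```python
def precision(tp, fp):
    """TP / (TP + FP): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN): share of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
```

With these counts, most positive predictions are right (high precision), yet half of the real positives are missed (lower recall).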

45
Q

2 ways for creating an Inception neural network for image classification:

A
  1. Use a TensorFlow Estimator bundled to Docker to create the Inception neural network and train the model
  2. Use a pre-built TensorFlow container image to write the inception network code and train the model
46
Q

SageMaker built-in algorithm for click prediction?

A

Factorization Machine

47
Q

Way to increase quality and reduce size of text data for a Natural Language Processing (NLP)

A

Remove stop words

48
Q

Algorithm for predicting categorization to a dependent variable based on multiple independent variables (ex. favorite ice cream flavor based on demographic info)

A

Multinomial Logistic Regression

49
Q

Neural Network Image Classification: You want to re-train a pre-trained model with your own training data (transfer learning). How?

A

Initialize the model with pre-trained weights in all layers except the output layer. Initialize output layer with random weights.

50
Q

Most cost-effective way to accelerate inference workloads (SageMaker)

A

Amazon Elastic Inference (attaches a low-cost, fractional GPU accelerator to an instance instead of paying for a full GPU instance)

51
Q

When your supervised classification training data has a small minority class, how can you most effectively augment the training data?

A

Synthetic Minority Oversampling Technique (SMOTE)
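
A simplified sketch of the core SMOTE idea: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line between them. This is an illustration only, not the imbalanced-learn implementation; the helper names are hypothetical and it assumes at least two minority samples.

```python
import random


def nearest_neighbors(point, others, k):
    """The k points in `others` closest to `point` (squared Euclidean distance)."""
    def dist(other):
        return sum((a - b) ** 2 for a, b in zip(point, other))
    return sorted(others, key=dist)[:k]


def smote(minority, n_new, k, rng):
    """Generate `n_new` synthetic minority samples by interpolation."""
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        others = [m for m in minority if m is not point]
        neighbor = rng.choice(nearest_neighbors(point, others, k))
        gap = rng.random()  # position along the segment point -> neighbor
        synthetic.append([p + gap * (n - p) for p, n in zip(point, neighbor)])
    return synthetic


minority = [[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]]
new_points = smote(minority, n_new=5, k=2, rng=random.Random(0))
```

Because the new points are interpolated rather than copied, the minority class grows without exact duplicates.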

52
Q

Service to do on-premises ML on video streams

A

AWS Panorama device

53
Q

What can you use to improve accuracy of text sentiment analysis when dealing with wide vocabulary and infrequent use of some terms?

A

TfIdf - Term Frequency-Inverse Document Frequency

OR

Scikit-learn “TfidfVectorizer” Class

54
Q

What does “Term Frequency – Inverse Document Frequency (TfIdf)” do?

A

It vectorizes text data within a corpus and gives additional weight to infrequent words while reducing the weight of very common words.
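
A bare-bones sketch of the weighting, assuming the plain textbook formula tf * log(N / df) with whitespace tokenization. Scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ. The three documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the bird sang"]
weights = tf_idf(docs)
```

"the" appears in every document, so its weight drops to zero, while a word such as "cat" that appears in only one document gets a positive weight.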

55
Q

What SageMaker feature can be used to strip an attribute from data when sent to inference, then reapply the removed attribute when showing the predicted results?

A

SageMaker Batch Transform

56
Q

Suppose the relationship between two features in a classification model is radial rather than linear or quadratic. What algorithm can give the best results?

A

Support Vector Machines (SVM) with Radial Basis Function (RBF) Kernel

57
Q

What is F1 score?

A

The harmonic mean of Precision and Recall
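
Written out, F1 = 2 * P * R / (P + R). The sketch below uses made-up precision and recall values; returning 0.0 when both are zero is a convention chosen here.

```python
def f1_score(precision, recall):
    """F1 = 2 * P * R / (P + R), combining precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)
lopsided = f1_score(1.0, 0.0)
```

Unlike a simple average, the score collapses toward the weaker of the two inputs: perfect precision with zero recall still gives an F1 of 0.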

58
Q

How do you add your custom Python training script to an Amazon SageMaker notebook? (2)

A
  1. Store the training script in ‘/opt/ml/code’
  2. set it as the script entry point in the ‘SAGEMAKER_PROGRAM’ environment variable.
59
Q

What is the capacity of one shard in a Kinesis Data Stream? (input/output/records per second)

A

1 MB/s in, 2 MB/s out, 1,000 records per second (writes)
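
A back-of-the-envelope sizing sketch, reading the limits on this card as 1 MB/s write, 2 MB/s read, and 1,000 records/s write per shard. The function name and the workload numbers are hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1.0      # ingest limit per shard, MB/s
READ_MB_PER_SHARD = 2.0       # egress limit per shard, MB/s
RECORDS_PER_SHARD = 1000      # ingest limit per shard, records/s


def shards_needed(write_mb_per_s, read_mb_per_s, records_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
        math.ceil(records_per_s / RECORDS_PER_SHARD),
    )


# Hypothetical workload: 5 MB/s in, 4 MB/s out, 3,000 records/s.
count = shards_needed(5, 4, 3000)
```

The tightest constraint wins: here the 5 MB/s write rate requires 5 shards even though the read and record rates would fit in fewer.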

60
Q

Minimum number of days for an S3 data lifecycle?

A

30 days

61
Q

How can you use a custom algorithm bundled in a TensorFlow Docker container to run as an executable in a SageMaker environment?

A

Modify the Dockerfile by adding the training script as an ENTRYPOINT

62
Q

How can you get an Amazon Personalize-powered preference identifying solution to continually improve/maintain its accuracy?

A

configure an event tracker in Amazon Personalize

63
Q

How can you quickly build and deploy curated recommendations and intelligent user segmentation at scale using machine learning?

A

Amazon Personalize

63
Q

How can you reduce cost and operational overhead when scaling a deep neural network that runs on GPU-based instances? The job requires a centralized queue, automatic retries, and a scheduled weekly run.

A

GPU-based Spot Instances in AWS Batch.

63
Q

Key metric for auto-scaling SageMaker endpoints

A

InvocationsPerInstance

64
Q

In Amazon EMR clusters, (1) what are the three node types, and (2) which one can run on spot instances without the risk of data loss?

A
  1. Master
  2. Core
  3. Task - can run on spot instances to save $$
65
Q

What is the most cost effective way to host multiple versions of the same SageMaker model? (2)

A

Use a multi-model endpoint,

pass the selected TargetModel parameter

66
Q

Most cost-effective and efficient way to ensure incoming requests are properly formatted for your public SageMaker endpoint?

A

API Gateway mapping templates

67
Q

What is the key distinction between a Full Bayesian network and a Naive Bayesian network?

A

Full Bayesian is most effective when there is significant statistical correlation between some variables in your data set.

Naive Bayesian networks are more effective when there are no significant statistical correlations between variables in your data set.

68
Q

Amazon Forecast built-in algorithm for scenarios where there are complex and potentially nonlinear relationships between different factors, such as weather patterns, population density, and ridership numbers, with many historical data points?

A

Convolutional Neural Network - Quantile Regression (CNN-QR)

69
Q

Amazon Forecast built-in algorithm for scenarios where there’s a lack of historical data or when the statistical properties of the data change over time.

A

Non-parametric Time Series (NPTS)

70
Q

Amazon Forecast built-in algorithm for scenarios where there are data with clear trends and assumed linear relationships between variables.

A

Autoregressive Integrated Moving Average (ARIMA)

71
Q

Amazon Forecast built-in algorithm for scenarios where there are simple datasets and datasets with seasonality patterns.

A

Exponential Smoothing (ETS)

72
Q

Convolutional Neural Network with transfer learning on SageMaker - how can you reduce training time without compromising quality of the model?

A

Use a SageMaker Debugger hook in the training script to identify low-ranking filters in your tensors, then remove them ("model pruning").

73
Q

What feature can allow less-technical employees to automate simple data processing tasks, such as normalization and filling in missing values?

A

AWS Glue DataBrew

74
Q

Which service allows your own training experts to label data for training?

A

SageMaker Ground Truth

75
Q

After setting your evaluation metric to AUC for a binary classification model, which hyperparameter can be adjusted to adjust the false negative rate?

A

scale_pos_weight

76
Q

What kind of algorithm is the Mean Average Precision (MAP) metric useful for?

A

Ranking algorithms

77
Q

What is a Keras Convolutional Neural Network (CNN) usually used for?

A

Problems dealing with image data