AWS ML Flashcards
What configuration must be set to allow for NVIDIA GPU training?
containers must be ‘nvidia-docker’ compatible
What is K-NN?
K-Nearest Neighbors: an algorithm that classifies a point (or predicts its value) from the k closest labeled points.
What does it likely mean if a K-NN algorithm is producing low accuracy and precision, despite hyperparameter adjustments?
The numerical range differences of the variables are too high. Normalizing these numeric values can keep high magnitudes from dominating the model.
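A minimal sketch of the effect in plain Python. The data and the helper names `min_max_scale` and `nearest` are made up for illustration; they are not a SageMaker API. Before scaling, income dominates the distance; after scaling, both features contribute.

```python
import math

def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest(rows, query_index):
    """Index of the row closest (Euclidean distance) to rows[query_index]."""
    query = rows[query_index]
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], query))

# Hypothetical data: (age in years, income in dollars)
people = [
    [25, 50_000],  # 0: the query point
    [27, 52_000],  # 1: similar age and similar income
    [60, 50_100],  # 2: very different age, nearly identical income
    [40, 90_000],  # 3: widens the income range
]

raw_neighbor = nearest(people, 0)                    # income dominates
scaled_neighbor = nearest(min_max_scale(people), 0)  # both features count
print(raw_neighbor, scaled_neighbor)
```

On the raw values the 60-year-old is picked as the nearest neighbor because a 100-dollar income gap is tiny next to a 2,000-dollar one; after scaling, the 27-year-old is picked.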
What does it mean when, as epochs increase, the training error rate decreases exponentially but the model generalizes poorly to unseen data?
Your model is overfitting
What are L1 and L2 regularization used for?
Preventing overfitting in training (linear models)
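A rough sketch of how the two penalties are added to a loss. The function names and the numbers are illustrative assumptions, not taken from any particular library.

```python
def l1_penalty(weights, lam):
    """L1 (lasso) term: lam * sum of |w|. Tends to push weights to exactly 0."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (ridge) term: lam * sum of w^2. Shrinks large weights toward 0."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Training loss plus the chosen penalty term."""
    penalty = l1_penalty if kind == "l1" else l2_penalty
    return data_loss + penalty(weights, lam)

weights = [1.0, -2.0, 0.0]
l1_total = regularized_loss(0.25, weights, lam=0.5, kind="l1")
l2_total = regularized_loss(0.25, weights, lam=0.5, kind="l2")
print(l1_total, l2_total)
```

Because L1 can zero out weights it also acts as a form of feature selection, while L2 keeps every feature but with smaller weights.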
Pulling data from Neptune, what algorithm can predict user preferences based on the patterns observed in other users?
Collaborative filtering.
This uses (user, item, rating) tuples to leverage other users' experiences. It fits this case better than content-based filtering, which relies on item attributes rather than on other users' behavior.
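A toy user-based version of the idea, assuming cosine similarity over co-rated items. The users, items, and ratings are invented, and a production recommender would use a different, more scalable method.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"book": 5, "lamp": 1, "desk": 4},
    "bob":   {"book": 4, "lamp": 1, "desk": 5, "chair": 5},
    "carol": {"book": 1, "lamp": 5, "chair": 1},
}

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = denominator = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        similarity = cosine(ratings[user], theirs)
        numerator += similarity * theirs[item]
        denominator += similarity
    return numerator / denominator if denominator else None

prediction = predict("alice", "chair")
print(prediction)
```

Alice's tastes match Bob's far more than Carol's, so the predicted rating for the chair lands much closer to Bob's 5 than to Carol's 1.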
What data format can help maximize the efficiency of queries?
Apache Parquet (a columnar format)
Which AWS service should be used to migrate data from an on-prem MySQL DB to S3?
AWS Database Migration Service (DMS)
What algorithm should be used to predict the sales of certain products based on time series data?
SageMaker DeepAR
If you’re using Comprehend to analyze sentiments about products, and you wish to add a specific category for entity names, because they’re all currently labeled “Commercial Item”, what can you use?
A “Custom Entity Recognition model”
What can be used for determining themes/topics from a collection of documents?
Topic Modeling
In what situation would you want to increase the dropout rate at the hidden layer?
When you are experiencing overfitting in a Neural Network model.
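A small sketch of what a higher dropout rate does during training, using the common "inverted dropout" formulation. The function name and arguments are illustrative assumptions; deep learning frameworks provide their own dropout layers.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors
    by 1 / (1 - rate) so the expected sum is unchanged (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded so runs are repeatable
hidden = [1.0] * 10
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

Raising `rate` silences more units on each pass, which makes it harder for the network to memorize the training set.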
In Polly, how do you introduce proprietary pronunciations for your application?
pronunciation lexicons
What metric is used to optimize for true positives in a classification model?
Area under the ROC curve (AUC)
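For intuition, AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal pairwise sketch follows; the function name and sample data are illustrative, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

A value of 1.0 means every positive outranks every negative; 0.5 is no better than random ordering.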
What is the best metric for assessing the accuracy of a regression model?
The Root Mean Squared Error (RMSE)
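A short sketch of the calculation, with an illustrative function name and made-up values:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

error = rmse([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(error)
```

RMSE is in the same units as the target, and squaring penalizes large errors more heavily than small ones.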
What service can provide human review of low-confidence AI/ML predictions?
Amazon Augmented AI
What is the major drawback of the Kinesis Producer Library? (2)
It can only write to Kinesis Data Streams, not read from them.
Use the Kinesis Client Library (KCL) if your custom processing solution needs to read from a stream.
What storage solution can significantly speed up training steps for models with large datasets in S3?
FSx for Lustre
What algorithm should you use for a supervised classification task (i.e. desired classifications are provided in the training data)?
XGBoost with the “objective” hyperparameter set to “multi:softmax”
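A sketch of the matching hyperparameters as a Python dict. The values are placeholders. `num_class` must be set when the objective is `multi:softmax`, and the SageMaker built-in XGBoost algorithm also requires `num_round`.

```python
# Placeholder values for illustration; tune them for a real dataset.
xgboost_hyperparameters = {
    "objective": "multi:softmax",  # output one predicted class label per row
    "num_class": 3,                # assumed number of target classes
    "num_round": 100,              # assumed number of boosting rounds
}
print(xgboost_hyperparameters)
```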
What is the simplest way to use DynamoDB data in a SageMaker Jupyter notebook instance?
Use Data Pipeline to export the data to the appropriate S3 location.
What strategy can help improve the generalizability of a Natural Language Processing (NLP) model?
download a pre-trained word embedding
What kind of algorithm would you use to categorize text documents into undefined categories (i.e. an unsupervised categorization)?
A Latent Dirichlet Allocation (LDA) algorithm
What Redshift feature can help with the direct streaming of data to Redshift? What AWS service is it designed to be compatible with?
Redshift streaming ingestion
Amazon Kinesis Data Streams (NOT Firehose)
What is the main difference between linear regression and logistic regression?
Linear regression predicts a continuous range of values, while logistic regression predicts the probability of a binary outcome.
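A one-feature sketch of the difference. The weights and function names are made up for illustration: the linear model returns the raw weighted sum, while the logistic model passes it through a sigmoid and applies a threshold.

```python
import math

def linear_predict(x, weight, bias):
    """Unbounded continuous output."""
    return weight * x + bias

def logistic_predict(x, weight, bias, threshold=0.5):
    """Probability in (0, 1) plus the resulting 0/1 class label."""
    probability = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
    return probability, int(probability >= threshold)

print(linear_predict(10, weight=2.0, bias=1.0))
print(logistic_predict(-3, weight=2.0, bias=0.0))
```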
What is the easiest way to improve the generalizability of a binary classification model?
See how adjusting the “score” threshold affects the model performance
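A small sketch of how moving the threshold trades precision against recall. The labels, scores, and function name are invented for illustration.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]
for threshold in (0.35, 0.5, 0.8):
    precision, recall = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold catches more positives (higher recall) at the cost of more false positives; raising it does the opposite.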
What is the simplest and cheapest AWS-native way to achieve a recommender system from data held in a Redshift cluster?
in-database local inference using Redshift ML
What is a visualization technique to determine whether a regression model is over- or under-estimating compared to true values?
Residual plot (residual = actual value minus predicted value).
Positive residual - underestimation
Negative residual - overestimation
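The sign convention can be sketched as follows, assuming residual = actual minus predicted. The values and helper names are illustrative.

```python
def residuals(actual, predicted):
    """Residual for each point: actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]

def describe(residual):
    """Interpret the sign of a single residual."""
    if residual > 0:
        return "underestimate"
    if residual < 0:
        return "overestimate"
    return "exact"

actual = [10.0, 20.0, 30.0]
predicted = [8.0, 25.0, 30.0]
signs = [describe(r) for r in residuals(actual, predicted)]
print(signs)
```

Plotting these residuals against the predicted values shows whether the model is biased in one direction: points mostly above zero mean it is underestimating, mostly below zero mean it is overestimating.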
What are the four most common data formats used in SageMaker built-in algorithms?
CSV
recordIO-protobuf
image files (jpg, png)
text (for BlazingText)
Between Firehose and Glue, which service can convert CSV to Apache Parquet with the least overhead?
Glue.
Firehose can natively convert JSON records to Parquet, but not CSV.
If you have already applied Principal Component Analysis (PCA), what is the best way to further reduce the dimensionality of your data?
t-distributed stochastic neighbor embedding (t-SNE)
When there are missing values in a column of data, what is the best way to treat the data to produce the best representation possible?
Multiple Imputation by Chained Equations (MICE)
What two steps are needed to run a SageMaker TensorFlow ML project locally?
- pull the Docker container
- install the SageMaker SDK for local development: "pip install -U sagemaker"