ML Fundamentals Flashcards
allows people to store objects (files) in “buckets”
(directories)
Amazon S3
What is this path called: <my_bucket>/my_folder1/another_folder/my_file.txt
S3 Bucket Key
- Pattern for speeding up range queries (ex: AWS Athena)
- By Date: s3://bucket/my-dataset/year/month/day/hour/data_00.csv
- By Product: s3://bucket/my-data-set/product-id/data_32.csv
Amazon S3 Data Partitioning
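A minimal sketch (not from the card) of writing an object under a date-partitioned key with boto3; the bucket, prefix, and file name are made-up placeholders.

```python
# Build a date-partitioned key matching s3://bucket/my-dataset/year/month/day/hour/data_00.csv
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
now = datetime.now(timezone.utc)

key = f"my-dataset/{now:%Y}/{now:%m}/{now:%d}/{now:%H}/data_00.csv"
s3.upload_file("data_00.csv", "my-bucket", key)

# Athena can then prune partitions in a range query, e.g.
# ... WHERE year = '2024' AND month = '06'
```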
Durability or availability:
* If you store 10,000,000 objects with Amazon S3, you can on average
expect to incur a loss of a single object once every 10,000 years
* Same for all storage classes
Durability
Durability or availability:
* Measures how readily available a service is
* Varies depending on storage class
Availability
What S3 storage class is the below:
* 99.99% Availability
* Used for frequently accessed data
* Low latency and high throughput
* Sustain 2 concurrent facility failures
* Use Cases: Big Data analytics, mobile & gaming applications,
content distribution…
S3 Standard – General Purpose
What S3 Storage class:
*For data that is less frequently accessed, but requires rapid access
when needed
* Lower cost than S3 Standard
* 99.9% Availability
* Use cases: Disaster Recovery, backups
- Amazon S3 Standard-Infrequent Access (S3 Standard-IA)
What S3 Storage class:
*For data that is less frequently accessed, but requires rapid access
when needed
* Lower cost than S3 Standard
* High durability (99.999999999%) in a single AZ; data lost when AZ is destroyed
* 99.5% Availability
* Use Cases: Storing secondary backup copies of on-premise data, or data you
can recreate
- Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA)
What S3 Storage class:
Small monthly monitoring and auto-tiering fee
* Moves objects automatically between Access Tiers based on usage
* There are no retrieval charges in S3 Intelligent-Tiering
S3 Intelligent-Tiering
Describe the S3 storage Intelligent Tiering classes below:
*__________: default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* ______: objects not accessed for 90 days
* _________: configurable from 90 days to 700+ days
* ________: config. from 180 days to 700+ days
Frequent Access tier (automatic): default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* Archive Instant Access tier (automatic): objects not accessed for 90 days
* Archive Access tier (optional): configurable from 90 days to 700+ days
* Deep Archive Access tier (optional): config. from 180 days to 700+ days
- Helps you decide when to transition objects to the right storage class
- Recommendations for Standard and Standard-IA
- Does NOT work for One-Zone IA or Glacier
- Report is updated daily
- 24 to 48 hours to start seeing data analysis
- Good first step to put together Lifecycle Rules (or improve them)!
Amazon S3 Analytics
Bucket-wide rules set from the S3 console - allow cross-account access
S3 Bucket policies
_____ is a managed alternative to Apache Kafka
* Great for application logs, metrics, IoT, clickstreams
* Great for “real-time” big data
* Great for streaming processing frameworks (Spark, NiFi, etc…)
* Data is automatically replicated synchronously across 3 AZs
Amazon Kinesis
__________ low latency streaming ingest at scale
Kinesis Streams
________ perform real-time analytics on streams using SQL
Kinesis Analytics
_________ load streams into S3, Redshift, ElasticSearch & Splunk
Kinesis Firehose
______ meant for streaming video in real-time
Kinesis Video Streams
Kinesis Streams are divided into ordered ______
Shards
What are the two capacity modes for Kinesis Data streams?
Provisioned and On-Demand modes
What Kinesis data stream capacity mode is below:
*You choose the number of shards provisioned, scale manually or using API
* Each shard gets 1MB/s in (or 1000 records per second)
* Each shard gets 2MB/s out (classic or enhanced fan-out consumer)
* You pay per shard provisioned per hour
Provisioned
What Kinesis data stream capacity mode is below:
* No need to provision or manage the capacity
* Default capacity provisioned (4 MB/s in or 4000 records per second)
* Scales automatically based on observed throughput peak during the last 30
days
* Pay per stream per hour & data in/out per GB
On-demand mode
What Kinesis service is this:
*Fully Managed Service, no administration
* Near Real Time (60 second minimum latency for non-full batches)
* Data Ingestion into Redshift / Amazon S3 / ElasticSearch / Splunk
* Automatic scaling
* Supports many data formats
* Data Conversions from CSV / JSON to Parquet / ORC (only for S3)
* Data Transformation through AWS Lambda (ex: CSV => JSON)
* Supports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)
Kinesis data firehose
What's the difference between Kinesis Data Streams and Firehose?
*Streams
* Going to write custom code (producer / consumer)
* Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
* Automatic scaling with On-demand Mode
* Data Storage for 1 to 365 days, replay capability, multi consumers
*Firehose
* Fully managed, send to S3, Splunk, Redshift, ElasticSearch
* Serverless data transformations with Lambda
* Near real time (lowest buffer time is 1 minute)
* Automated Scaling
* No data storage
What Kinesis tool is this:
Use cases
* Streaming ETL: select columns, make simple transformations, on streaming
data
* Continuous metric generation: live leaderboard for a mobile game
* Responsive analytics: look for certain criteria and build alerting (filtering)
* Features
* Pay only for resources consumed (but it’s not cheap)
* Serverless; scales automatically
* Use IAM permissions to access streaming source and destination(s)
* SQL or Flink to write the computation
* Schema discovery
* Lambda can be used for pre-processing
Kinesis data analytics
For Kinesis Analytics, you pay only for ______ (but it’s not cheap)
resources consumed
Is amazon kinesis serverless?
Yes
What amazon data product has the below characteristics:
- Producers: security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds, images, RADAR data, RTSP camera
- One producer per video stream
- Video playback capability
- Consumers:
  - build your own (MXNet, Tensorflow)
  - AWS SageMaker
  - Amazon Rekognition Video
- Keep data for 1 hour to 10 years
Kinesis video stream
__________ create real-time machine learning
applications
Kinesis Data Stream
_____ ingest massive data near-real time
Kinesis Data Firehose
___________ real-time ETL / ML algorithms on
streams
Kinesis Data Analytics
___________ real-time video stream to create ML
applications
Kinesis Video Stream
- Metadata repository for all your tables
- Automated Schema Inference
- Schemas are versioned
- Integrates with Athena or Redshift Spectrum (schema & data discovery)
Glue data catalog
____ go through your data to infer schemas and partitions
* Works with JSON, Parquet, CSV, relational stores
Glue crawlers
Transform data, Clean Data, Enrich Data (before doing analysis)
* Generate ETL code in Python or Scala, you can modify the code
* Can provide your own Spark or PySpark scripts
* Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
* Fully managed, cost effective, pay only for the resources consumed
* Jobs are run on a serverless Spark platform
Glue ETL
What type of data store is this:
Data Warehousing, SQL
analytics (OLAP - Online
analytical processing)
Redshift
What type of data store is this:
Relational Store, SQL (OLTP -
Online Transaction Processing)
* Must provision servers in
advance
- RDS, Aurora:
What type of data store is this:
NoSQL data store, serverless,
provision read/write capacity
* Useful to store a machine
learning model served by your
application
- DynamoDB:
What type of data store is this:
Object storage
* Serverless, infinite storage
* Integration with most AWS
Services
S3
What type of data store is this:
- Indexing of data
- Search amongst data points
- Clickstream Analytics
OpenSearch (previously
ElasticSearch)
What type of data store is this:
- Caching mechanism
- Not really used for Machine
Learning
- ElastiCache
What AWS data service has the below features:
Destinations include S3, RDS,
DynamoDB, Redshift and EMR
* Manages task dependencies
* Retries and notifies on failures
* Data sources may be on-premises
* Highly available
AWS Data Pipeline
What are the differences between AWS Data Pipeline and AWS Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on the
ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift Spectrum
* Data Pipeline:
* Orchestration service
* More control over the environment, compute resources that run code, & code
* Allows access to EC2 or EMR instances (creates resources in your own
account)
What AWS data service is below:
- Run batch jobs as Docker images
- Dynamic provisioning of the instances (EC2 & Spot Instances)
- Optimal quantity and type based on volume and requirements
- No need to manage clusters, fully serverless
- You just pay for the underlying EC2 instances
AWS Batch
What is the difference between AWS Batch and Glue?
- Glue:
  - Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
  - Glue ETL - Do not worry about configuring or managing the resources
  - Data Catalog to make the data available to Athena or Redshift Spectrum
- Batch:
  - For any computing job regardless of the job (must provide Docker image)
  - Resources are created in your account, managed by Batch
  - For any non-ETL related work, Batch is probably better
What AWS data service has the below features:
- Quickly and securely migrate databases to AWS, resilient, self healing
- The source database remains available during the migration
- Supports:
  - Homogeneous migrations: ex Oracle to Oracle
  - Heterogeneous migrations: ex Microsoft SQL Server to Aurora
- Continuous Data Replication using CDC
- You must create an EC2 instance to perform the replication tasks
AWS Database Migration Service - DMS
What is the difference between AWS DMS and Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on
the ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift
Spectrum
* AWS DMS:
* Continuous Data Replication
* No data transformation
* Once the data is in AWS, you can use Glue to transform it
What AWS Data service has the below features:
For data migrations from on-premises to AWS storage services
* A DataSync Agent is deployed as a VM and connects to your
internal storage
* NFS, SMB, HDFS
* Encryption and data validation
AWS DataSync
- An Internet of Things (IOT) thing
- Standard messaging protocol
- Think of it as how lots of sensor data might get transferred to your machine learning model
- The AWS IoT Device SDK can connect via ____
MQTT
What are the three major types of data?
- Numerical
- Categorical
- Ordinal
______ Represents some sort of quantitative
measurement
* Heights of people, page load times, stock
prices, etc.
Numerical
_______ is Integer based; often counts of some event.
* How many purchases did a customer make in a
year?
* How many times did I flip “heads”?
Discrete data
__________
* Has an infinite number of possible values
* How much time did it take for a user to check
out?
* How much rain fell on a given day?
- Continuous Data
___________ is Qualitative data that has no
inherent mathematical meaning
* Gender, Yes/no (binary data),
Race, State of Residence, Product
Category, Political Party, etc.
Categorical data
A mixture of numerical and
categorical
* Categorical data that has
mathematical meaning
* Example: movie ratings on a 1-5
scale.
* Ratings must be 1, 2, 3, 4, or 5
* But these values have mathematical
meaning; 1 means it’s a worse movie
than a 2.
Ordinal data
What AWS service has the below characteristics:
- Interactive query service for S3 (SQL)
- No need to load data, it stays in S3
- Presto under the hood
- Serverless!
- Supports many data formats
- CSV (human readable)
- JSON (human readable)
- ORC (columnar, splittable)
- Parquet (columnar, splittable)
- Avro (splittable)
- Unstructured, semi-structured, or structured
Amazon athena
What AWS service uses the below scenarios?
- Ad-hoc queries of web logs
- Querying staging data before loading to Redshift
- Analyze CloudTrail / CloudFront / VPC / ELB etc. logs in S3
- Integration with Jupyter, Zeppelin, RStudio notebooks
- Integration with QuickSight
- Integration via ODBC / JDBC with
other visualization tools
amazon athena
What AWS service has the below cost model?
Pay-as-you-go
* $5 per TB scanned
* Successful or cancelled queries
count, failed queries do not.
* No charge for DDL
(CREATE/ALTER/DROP etc.)
* Save LOTS of money by using
columnar formats
* ORC, Parquet
* Save 30-90%, and get better
performance
Athena
What AWS Service has the below characteristics:
- Fast, easy, cloud-powered business analytics service
- Allows all employees in an organization to:
  - Build visualizations
  - Perform ad-hoc analysis
  - Quickly get business insights from data
  - Anytime, on any device (browsers, mobile)
- Serverless
Quicksight
What is the in memory database that is used by quicksight?
SPICE
What quicksight service is below:
Machine learning-powered
* Answers business questions with Natural
Language Processing
* “What are the top-selling items in Florida?”
* Offered as an add-on for given regions
* Personal training on how to use it is
required
* Must set up topics associated with
datasets
* Datasets and their fields must be NLP-friendly
* How to handle dates must be defined
Quicksight Q
What quicksight service is below:
Reports designed to
be printed
* May span many pages
* Can be based on
existing Quicksight
dashboards
* New in Nov 2022
Paginated Reports
What AWS Service is this:
- Managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive & more
- EMR Notebooks
- Several integration points with AWS
Amazon EMR (Elastic Map Reduce)
What is this called:
Applying your knowledge of the data – and the model you’re
using - to create better features to train your model with.
* Which features should I use?
* Do I need to transform these features in some way?
* How do I handle missing data?
* Should I create new features from the existing ones?
Feature engineering
What is the Curse of Dimensionality?
Too many features can be a problem –
leads to sparse data
* Every feature is a new dimension
* Much of feature engineering is selecting
the features most relevant to the
problem at hand
* This often is where domain knowledge
comes into play
What AI data cleansing concept is below:
Replace missing values with the mean value
from the rest of the column (columns, not rows!
A column represents a single feature; it only
makes sense to take the mean from other
samples of the same feature.)
* Fast & easy, won’t affect mean or sample size
of overall data set
* Median may be a better choice than mean
when outliers are present
Mean replacement
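A small sketch of mean vs. median imputation with pandas (the "age" column and values are made up for illustration).

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [22, 35, np.nan, 41, np.nan, 120]})  # 120 is an outlier

df["age_mean_filled"] = df["age"].fillna(df["age"].mean())      # pulled up by the outlier
df["age_median_filled"] = df["age"].fillna(df["age"].median())  # more robust to outliers
print(df)
```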
What are the cons of mean replacement?
Only works on column level, misses correlations
between features
* Can’t use on categorical features (imputing with
most frequent value can work in this case, though)
* Not very accurate
What solution to missing data is this :
If not many rows contain missing data…
* …and dropping those rows doesn’t bias your
data…
* …and you don’t have a lot of time…
* …maybe it’s a reasonable thing to do.
* But, it’s never going to be the right
answer for the “best” approach.
Dropping data
What are the three ways to solve missing data with machine learning techniques?
*KNN: Find K “nearest” (most similar) rows and average their values
* Assumes numerical data, not categorical
* There are ways to handle categorical data (Hamming distance), but
categorical data is probably better served by…
* Deep Learning
* Build a machine learning model to impute data for your machine learning
model!
* Works well for categorical data. Really well. But it’s complicated.
* Regression
* Find linear or non-linear relationships between the missing feature and other
features
* Most advanced technique: MICE (Multiple Imputation by Chained Equations)
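A hedged sketch of the two scikit-learn imputers that roughly correspond to the KNN and regression/MICE ideas above; the toy array is made up, and IterativeImputer is scikit-learn's MICE-like estimator, not the exact technique named in the card.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required opt-in)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

X_knn = KNNImputer(n_neighbors=2).fit_transform(X)           # average of the nearest rows
X_iter = IterativeImputer(random_state=0).fit_transform(X)   # chained regressions (MICE-like)
print(X_knn)
print(X_iter)
```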
What kind of data is this:
Large discrepancy between
“positive” and “negative”
cases
* i.e., fraud detection. Fraud is rare, and most rows will be not-fraud
* Don’t let the terminology
confuse you; “positive” doesn’t
mean “good”
* It means the thing you’re testing
for is what happened.
* If your machine learning model
is made to detect fraud, then
fraud is the positive case.
* Mainly a problem with neural
networks
unbalanced data
To improve AI Data quality, what is the term below:
Artificially generate new samples of the minority class using
nearest neighbors
* Run K-nearest-neighbors of each sample of the minority class
* Create a new sample from the KNN result (mean of the neighbors)
* Both generates new samples and undersamples majority class
* Generally better than just oversampling
SMOTE (* Synthetic Minority Over-sampling TEchnique)
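A minimal sketch using the third-party imbalanced-learn package (an assumption; the card doesn't name a library) to oversample the minority class with SMOTE on synthetic data.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # minority class synthetically oversampled
```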
If you have too many false positives, one
way to fix that is to simply increase that
_________
threshold
_____ is simply the average of the squared
differences from the mean
Variance
_____ is just the square root
of the variance.
Standard Deviation 𝜎
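A quick numeric check of the two definitions above (plain population variance), using made-up values.

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 4.0, 8.0])
variance = np.mean((x - x.mean()) ** 2)   # average squared difference from the mean
sigma = np.sqrt(variance)                 # standard deviation is its square root

assert np.isclose(variance, np.var(x)) and np.isclose(sigma, np.std(x))
print(variance, sigma)
```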
Bucket observations together based
on ranges of values.
* Example: estimated ages of people
* Put all 20-somethings in one
classification, 30-somethings in another,
etc
Binning
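A sketch of binning ages into decades with pandas; the bin edges and values are illustrative.

```python
import pandas as pd

ages = pd.Series([23, 27, 34, 38, 45, 52])
decades = pd.cut(ages, bins=[20, 30, 40, 50, 60], labels=["20s", "30s", "40s", "50s"])
print(decades.value_counts())  # each observation bucketed into its decade
```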
Applying some function to a feature to make it
better suited for training
Transforming
Transforming data into some new
representation required by the
model
encoding
Some models prefer feature data to be
normally distributed around 0 (most
neural nets)
* Most models require feature data to at
least be scaled to comparable values
* Otherwise features with larger magnitudes
will have more weight than they should
* Example: modeling age and income as
features – incomes will be much higher
values than ages
Scaling/normalization
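A minimal scikit-learn sketch standardizing age and income so both features sit on comparable scales (values are made up).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25, 40_000], [35, 90_000], [52, 150_000]], dtype=float)  # [age, income]
X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0, std 1
print(X_scaled)
```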
Many algorithms benefit from
_____ their training data
* Otherwise they may learn from
residual signals in the training
data resulting from the order in
which they were collected
shuffling
What is Ground Truth?
- Ground Truth manages humans who will label your data for training purposes
- Ground Truth creates its own model as images are labeled by people
- As this model learns, only images the model isn't sure about are sent to human labelers
Turnkey solution
* “Our team of AWS Experts”
manages the workflow and team of labelers
* You fill out an intake form
* They contact you and discuss
pricing
Ground truth plus
- AWS service for image recognition
- Automatically classify images
Rekognition
- AWS service for text analysis and topic modeling
- Automatically classify text by topics, sentiment
Comprehend
- Important data for search – figures out what terms are most relevant for a document
TF-IDF
* Stands for Term Frequency and Inverse Document Frequency
- Just measures how often a word occurs in a document
- A word that occurs frequently is probably important to that document's meaning
Term Frequency
_____ is how often a word occurs in an entire
set of documents, i.e., all of Wikipedia or every web page
* This tells us about common words that just appear everywhere no
matter what the topic, like "a", "the", "and", etc.
Document Frequency
Can you explain bi grams and tri grams?
An extension of TF-IDF is to not only compute relevancy for
individual words (terms) but also for bi-grams or, more
generally, n-grams.
* “I love certification exams”
* Unigrams: “I”, “love”, “certification”, “exams”
* Bi-grams: “I love”, “love certification”, “certification exams”
* Tri-grams: “I love certification”, “love certification exams”
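A hedged sketch of TF-IDF over unigrams and bi-grams with scikit-learn, using the card's example sentence as a one-document corpus purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love certification exams"]
vec = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)  # unigrams + bi-grams
tfidf = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # e.g. 'love certification', 'certification exams'
```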
What are the three types of neural networks?
- Feedforward Neural Network
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNNs)
What kind of activation function is this:
It doesn’t really do
anything
* Can’t do backpropagation
Linear
What kind of activation function is this:
- It's on or off
- Can't handle multiple classification – it's binary after all
- Vertical slopes don't work well with calculus!
Binary step function
What kind of activation function is this:
- These can create complex mappings between inputs and outputs
- Allow backpropagation (because they have a useful derivative)
- Allow for multiple layers (linear functions degenerate to a single layer)
Non linear activation function
What kind of activation function is this:
- Nice & smooth
- Scales everything from 0-1 (Sigmoid / Logistic) or -1 to 1 (tanh / hyperbolic tangent)
- But: changes slowly for high or low values - the "Vanishing Gradient" problem
- Computationally expensive
- Tanh generally preferred over sigmoid
Sigmoid / Logistic / TanH
What kind of activation function is this:
Now we’re talking
* Very popular choice
* Easy & fast to
compute
* But, when inputs are
zero or negative, we
have a linear function
and all of its
problems
Rectified Linear Unit (ReLU)
What kind of activation function is this:
Solves “dying ReLU” by
introducing a negative
slope below 0 (usually not
as steep as this)
Leaky ReLU
What kind of activation function is this:
- ReLU, but the slope in the negative part is learned via backpropagation
- Complicated and YMMV
Parametric ReLU (PReLU)
What kind of activation function is this:
- From Google, performs really well
- But it’s from Google, not Amazon…
- Mostly a benefit with very deep
networks (40+ layers)
Swish
What kind of activation function is this:
- Outputs the max of the inputs
- Technically ReLU is a special case of maxout
- But doubles the parameters that need to be trained, not often practical
Maxout
- Used on the final output layer of a multi-class classification problem
- Basically converts outputs to probabilities of each classification
- Can't produce more than one label for something (sigmoid can)
Softmax
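A tiny numpy sketch of ReLU and a numerically stable softmax matching the descriptions above; the logits are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(logits):
    z = logits - np.max(logits)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()            # probabilities summing to 1

print(relu(np.array([-2.0, 0.5, 3.0])))
print(softmax(np.array([2.0, 1.0, 0.1])))
```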
What are convolutional neural networks used for?
When you have data that doesn’t
neatly align into columns
* Images that you want to find features
within
* Machine translation
* Sentence classification
* Sentiment analysis
* They can find features that aren’t in a
specific spot
* Like a stop sign in a picture
* Or words within a sentence
* They are “feature-location invariant”
_________
They can find features that aren’t in a
specific spot
* Like a stop sign in a picture
* Or words within a sentence
convolutional neural network
True or false:
CNNs are very resource-intensive (CPU, GPU,
and RAM)
true
What are recurrent neural networks used for?
Time-series data
* When you want to predict future behavior based
on past behavior
* Web logs, sensor logs, stock trades
* Where to drive your self-driving car based on
past trajectories
* Data that consists of sequences of arbitrary
length
* Machine translation
* Image captions
* Machine-generated music
What neural network should you use:
- Time-series data
- When you want to predict future behavior based on past behavior
- Web logs, sensor logs, stock trades
- Where to drive your self-driving car based on past trajectories
recurrent neural network
________ deep learning architectures
are what’s hot
* Adopts mechanism of “self-attention”
* Weighs significance of each part of the input data
* Processes sequential data (like words, like an RNN),
but processes entire input all at once.
* The attention mechanism provides context, so no
need to process one word at a time.
* BERT, RoBERTa, T5, GPT-2 etc., DistilBERT
* DistilBERT: uses knowledge distillation to reduce
model size by 40%
Transformer
What is it called when the below things are used in AI?
- NLP models (and others) are too big and complex to build from scratch and re-train every time
- The latest may have hundreds of billions of parameters!
- Model zoos such as Hugging Face offer pre-trained models to start from
- Integrated with Sagemaker via Hugging Face Deep Learning Containers
- You can fine-tune these models for your own use cases
transfer learning
Neural networks are trained
by ________ (or
similar means)
gradient descent
- Too high a learning rate
means you might _________
overshoot
the optimal solution!
- Too small a learning rate will
_____
take too long to find the
optimal solution
Learning rate is an example
of a ___________
hyperparameter
Smaller batch sizes can work their way out of _________
“local minima” more
easily
- Batch sizes that are too large can ________
end up getting stuck in the wrong solution
- Regularization techniques are
intended to prevent ________.
overfitting
true or false:
Overfitted models have learned patterns
in the training data that don’t generalize to
the real world
true
- Models that are good at making predictions on the data they were trained on, but not on new data it hasn’t seen
before
overfitting
What is the vanishing gradient problem?
When the slope of the learning
curve approaches zero, things
can get stuck
_ regularization: sum of weights
* Performs feature selection – entire features go to 0
* Computationally inefficient
* Sparse output
L1 regularization
__ regularization: sum of square of weights
* All features remain considered, just weighted
* Computationally efficient
* Dense output
L2 regularization
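A sketch contrasting L1 (Lasso) and L2 (Ridge) regularization in scikit-learn on synthetic data; note how L1 drives irrelevant coefficients exactly to zero (feature selection) while L2 keeps all features with small weights.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only 2 useful features

print(Lasso(alpha=0.1).fit(X, y).coef_)  # sparse output: irrelevant features go to 0
print(Ridge(alpha=1.0).fit(X, y).coef_)  # dense output: all features keep small weights
```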
What matrix does the below show?
- A test for a rare disease can be 99.9% accurate by just guessing "no" all the time
- We need to understand true positives and true negatives, as well as false positives and false negatives.
the confusion matrix
____ = AKA Sensitivity, True Positive rate, Completeness
* Percent of positives rightly predicted
* Good choice of metric when you care a lot
about false negatives
recall
What is the formula for recall?
Recall = True Positives / (True Positives + False Negatives)
____ = AKA Correct Positives
* Percent of relevant results
* Good choice of metric when you care a lot
about false positives
* i.e., medical screening, drug testing
precision
What is the formula for precision?
Precision = True Positives / (True Positives + False Positives)
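A worked check of the two formulas above using a toy confusion matrix; the counts are made up.

```python
# Toy counts: TP=80, FN=20, FP=10, TN=890
tp, fn, fp, tn = 80, 20, 10, 890

recall = tp / (tp + fn)       # 0.80 – sensitivity / true positive rate
precision = tp / (tp + fp)    # ~0.89 – fraction of predicted positives that are correct
print(recall, precision)
```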
- Plot of true positive rate (recall) vs. false positive rate at various threshold settings
- Points above the diagonal represent good classification (better than random)
- Ideal curve would just be a point in the upper-left corner
- The more it’s “bent” toward the upper-left, the better
ROC Curve
- Receiver Operating Characteristic Curve
Equal to probability that a classifier will rank
a randomly chosen positive instance higher
than a randomly chosen negative one
* ROC AUC of 0.5 is a useless classifier, 1.0
is perfect
* Commonly used metric for comparing
classifiers
- Area Under the Curve (AUC)
Good = higher area under
curve
* Similar to ROC curve
* But better suited for information
retrieval problems
* ROC can result in very small
values if you are searching
large number of documents for
a tiny number that are relevant
- Precision / Recall curve
__________ = Generate N new training sets by random sampling with replacement
* Each resampled model can be trained in parallel
bagging
_____ = * Observations are weighted
* Some will take part in new training sets more often
* Training is sequential; each classifier takes into account the
previous one’s success.
boosting
What type of sagemaker built in algorithm is this:
Linear regression
* Fit a line to your training data
* Predictions based on that line
* Can handle both regression
(numeric) predictions and
classification predictions
* For classification, a linear threshold
function is used.
* Can do binary or multi-class
Linear learner
For linear learner, it can handle both regression
(numeric) predictions and
_______ predictions
classification predictions
Linear Learner: What training input does it expect?
- RecordIO-wrapped protobuf
- CSV
- File or Pipe mode both supported
Linear learner:
Preprocessing
* Training data must be ______(so all features
are weighted the same)
* Linear Learner can do this for you automatically
normalized
What does sagemaker linear learner use in training?
Uses stochastic gradient descent
What type of sagemaker built in algorithm is this:
Boosted group of decision trees
* New trees made to correct the errors of
previous trees
* Uses gradient descent to minimize loss as
new trees are added
XGBoost
What type of training input does xgboost expect?
it takes CSV or libsvm input.
With xgboost, Models are serialized/deserialized with ___
Pickle
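A hedged local sketch using the open-source xgboost package (not the SageMaker built-in container itself): train a boosted-trees classifier and serialize it with pickle, mirroring the cards above; data and file names are made up.

```python
import pickle
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)  # boosted group of decision trees
model.fit(X, y)

with open("xgb_model.pkl", "wb") as f:
    pickle.dump(model, f)  # serialized/deserialized with pickle
```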
What type of sagemaker built in algorithm is this:
- Input is a sequence of tokens, output is a sequence of tokens
- Machine Translation
- Text summarization
- Speech to text
- Implemented with RNN’s and CNN’s with attention
Seq2Seq
What sagemaker built in algorithm maps to the below training inputs :
- RecordIO-Protobuf
- Tokens must be integers (this is unusual, since most algorithms want floating point data)
- Start with tokenized text files
- Convert to protobuf using sample code
- Packs into integer tensors with vocabulary files
- A lot like the TF/IDF lab we did earlier.
- Must provide training data, validation data, and vocabulary files.
Seq2Seq
Seq2Seq can optimize on:
- Accuracy (vs. provided validation dataset)
- ____ score (compares against multiple reference translations)
- Perplexity (cross-entropy)
BLEU score
Seq2Seq: Instance Types
Can only use ____ instance types
(P3 for example)
* Can only use a single machine for training
* But can use multi-GPU’s on one machine
GPU instance types
What sagemaker algorithm has the below characteristics?
- Forecasting one-dimensional time series data
- Uses RNN’s
- Allows you to train the same model over several related time series
- Finds frequencies and seasonality
DeepAR
What sagemaker algorithm has the below training input needs?
JSON lines format
* Gzip or Parquet
* Each record must contain:
* Start: the starting time stamp
* Target: the time series values
* Each record can contain:
* Dynamic_feat: dynamic features (such as, was
a promotion applied to a product in a time series of product purchases)
* Cat: categorical features
DeepAR
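An illustrative sketch of one JSON Lines training record in the shape described above; the field values are made up, and "dynamic_feat" and "cat" are optional.

```python
import json

record = {
    "start": "2024-01-01 00:00:00",        # starting time stamp
    "target": [12.0, 15.5, 14.2, 18.9],     # the time series values
    "dynamic_feat": [[0, 1, 1, 0]],         # e.g. was a promotion applied at each step
    "cat": [3],                             # categorical features
}
print(json.dumps(record))                   # one record per line in the training file
```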
For DeepAR, Always include entire _____ for
training, testing, and inference
time series
For deepAR, start with ___, move up to __ if necessary.
CPU, GPU
What sagemaker algorithm has the below characteristics:
- Text classification
- Predict labels for a sentence
- Useful in web searches, information retrieval
- Supervised
- Word2vec
- Creates a vector representation of words
- Semantically similar words are represented by vectors close to each other
- This is called a word embedding
- It is useful for NLP, but is not an NLP algorithm in itself!
- Used in machine translation, sentiment analysis
- Remember it only works on individual words, not sentences or documents
BlazingText
BlazingText: What training input does it expect?
- For supervised mode (text classification):
- One sentence per line
- First “word” in the sentence is the string __label__ followed by the label
- Also, “augmented manifest text format”
- Word2vec just wants a text file with one training sentence per line.
What type of sagemaker algorithm is below:
- It creates low-dimensional dense embeddings of high-dimensional objects
- It is basically word2vec, generalized to handle things other than words.
- Compute nearest neighbors of objects
- Visualize clusters
- Genre prediction
- Recommendations (similar items or
users)
Object2Vec
What type of algorithm has the below training requirements:
- Data must be tokenized into integers
- Training data consists of pairs of tokens and/or sequences of tokens
- Sentence – sentence
- Labels-sequence (genre to description?)
- Customer-customer
- Product-product
- User-item
Object2Vec
For object2vec, you Process data into ___ and shuffle it
JSON Lines
What are important hyperparameters for Object2Vec?
- The usual deep learning ones…
- Dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
- Enc1_network, enc2_network
- Choose hcnn, bilstm, pooled_embedding
What sagemaker algorithm is below:
- Identify all objects in an image with bounding boxes
- Detects and classifies objects with a single deep neural network
- Classes are accompanied by confidence scores
- Can train from scratch, or use pretrained models based on ImageNet
object detection
What are the two variants of sagemaker object detection?
MXNet and Tensorflow
* Takes an image as input, outputs all instances of objects in the image with categories and
confidence scores
* MXNet
* Uses a CNN with the Single Shot multibox Detector (SSD) algorithm
* The base CNN can be VGG-16 or ResNet-50
* Transfer learning mode / incremental training
* Use a pre-trained model for the base network weights,
instead of random initial weights
* Uses flip, rescale, and jitter internally to avoid overfitting
* Tensorflow
* Uses ResNet, EfficientNet, MobileNet models from
the TensorFlow Model Garden
What training input does object detection expect?
- MXNet: RecordIO or image format (jpg or png)
- With image format, supply a JSON file for annotation data for each image
What's the difference between object detection and image classification?
Object detection will show the specific location in the image where the object is. Image classification will classify the image and tell you what it is, not where it is.
Image Classification: What’s it for?
- Assign one or more labels to an image
- Doesn't tell you where objects are, just what objects are in the image
For image classification , there are Separate algorithms for ________ and _____
MXNet and Tensorflow
Semantic Segmentation: What’s it for?
- Pixel-level object classification
- Different from image classification – that assigns labels to whole images
- Different from object detection – that assigns labels to bounding boxes
- Useful for self-driving vehicles, medical imaging diagnostics, robot sensing
- Useful for self-driving vehicles,
medical imaging diagnostics, robot sensing
semantic segmentation
Semantic Segmentation: What training input does it expect?
- JPG Images and PNG annotations
- For both training and validation
- Label maps to describe annotations
- Augmented manifest image format supported for Pipe mode
- JPG images accepted for inference
What form of sagemaker algorithm tool has the below choices:
Choice of 3 algorithms:
* Fully-Convolutional Network (FCN)
* Pyramid Scene Parsing (PSP)
* DeepLabV3
semantic segmentation
Random cut forest is used for ________
anomaly detection
Neural Topic Model: What’s it for?
- Organize documents into topics
- Classify or summarize documents based on topics
- It's not just TF/IDF
- "bike", "car", "train", "mileage", and "speed" might classify a document as "transportation" for example (although it wouldn't know to call it that)
What are the four data channels for neural topic model?
- Four data channels
- “train” is required
- “validation”, “test”, and “auxiliary” optional
Neural Topic Model: How is it used?
- You define how many topics you want
- These topics are a latent representation based on top ranking words
- One of two topic modeling algorithms in SageMaker – you can try them both!
Another topic modeling algorithm
* Not deep learning
* Unsupervised
* The topics themselves are unlabeled; they are just groupings of documents
with a shared subset of words
* Can be used for things other than words
* Cluster customers based on purchases
* Harmonic analysis in music
- Latent Dirichlet Allocation (LDA)
What sagemaker algorithm:
Unsupervised; generates however many topics you specify
* Optional test channel can be used for scoring results
* Per-word log likelihood
* Functionally similar to NTM, but CPU-based
* Therefore maybe cheaper / more efficient
- Latent Dirichlet Allocation (LDA)
Simple classification or regression algorithm
* Classification
* Find the K closest points to a sample point and return the most frequent label
* Regression
* Find the K closest points to a sample point and return the average value
- K-Nearest-Neighbors - KNN
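A minimal scikit-learn sketch of the two KNN modes described above; the data is synthetic and k=3 is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y_cls = np.array([0, 0, 0, 1, 1, 1])                  # labels for classification
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])   # targets for regression

print(KNeighborsClassifier(n_neighbors=3).fit(X, y_cls).predict([[2.5]]))  # most frequent label
print(KNeighborsRegressor(n_neighbors=3).fit(X, y_reg).predict([[2.5]]))   # average value
```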
for KNN: SageMaker includes a ___________ stage
* Avoid sparse data (“curse of dimensionality”)
* At cost of noise / accuracy
* “sign” or “fjlt” methods
dimensionality reduction
These are important hyperparameters for what algorithm:
- K!
- Sample_size
KNN
What sagemaker algorithm:
- Unsupervised clustering
- Divide data into K groups, where members of a group are as similar as possible to each other
- You define what “similar” means
- Measured by Euclidean distance
- Web-scale K-Means clustering
K Means
These are important hyperparameters for what algorithm:
- K!
- Choosing K is tricky
- Plot within-cluster sum of squares as function of K
- Use “elbow method”
- Basically optimize for tightness of clusters
- Mini_batch_size
- Extra_center_factor
- Init_method
K means
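A sketch of the "elbow method" mentioned above: plot the within-cluster sum of squares (inertia) as a function of K and look for the bend; the blobs dataset is synthetic.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]
plt.plot(range(1, 9), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Within-cluster sum of squares")
plt.show()  # the "elbow" suggests K ≈ 4 here
```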
What is the below sagemaker algorithm:
- Dimensionality reduction
- Project higher-dimensional data (lots of features) into lower-dimensional (like a 2D plot) while minimizing loss of information
- The reduced dimensions are called components
- First component has largest possible variability
- Second component has the next largest…
- Unsupervised
- Principal Component Analysis
PCA
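A minimal scikit-learn PCA sketch: project 4-dimensional data down to 2 components and inspect how much variance each one explains (the iris dataset is used only as a convenient example).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                   # 150 samples x 4 features
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # reduced to 2 components
print(pca.explained_variance_ratio_)   # first component captures the most variability
```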
PCA: What training input does it expect?
- recordIO-protobuf or CSV
- File or Pipe on either
What sagemaker algorithm:
- Covariance matrix is created, then singular value decomposition (SVD)
- Two modes:
  - Regular: for sparse data and moderate number of observations and features
  - Randomized: for large number of observations and features; uses approximation algorithm
- Principal Component Analysis
PCA
What sagemaker algorithm:
Dealing with sparse data
* Click prediction
* Item recommendations
* Since an individual user doesn’t interact with most pages / products the data is sparse
* Supervised
* Classification or regression
* Limited to pair-wise interactions
* User -> item for example
factorization machines
What sagemaker algorithm:
Finds factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a
matrix representing some pair of things (users & items?)
* Usually used in the context of
recommender systems
factorization machines
What sagemaker algorithm:
* Unsupervised learning of IP address usage patterns
* Identifies suspicious behavior from IP addresses
* Identify logins from anomalous IP’s
* Identify accounts creating resources from anomalous IP’s
IP Insights
What sagemaker algorithm:
- Uses a neural network to learn latent vector representations of entities and IP addresses.
- Entities are hashed and embedded
- Need sufficiently large hash size
- Automatically generates negative samples during training by randomly pairing entities and IP’s
IP Insights
What sagemaker algorithm:
- You have some sort of agent that “explores” some space
- As it goes, it learns the value of different state changes in different conditions
- Those values inform subsequent behavior of the agent
- Examples: Pac-Man, Cat & Mouse game (game AI)
- Supply chain management
- HVAC systems
- Industrial robotics
- Dialog systems
- Autonomous vehicles
- Yields fast on-line performance once the space has been explored
reinforcement learning
What sagemaker algorithm:
- A specific implementation of reinforcement learning
- You have:
- A set of environmental states s
- A set of possible actions in those states a
- A value of each state/action Q
- Start off with Q values of 0
- Explore the space
- As bad things happen after a given state/action, reduce its Q
- As rewards happen after a given state/action, increase its Q
q learning
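A toy sketch of the tabular Q-learning update described above; the states, actions, rewards, and the environment step function are hypothetical placeholders, not any real SageMaker API.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))    # start off with Q values of 0
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

def step(state, action):
    # Placeholder environment: random next state, reward only from state 4.
    next_state = np.random.randint(n_states)
    reward = 1.0 if state == 4 else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    # Epsilon-greedy: usually exploit the best known action, sometimes explore.
    action = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Move Q toward reward + discounted best future value (increase on reward, decrease otherwise).
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
print(Q)
```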
Reinforcement Learning in SageMaker
* Uses a deep learning framework with ____ and ________
Tensorflow and MXNet
What is this called:
- SageMaker spins up a “HyperParameter Tuning Job” that trains as many combinations as you’ll allow
- Training instances are spun up as needed, potentially a lot of them
- The set of hyperparameters producing the best results can then be deployed as a model
- It learns as it goes, so it doesn’t have to try every possible
combination
Automatic Model Tuning
- Visual IDE for machine learning!
SageMaker Studio
Create and share
Jupyter notebooks with
SageMaker Studio
* Switch between
hardware configurations
(no infrastructure to
manage)
Sagemaker notebooks
- Organize, capture, compare, and search your ML jobs
Sagemaker experiments
- Saves internal model state at periodical intervals
- Gradients / tensors over time as a model is trained
- Define rules for detecting unwanted conditions while training
- A debug job is run for each rule you configure
- Logs & fires a CloudWatch event when the rule is hit
sagemaker debugger
- Automates:
- Algorithm selection
- Data preprocessing
- Model tuning
- All infrastructure
- It does all the trial & error for you
- More broadly this is called AutoML
Sagemaker autopilot
- Integrates with SageMaker Clarify
- Transparency on how models arrive at predictions
- Feature attribution
autopilot explainability
- Get alerts on quality deviations on your deployed models (via CloudWatch)
- Visualize data drift
  - Example: loan model starts giving people more credit due to drifting or missing input features
- Detect anomalies & outliers
- Detect new features
- No code needed
Sagemaker model monitor
- _________ detects potential bias
- i.e., imbalances across different groups / ages / income brackets
- With ModelMonitor, you can monitor for bias and be alerted to new potential bias via CloudWatch
- SageMaker Clarify also helps explain model behavior
- Understand which features contribute the
most to your predictions
SageMaker Clarify
- A “feature” is just a property used to train a machine learning model.
- Like, you might predict someone’s political party based on “features” such as their address, income, age, etc.
- Machine learning models require fast, secure access to feature data for training.
- It’s also a challenge to keep it
organized and share features
across different models.
sagemaker feature store
- Creates & stores your ML workflow (MLOps)
- Keep a running history of your models
- Tracking for auditing and compliance
- Automatically or manually-created tracking entities
- Integrates with AWS Resource Access Manager for cross-account lineage
SageMaker ML Lineage Tracking
- Visual interface (in SageMaker Studio) to prepare data for machine learning
- Import data
- Visualize data
- Transform data (300+ transformations to choose from)
- Or integrate your own custom transforms with pandas, PySpark, PySpark SQL
- "Quick Model" to train your model with your data and measure its results
Sagemaker data wrangler
- No-code machine learning for business analysts
- Upload CSV data (CSV only for now), select a column to predict, build it, and make predictions
- Can also join datasets
- Classification or regression
sagemaker canvas
________
* For asynchronous or real-time inference endpoints
* Controls shifting traffic to new models
* “Blue/Green Deployments”
* All at once: shift everything, monitor, terminate blue fleet
* Canary: shift a small portion of traffic and monitor
* Linear: Shift traffic in linearly spaced steps
* Auto-rollbacks
Deployment Guardrails
________
* Compare performance of shadow variant to production
* You monitor in SageMaker console and decide when to promote it
Shadow Tests
One facet (demographic group) has fewer training values than another
- Class Imbalance (CI)
- Imbalance of positive outcomes between facet values
- Difference in Proportions of Labels (DPL)
- How much outcome distributions of facets diverge
- Kullback-Leibler Divergence (KL), Jensen-Shannon Divergence (JS)
- P-norm difference between distributions of outcomes from facets
- Lp-norm (LP)
- L1-norm difference between distributions of outcomes from facets
- Total Variation Distance (TVD)
- Maximum divergence between outcomes in distributions from facets
- Kolmogorov-Smirnov (KS)
- Disparity of outcomes between facets as a whole, and by subgroups
- Conditional Demographic Disparity (CDD)
- Integrated into AWS Deep Learning Containers (DLCs)
- Can't bring your own container
- Compile & optimize training jobs on GPU instances
- Can accelerate training up to 50%
- Converts models into hardware-optimized instructions
- Tested with Hugging Face transformers library, or bring your own model
SageMaker Training Compiler
What AI Service:
- Natural Language Processing and Text Analytics
- Input social media, emails, web pages, documents, transcripts, medical records (Comprehend Medical)
- Extract key phrases, entities, sentiment, language, syntax, topics, and document classifications
- Events detection
- PII Identification & Redaction
- Targeted sentiment (for specific entities)
- Can train on your own data
Amazon comprehend
What AI Service:
- Uses deep learning for translation
- Supports custom terminology
- In CSV or TMX format
- Appropriate for proper names, brand names, etc.
Amazon Translate
What AI service:
- Speech to text
- Input in FLAC, MP3, MP4, or WAV, in a specified language
- Streaming audio supported (HTTP/2 or WebSocket)
- French, English, Spanish only
- Speaker Identification
- Specify number of speakers
- Channel Identification
- i.e., two callers could be transcribed separately
- Merging based on timing of “utterances”
- Automatic Language Identification
- You don’t have to specify a language; it can detect the dominant one spoken.
- Custom Vocabularies
- Vocabulary Lists (just a list of special words – names, acronyms)
- Vocabulary Tables (can include “SoundsLike”, “IPA”, and “DisplayAs”)
Amazon Transcribe
What AI Service:
- Neural Text-To-Speech, many voices & languages
- Lexicons
  - Customize pronunciation of specific words & phrases
  - Example: "World Wide Web Consortium" instead of "W3C"
- SSML
  - Alternative to plain text
  - Speech Synthesis Markup Language
  - Gives control over emphasis, pronunciation, breathing, whispering, speech rate, pitch, pauses
- Speech Marks
  - Can encode when a sentence / word starts and ends in the audio stream
  - Useful for lip-synching animation
Amazon Polly
What AI Service:
- Computer vision
- Object and scene detection
- Can use your own face collection
- Image moderation
- Facial analysis
- Celebrity recognition
- Face comparison
- Text in image
- Video analysis
- Objects / people / celebrities marked on timeline
- People Pathing
- Image and video libraries
Rekognition
What AI Service:
- Fully-managed service to deliver highly accurate forecasts with ML
- "AutoML" chooses best model for your time series data
  - ARIMA, DeepAR, ETS, NPTS, CNN-QR, Prophet
- Works with any time series
  - Price, promotions, economic performance, etc.
  - Can combine with associated data to find relationships
- Inventory planning, financial planning, resource planning
- Based on "dataset groups," "predictors," and "forecasts"
Amazon Forecast
What AI Tool:
- Billed as the inner workings of Alexa
- Natural-language chatbot engine
- A Bot is built around Intents
- Utterances invoke intents (“I want to order a pizza”)
- Lambda functions are invoked to fulfill the intent
- Slots specify extra information needed by the intent
- Pizza size, toppings, crust type, when to deliver, etc.
- Can deploy to AWS Mobile SDK, Facebook Messenger, Slack, and Twilio
Amazon Lex
What AI Service:
- Fully-managed recommender engine
- Same one Amazon uses
- API access
- Feed in data (purchases, ratings, impressions, cart adds, catalog, user demographics etc.) via S3 or API integration
- You provide an explicit schema in Avro format
- Javascript or SDK
- GetRecommendations
- Recommended products, content, etc.
- Similar items
- GetPersonalizedRanking
- Rank a list of items provided
- Allows editorial control / curation
Amazon Personalize
What AI Service:
* Equipment, metrics, vision
* Detects abnormalities from sensor data automatically to detect equipment issues
* Monitors metrics from S3, RDS, Redshift, 3rd party SaaS apps
* Vision uses computer vision to detect defects in silicon wafers, circuit boards, etc.
Amazon Lookout
What AI Service:
- End to end system for monitoring industrial equipment & predictive maintenance
Amazon Monitron
What AI Service:
- Computer Vision at the edge
- Brings computer vision to your existing IP cameras
AWS Panorama
What AI Tool:
- Upload your own historical fraud data
- Builds custom models from a template you choose
- Exposes an API for your online
application
Amazon Fraud Detector
What AI Service:
- Automated code reviews!
- Finds lines of code that hurt performance
- Resource leaks, race conditions
- Fix security vulnerabilities
Codeguru
What AI Service:
- For customer support call centers
- Ingests audio data from recorded calls
- Allows search on calls / chats
- Sentiment analysis
- Find “utterances” that correlate with successful calls
- Categorize calls automatically
- Measure talk speed and interruptions
- Theme detection: discovers
emerging issues
Contact Lens for Amazon Connect
What AI Service:
- Enterprise search with natural language
- For example, “Where is the IT support desk?” “How do I connect to my VPN?”
- Combines data from file systems, SharePoint, intranet, sharing services (JDBC, S3) into one searchable repository
- ML-powered (of course) – uses thumbs up / down feedback
- Relevance tuning – boost strength of document freshness, view counts, etc.
Amazon Kendra
What AI Service:
- Human review of ML predictions
- Builds workflows for reviewing low-confidence predictions
- Access the Mechanical Turk workforce or vendors
- Integrated into Amazon Textract and Rekognition
- Integrates with SageMaker
- Very similar to Ground Truth
Amazon Augmented AI (A2I)
- All models in SageMaker are hosted in ________
Docker containers
- Docker containers are
created from ______
images
- Images are built from a
_______
Dockerfile
- Images are saved in a
________
repository
- Train once, run anywhere
- Edge devices
- ARM, Intel, Nvidia processors
- Embedded in whatever – your car?
- Optimizes code for specific devices
  - Tensorflow, MXNet, PyTorch, ONNX, XGBoost, DarkNet, Keras
- Consists of a compiler and a runtime
Sagemaker Neo