AWS ML Course Flashcards
S3
Amazon S3 lets you store objects (files) in "buckets" (directories)
Centralized object storage for your data
Buckets must have a globally unique name
Objects (files) have a Key. The key is the FULL path:
• /my_file.txt
• /my_folder1/another_folder/my_file.txt
This will be interesting when we look at partitioning
Max object size is 5TB
Object Tags (key / value pair – up to 10) – useful for security / lifecycle
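A minimal boto3 sketch of the bucket / key / tag ideas above; the bucket name, key, and tag values are hypothetical:

import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-globally-unique-bucket",           # bucket names must be globally unique
    Key="my_folder1/another_folder/my_file.txt",  # the key is the full path
    Body=b"hello",
    Tagging="project=ml&stage=raw",               # object tags: key/value pairs
)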
Kinesis Firehose
Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as:
- Amazon Simple Storage Service (Amazon S3),
- Amazon Redshift,
- Amazon OpenSearch Service,
- Splunk,
Used for INGESTION
Fully managed ingest, transform, load solution with no code required
Stores data into target destinations
Applications either send data directly to Kinesis Firehose (see the sketch below), or Kinesis Firehose reads data from Kinesis Data Streams, Amazon CloudWatch, or AWS IoT
Most common pattern: Firehose reading from Kinesis Data Streams
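A minimal boto3 sketch of an application putting a record directly into Firehose; the delivery stream name and payload are made up:

import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers the record and delivers it to the configured destination (S3, Redshift, ...)
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": (json.dumps({"sensor_id": 42, "temp": 21.5}) + "\n").encode("utf-8")},
)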
Kinesis Streams
Real-time; data retention from 1 to 365 days
low latency streaming ingest at scale
Provisioned mode:
• You choose the number of shards provisioned, scale manually or using API
• Each shard gets 1MB/s in (or 1000 records per second)
• Each shard gets 2MB/s out (classic or enhanced fan-out consumer)
• You pay per shard provisioned per hour
Redshift
Data warehousing technology for analytics; clusters need to be provisioned in advance
Data Warehousing, SQL analytics (OLAP - Online analytical processing)
data organized in columns
- Load data from S3 to Redshift
- Use Redshift Spectrum to query data directly in S3 (no loading)
RDS, Aurora
Online transactional processing (OLTP) relational store, store data at the row level
Not typically used as an ML data source (row-oriented OLTP)
AWS Database Migration Service (DMS)
Continuous data migration (1-to-1 CDC replication, no ETL)
Amazon S3: Object Storage for your data
VPC Endpoint Gateway: Privately access your S3 bucket without going through the public internet
Kinesis Data Streams: real-time data streams, need capacity planning, real-time applications
Kinesis Data Firehose: near real-time data ingestion to S3, Redshift, ElasticSearch, Splunk
Kinesis Data Analytics: SQL transformations on streaming data
Kinesis Video Streams: real-time video feeds
Glue Data Catalog & Crawlers: Metadata repositories for schemas and datasets in your account
Glue ETL: ETL Jobs as Spark programs, run on a serverless Spark Cluster
DynamoDB: NoSQL store
Redshift: Data Warehousing for OLAP, SQL language
Redshift Spectrum: Redshift on data in S3 (without the need to load it first in Redshift)
RDS / Aurora: Relational Data Store for OLTP, SQL language
ElasticSearch: index for your data, search capability, clickstream analytics
ElastiCache: data cache technology
Data Pipelines: Orchestration of ETL jobs between RDS, DynamoDB, S3. Runs on EC2 instances
Batch: batch jobs run as Docker containers - not just for data, manages EC2 instances for you
DMS: Database Migration Service, 1-to-1 CDC replication, no ETL
Step Functions: Orchestration of workflows, audit, retry mechanisms
Briefly mentioned, covered by Frank Kane:
EMR: Managed Hadoop Clusters
Quicksight: Visualization Tool
Rekognition: ML Service
SageMaker: ML Service
DeepLens: camera by Amazon
Athena: Serverless Query of your data
Amazon EMR
Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.
Managed Hadoop framework on EC2 instances
Includes Spark, HBase, Presto, Flink, Hive & more
EMR Notebooks
Several integration points with AWS
Spot instance
Good choice for task nodes
• Only use on core & master if you’re testing or very cost-sensitive; you’re risking partial data loss
Feature Engineering
Applying your knowledge of the data – and the model you’re using - to create better features to train your model with.
SMOTE
Synthetic Minority Over-sampling TEchnique
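A hedged sketch using the imbalanced-learn library (not covered in the course itself) on a made-up imbalanced dataset:

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy dataset: roughly 90% negative, 10% positive
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between nearest neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))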
Variance
Variance (σ²) is simply the average of the squared differences from the mean
Standard Deviation
Standard Deviation σ is just the square root of the variance.
Example: σ² = 5.04
σ = √5.04 ≈ 2.24
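A quick numpy check of the relationship above, using made-up sample values that happen to reproduce the σ² = 5.04 example:

import numpy as np

data = np.array([1, 4, 5, 4, 8])
variance = np.mean((data - data.mean()) ** 2)  # average squared difference from the mean (np.var)
std_dev = np.sqrt(variance)                    # square root of the variance (np.std)
print(variance, std_dev)                       # 5.04, ~2.24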
AWS’s Random Cut Forest algorithm
Remember AWS’s Random Cut Forest algorithm creeps into many of its services – it is made for outlier detection
• Found within QuickSight, Kinesis Analytics, SageMaker, and more
S3
Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.
Binning
Bucket observations together based on ranges of values.
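A minimal pandas sketch of binning; the ages and bin edges are made up:

import pandas as pd

# Bucket a numeric feature into ranges, turning it into a categorical feature
ages = pd.Series([3, 17, 25, 42, 61, 78])
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100], labels=["child", "young", "middle", "senior"])
print(age_bins)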
One-hot encoding
- Create “buckets” for every category
- The bucket for your category has a 1, all others have a 0
- Very common in deep learning, where categories are represented by individual output “neurons”
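A minimal pandas sketch of one-hot encoding; the "color" column is made up:

import pandas as pd

# Each category gets its own 0/1 "bucket" column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
one_hot = pd.get_dummies(df, columns=["color"])
print(one_hot)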
TF IDF
Term Frequency and Inverse Document Frequency
• Important data for search – figures out what terms are most relevant for a document (see the sketch after the DF card below)
Comprehend
• AWS service for text analysis and topic modeling
• Automatically classify text by topics, sentiment
TF
Term Frequency just measures how often a word occurs in a document
DF
Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page
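A hedged scikit-learn sketch tying TF and DF together; the toy documents are made up:

from sklearn.feature_extraction.text import TfidfVectorizer

# A term scores high for a document when it is frequent there (TF) but rare across documents (low DF)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats and dogs",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))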
Softmax function
The softmax function, also known as softargmax or the normalized exponential function, is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression and is often used as the last activation function of a neural network, to normalize the output of the network to a probability distribution over predicted output classes, based on Luce's choice axiom.
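A minimal numpy sketch of the idea; the example logits are made up:

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, then normalize to sum to 1
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # raw network outputs (logits) for 3 classes
print(softmax(scores))              # probabilities over the 3 classes, summing to 1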
ReLu
Rectified Linear Unit
Other ReLU variants
- Maxout
- Outputs the max of the inputs
- Technically ReLU is a special case of maxout
- But doubles parameters that need to be trained, not often practical.
SoftMax
Used on the final output layer of a multiple classification problem
• Basically converts outputs to probabilities of each classification
• Can’t produce more than one label for something (sigmoid can)
• Don’t worry about the actual function for the exam, just know what it’s used for.
CNN: what is it used for?
When you have data that doesn’t neatly align into columns
• Images that you want to find features within
• Machine translation
• Sentence classification
• Sentiment analysis
• They can find features that aren’t in a specific spot
• Like a stop sign in a picture
• Or words within a sentence
• They are “feature-location invariant”
Multilayer Perceptron MLP
A multilayer perceptron is a fully connected class of feedforward artificial neural network. The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons.
CNN’s are hard
Very resource-intensive (CPU, GPU, and RAM)
• Lots of hyperparameters
• Kernel sizes, many layers with different numbers of units, amount of pooling… in addition to the usual stuff like number of layers, choice of optimizer
• Getting the training data is often the hardest part! (As well as storing and accessing it)
LeNet-5
• Good for handwriting recognition
AlexNet
• Image classification, deeper than LeNet
GoogLeNet
- Even deeper, but with better performance
- Introduces inception modules (groups of convolution layers)
ResNet (Residual Network)
- Even deeper – maintains performance via skip connections.
RNN’s: what are they for?
Time-series data
• When you want to predict future behavior based on past behavior
• Web logs, sensor logs, stock trades
• Where to drive your self-driving car based on past trajectories
• Data that consists of sequences of arbitrary length
• Machine translation
• Image captions
• Machine-generated music
RNN topologies
Sequence to sequence
• i.e., predict stock prices based on series of historical data
Sequence to vector
• i.e., words in a sentence to sentiment
Vector to sequence
• i.e., create captions from an image
Encoder -> Decoder
• Sequence -> vector -> sequence
• i.e., machine translation
LSTM Cell
• Long Short-Term Memory Cell
• Maintains separate short-term and long-term states
• Used within RNNs to handle sequences over time
GRU Cell
• Gated Recurrent Unit
• Simplified LSTM Cell that performs about as well
Learning Rate
• Neural networks are trained by gradient descent (or similar means)
• We start at some random point, and sample different solutions (weights) seeking to minimize some cost function, over many epochs
• How far apart these samples are is the learning rate
Effect of learning rate and batch size
• Small batch sizes tend to not get stuck in local minima
• Large batch sizes can converge on the wrong solution at random
• Large learning rates can overshoot the correct solution
• Small learning rates increase training time
Learning
Gradient descent tries to find the lowest point (the minimum of the cost function) on the graph
What is regularization?
Regularization techniques are intended to prevent overfitting.
L1 / L2 regularization
• L1 term is the sum of the weights: λ Σ|wᵢ|, summing over i = 1..k
• L2 term is the sum of the square of the weights: λ Σwᵢ², summing over i = 1..k
• Same idea can be applied to loss functions
L1, L2: What's the difference?
L1: sum of weights
• Performs feature selection – entire features go to 0
• Computationally inefficient
• Sparse output
L2: sum of square of weights
• All features remain considered, just weighted
• Computationally efficient
• Dense output
Recall or True Positive Rate
AKA Sensitivity, True Positive rate, Completeness
- Percent of positives rightly predicted
- Good choice of metric when you care a lot about false negatives
- i.e., fraud detection
Of all the values that are actually positive, how many are correctly identified as positive?
F1
Harmonic mean of precision and sensitivity
When you care about precision AND recall
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
Precision or PPV (positive predicted value)
- AKA Correct Positives
- Percent of relevant results
- Good choice of metric when you care a lot about false positives
- i.e., medical screening, drug testing
Of all the values that are predicted positive, how many are actually positive? (see the sketch below)
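A minimal scikit-learn sketch relating precision, recall, and F1; the labels are made up:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # actual labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # predicted labels: 3 TP, 1 FP, 1 FN

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # 2 * (P * R) / (P + R)
print(precision, recall, f1)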
Ensemble Method
Use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
Bagging vs. Boosting
- XGBoost is the latest hotness
- Boosting generally yields better accuracy
- But bagging avoids overfitting
- Bagging is easier to parallelize
- So, depends on your goal
Bagging
Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. After several data samples are generated, these weak models are then trained independently, and depending on the type of task—regression or classification, for example—the average or majority of those predictions yield a more accurate estimate.
As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging and feature randomness to create an uncorrelated forest of decision trees.
Boosting
Observations are weighted
- Some will take part in new training sets more often
- Training is sequential; each classifier takes into account the previous one’s success.
Sagemaker
SageMaker is built to handle the entire machine learning workflow.
SageMaker Notebooks and the SageMaker Console can direct the process
File Mode
Copies all your training data over from S3 all at once before training starts
Pipe Mode
Streams the data in from S3 as needed; if copying from S3 is taking too long, use Pipe mode (see the sketch below)
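A hedged sketch of requesting Pipe mode with the SageMaker Python SDK (v2 assumed); the S3 path is a placeholder:

from sagemaker.inputs import TrainingInput

# Pipe mode streams records to the algorithm instead of downloading the whole dataset first
train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",
    content_type="text/csv",
    input_mode="Pipe",
)

The resulting TrainingInput is then passed as a channel to an estimator's fit() call.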
Data prep on SageMaker
Data usually comes from S3
- Ideal format varies with algorithm – often it is RecordIO / Protobuf for pre-built models
Can also ingest from Athena, EMR, Redshift, and Amazon Keyspaces
Apache Spark integrates with SageMaker
scikit-learn, numpy, and pandas are all at your disposal within a notebook
Training on SageMaker
Create a training job, specifying (see the sketch below):
• URL of S3 bucket with training data
• ML compute resources
• URL of S3 bucket for output
• ECR path to training code
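A hedged sketch of those four inputs using the SageMaker Python SDK's generic Estimator; every image URI, role ARN, and S3 path below is a hypothetical placeholder:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # ECR path to training code
    role="arn:aws:iam::123456789012:role/MySageMakerRole",                              # execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",                  # ML compute resources
    output_path="s3://my-bucket/output/",          # S3 bucket for output
)
estimator.fit({"train": "s3://my-bucket/train/"})  # S3 bucket with training data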
Training options
- Built-in training algorithms
- Spark MLLib
- Custom Python Tensorflow / MXNet code
- Your own Docker image
- Algorithm purchased from AWS marketplace
Linear Learner: What’s it for?
Linear regression
- Fit a line to your training data
- Predictions based on that line
Can handle both regression (numeric) predictions and classification predictions
- For classification, a linear threshold function is used.
- Can do binary or multi-class
not a neural network, but can work on handwriting recognition
Linear Learner: What training input does it expect?
- RecordIO-wrapped protobuf (Float32 data only!)
- CSV
- First column assumed to be the label
- File or Pipe mode both supported
Linear Learner: How is it used?
Preprocessing
- Training data must be normalized (so all features are weighted the same)
- Linear Learner can do this for you automatically
- Input data should be shuffled
Training
- Uses stochastic gradient descent (SGD)
- Choose an optimization algorithm (Adam, AdaGrad, SGD, etc)
- Multiple models are optimized in parallel
- Tune L1 (feature selection), L2 regularization (feature weighting)
Validation
- Most optimal model is selected
XGBoost
eXtreme Gradient Boosting
- Boosted group of decision trees
- New trees made to correct the errors of previous trees
- Uses gradient descent to minimize loss as new trees are added
It’s been winning a lot of Kaggle competitions
- And it’s fast, too
Can be used for classification
And also for regression
• Using regression trees
XGBoost: What training input does it expect?
- XGBoost is weird, since it’s not made for SageMaker. It’s just open source XGBoost
- So, it takes CSV or libsvm input.
- AWS recently extended it to accept recordIO-protobuf and Parquet as well.
XGBoost: How is it used?
Models are serialized/deserialized with Pickle
Can use as a framework within notebooks
• sagemaker.xgboost
Or as a built-in SageMaker algorithm
Pickle is a useful Python tool that allows you to save your models, to minimise lengthy re-training and allow you to share, commit, and re-load pre-trained machine learning models. Pickle is a generic object serialization module that can be used for serializing and deserializing objects.
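A minimal sketch of pickling a trained open-source XGBoost model; the toy data and file name are made up:

import pickle

import xgboost as xgb
from sklearn.datasets import make_classification

# Train a small model on made-up data
X, y = make_classification(n_samples=200, random_state=0)
model = xgb.XGBClassifier(n_estimators=10)
model.fit(X, y)

# Serialize the model, then load it back later without re-training
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict(X[:5]))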
XGBoost: Important Hyperparameters
There are a lot of them. A few:
- Subsample: prevents overfitting
- Eta: step size shrinkage, prevents overfitting
- Gamma: minimum loss reduction to create a partition; larger = more conservative
- Alpha: L1 regularization term; larger = more conservative
- Lambda: L2 regularization term; larger = more conservative
XGBoost: Important Hyperparameters
eval_metric: lets you set the metric you are optimizing on (see the sketch after this card)
- Optimize on AUC, error, rmse…
- For example, if you care about false positives more than accuracy, you might use AUC here
scale_pos_weight
- Adjusts balance of positive and negative weights
- Helpful for unbalanced classes
- Might set to sum(negative cases) / sum(positive cases)
max_depth
- Max depth of the tree
- Too high and you may overfit
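A hedged sketch of how the hyperparameters from the two cards above appear in the open-source XGBoost scikit-learn API (a recent xgboost version is assumed; all values are arbitrary). SageMaker's built-in XGBoost takes these same names (eta, gamma, alpha, lambda, subsample, scale_pos_weight, eval_metric, max_depth) as training-job hyperparameters:

import xgboost as xgb

model = xgb.XGBClassifier(
    max_depth=5,          # deeper trees fit more, but too high and you may overfit
    learning_rate=0.2,    # "eta": step size shrinkage
    subsample=0.8,        # fraction of rows sampled per tree; helps prevent overfitting
    gamma=1.0,            # minimum loss reduction to create a partition
    reg_alpha=0.1,        # "alpha": L1 regularization term
    reg_lambda=1.0,       # "lambda": L2 regularization term
    scale_pos_weight=10,  # rebalances positive vs. negative classes
    eval_metric="auc",    # metric to optimize on
)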