AWS ML Course Flashcards

1
Q

S3

A

Centralized object storage: Amazon S3 stores objects (files) in "buckets" (directories)

Buckets must have a globally unique name

Objects (files) have a Key. The key is the FULL path:
  • /my_file.txt
  • /my_folder1/another_folder/my_file.txt

This becomes important when we look at partitioning

Max object size is 5 TB

Object tags (key/value pairs – up to 10) – useful for security / lifecycle
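A minimal sketch of uploading an object with boto3, illustrating that the key is the full path (the bucket name and body are assumptions):

import boto3

s3 = boto3.client("s3")

# The object key is the FULL path within the bucket.
s3.put_object(
    Bucket="my-globally-unique-bucket",   # hypothetical bucket name
    Key="my_folder1/another_folder/my_file.txt",
    Body=b"hello",
)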

2
Q

Kinesis Firehose

A

Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as:

  • Amazon Simple Storage Service (Amazon S3)
  • Amazon Redshift
  • Amazon OpenSearch Service
  • Splunk

INGESTION

Fully managed ingest, transform, and load solution with no code required

Stores data into target destinations

Applications either send data directly to Kinesis Firehose, or Firehose reads data from Kinesis Data Streams, Amazon CloudWatch, or AWS IoT

The most common pattern is Firehose reading from Kinesis Data Streams
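A minimal sketch of an application sending a record directly to Firehose with boto3 (the delivery stream name and payload are assumptions):

import boto3

firehose = boto3.client("firehose")

# Records are buffered, then delivered to the configured destination (e.g. S3).
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",   # hypothetical stream name
    Record={"Data": b'{"event": "click", "user": 42}\n'},
)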

3
Q

Kinesis Streams

A

Real time; data storage for 1 to 365 days

Low-latency streaming ingest at scale

Provisioned mode:
  • You choose the number of shards provisioned; scale manually or using the API
  • Each shard gets 1 MB/s in (or 1000 records per second)
  • Each shard gets 2 MB/s out (classic or enhanced fan-out consumer)
  • You pay per shard provisioned per hour

4
Q

Redshift

A

Data warehousing technology; needs to be provisioned in advance; performs data warehousing analytics

Data warehousing, SQL analytics (OLAP – Online Analytical Processing)

Data is organized in columns (columnar storage)

  • Load data from S3 into Redshift
  • Use Redshift Spectrum to query data directly in S3 (no loading)
5
Q

RDS, Aurora

A

Online Transaction Processing (OLTP) relational stores; data is stored at the row level

Not suited for analytics or machine learning workloads

6
Q

AWS Database Migration Service (DMS)

A

Continuous data migration (replication) from a source database to a target

7
Q

Summary: AWS data engineering services – what is each one for?

A

Amazon S3: Object Storage for your data

VPC Endpoint Gateway: Privately access your S3 bucket without going through the public internet

Kinesis Data Streams: real-time data streams, need capacity planning, real-time applications

Kinesis Data Firehose: near real-time data ingestion to S3, Redshift, ElasticSearch, Splunk

Kinesis Data Analytics: SQL transformations on streaming data

Kinesis Video Streams: real-time video feeds

Glue Data Catalog & Crawlers: Metadata repositories for schemas and datasets in your account

Glue ETL: ETL Jobs as Spark programs, run on a serverless Spark Cluster

DynamoDB: NoSQL store

Redshift: Data Warehousing for OLAP, SQL language

Redshift Spectrum: Redshift on data in S3 (without the need to load it first in Redshift)

RDS / Aurora: Relational Data Store for OLTP, SQL language

ElasticSearch: index for your data, search capability, clickstream analytics

ElastiCache: data cache technology

Data Pipelines: Orchestration of ETL jobs between RDS, DynamoDB, S3. Runs on EC2 instances

Batch: batch jobs run as Docker containers - not just for data, manages EC2 instances for you

DMS: Database Migration Service, 1-to-1 CDC replication, no ETL

Step Functions: Orchestration of workflows, audit, retry mechanisms

Briefly mentioned, covered by Frank Kane:

EMR: Managed Hadoop Clusters

Quicksight: Visualization Tool

Rekognition: ML Service

SageMaker: ML Service

DeepLens: camera by Amazon

Athena: Serverless Query of your data

8
Q

Amazon EMR

A

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.

Managed Hadoop framework on EC2 instances

Includes Spark, HBase, Presto, Flink, Hive & more

EMR Notebooks

Several integration points with AWS

9
Q

Spot instance

A

Good choice for task nodes
• Only use on core & master if you’re testing or very cost-sensitive; you’re risking partial data loss

10
Q

Feature Engineering

A

Applying your knowledge of the data – and the model you’re using - to create better features to train your model with.

11
Q

SMOTE

A

Synthetic Minority Over-sampling TEchnique: generates new synthetic samples of the minority class (by interpolating between existing minority samples) to balance a dataset
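A minimal sketch using the imbalanced-learn package (the toy data shapes and class counts are assumptions):

import numpy as np
from imblearn.over_sampling import SMOTE

# Toy imbalanced data: 95 negative samples, 5 positive.
X = np.random.rand(100, 3)
y = np.array([0] * 95 + [1] * 5)

# SMOTE synthesizes new minority-class points by interpolating between neighbors.
X_res, y_res = SMOTE(k_neighbors=4).fit_resample(X, y)
print(np.bincount(y_res))  # classes are now balanced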

12
Q

Variance

A

Variance (σ²) is simply the average of the squared differences from the mean

13
Q

Standard Deviation

A

Standard deviation σ is just the square root of the variance.
Example: σ² = 5.04 → σ = √5.04 ≈ 2.24
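A quick numeric check of both definitions; these sample values (an assumption) happen to reproduce the σ² = 5.04 example above:

import numpy as np

x = np.array([1, 4, 5, 4, 8])
variance = ((x - x.mean()) ** 2).mean()   # average squared difference from the mean: 5.04
sigma = variance ** 0.5                   # square root of the variance: ~2.24
print(variance, sigma, x.std())           # x.std() matches sigma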

14
Q

AWS’s Random Cut Forest algorithm

A

Remember AWS’s Random Cut Forest algorithm creeps into many of its services – it is made for outlier detection
• Found within QuickSight, Kinesis Analytics, SageMaker, and more

15
Q

S3

A

Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance.

16
Q

Binning

A

Bucket observations together based on ranges of values.
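A minimal pandas sketch of binning (the values, bin edges, and labels are assumptions):

import pandas as pd

ages = pd.Series([5, 22, 31, 47, 68])
# Bucket the observations into ranges of values.
buckets = pd.cut(ages, bins=[0, 18, 35, 65, 120],
                 labels=["child", "young adult", "middle age", "senior"])
print(buckets)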

17
Q

One-hot encoding

A
  • Create “buckets” for every category
  • The bucket for your category has a 1, all others have a 0
  • Very common in deep learning, where categories are represented by individual output “neurons”
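A minimal pandas sketch (the category column is an assumption):

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
# One "bucket" (column) per category; the matching bucket gets a 1, all others 0.
print(pd.get_dummies(df, columns=["color"]))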
18
Q

TF IDF

A

Term Frequency and Inverse Document Frequency
  • Important data for search – figures out what terms are most relevant for a document
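A minimal scikit-learn sketch (the toy documents are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog barks", "the cat meows", "the dog and the cat"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)        # rows = documents, columns = terms
print(vec.get_feature_names_out())
print(tfidf.toarray())                 # higher score = more relevant term for that document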

19
Q

Comprehend

A

  • AWS service for text analysis and topic modeling
  • Automatically classify text by topics, sentiment

20
Q

TF

A

Term Frequency just measures how often a word occurs in a document

21
Q

DF

A

Document Frequency is how often a word occurs in an entire set of documents, i.e., all of Wikipedia or every web page

22
Q

Softmax function

A

The softmax function, also known as softargmax or the normalized exponential function, is a generalization of the logistic function to multiple dimensions. It is used in multinomial logistic regression, and is often used as the last activation function of a neural network to normalize the output to a probability distribution over the predicted output classes, based on Luce's choice axiom.
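A minimal numpy sketch of the function itself (the example logits are an assumption):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1: a probability distribution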

23
Q

ReLU

A

Rectified Linear Unit

24
Q

Other ReLU variants

A
  • Maxout
  • Outputs the max of the inputs
  • Technically ReLU is a special case of maxout
  • But doubles parameters that need to be trained, not often practical.
25
Q

SoftMax

A

Used on the final output layer of a multiple classification problem
• Basically converts outputs to probabilities of each classification
• Can’t produce more than one label for something (sigmoid can)
• Don’t worry about the actual function for the exam, just know what it’s used for.

26
Q

CNN: what is it used for?

A

When you have data that doesn’t neatly align into columns
• Images that you want to find features within
• Machine translation
• Sentence classification
• Sentiment analysis
• They can find features that aren’t in a specific spot
• Like a stop sign in a picture
• Or words within a sentence
• They are “feature-location invariant”

27
Q

Multilayer Perceptron MLP

A

A multilayer perceptron is a fully connected class of feedforward artificial neural network. The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons;

28
Q

CNN’s are hard

A

Very resource-intensive (CPU, GPU, and RAM)
• Lots of hyperparameters
• Kernel sizes, many layers with different numbers of units, amount of pooling… in addition to the usual stuff like number of layers, choice of optimizer
• Getting the training data is often the hardest part! (As well as storing and accessing it)

29
Q

LeNet-5

A

• Good for handwriting recognition

30
Q

AlexNet

A

• Image classification, deeper than LeNet

31
Q

GoogLeNet

A
  • Even deeper, but with better performance
  • Introduces inception modules (groups of convolution layers)
32
Q

• ResNet (Residual Network)

A

• Even deeper – maintains performance via skip connections.

33
Q

RNN’s: what are they for?

A

Time-series data
• When you want to predict future behavior based on past behavior
• Web logs, sensor logs, stock trades
• Where to drive your self-driving car based on past trajectories
• Data that consists of sequences of arbitrary length
• Machine translation
• Image captions
• Machine-generated music

34
Q

RNN topologies

A

Sequence to sequence
  • i.e., predict stock prices based on a series of historical data
Sequence to vector
  • i.e., words in a sentence to sentiment
Vector to sequence
  • i.e., create captions from an image
Encoder -> Decoder
  • Sequence -> vector -> sequence
  • i.e., machine translation

35
Q

LSTM Cell

A

LSTM Cell
• Long Short-Term Memory Cell
• Maintains separate short-term and long-term states

Part of RNNs, dealing with sequences in time

36
Q

GRU Cell

A

GRU Cell
• Gated Recurrent Unit
• Simplified LSTM Cell that performs about as well

37
Q

Learning Rate

A

Learning Rate
• Neural networks are trained by gradient descent (or similar means)
• We start at some random point, and sample different solutions (weights) seeking to minimize some cost function, over many epochs
• How far apart these samples are is the learning rate

38
Q

Effect of learning rate

A

Small batch sizes tend to not get stuck in local minima
• Large batch sizes can converge on the wrong solution at random
• Large learning rates can overshoot the correct solution
• Small learning rates increase training time

39
Q

learning

A

Training tries to find the lowest point on the cost-function graph (gradient descent)

40
Q

What is regularization?

A

Regularization techniques are intended to prevent overfitting.

41
Q

L1L2 regularization

A

L1 term is the sum of the (absolute values of the) weights: λ Σᵢ |wᵢ|

L2 term is the sum of the squares of the weights: λ Σᵢ wᵢ²

Same idea can be applied to loss functions
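A minimal numpy sketch of the two penalty terms (the weights and λ are assumptions):

import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # hypothetical weight vector
lam = 0.01                            # λ, the regularization strength
l1_term = lam * np.abs(w).sum()       # λ Σ |wᵢ|
l2_term = lam * (w ** 2).sum()        # λ Σ wᵢ²
print(l1_term, l2_term)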

42
Q

L1, L2 What is the difference?

A

L1: sum of weights
  • Performs feature selection – entire features go to 0
  • Computationally inefficient
  • Sparse output

L2: sum of square of weights
  • All features remain considered, just weighted
  • Computationally efficient
  • Dense output

43
Q

Recall or True Positive Rate

A


AKA Sensitivity, True Positive Rate, Completeness

Recall = TP / (TP + FN)

  • Percent of actual positives rightly predicted
  • Good choice of metric when you care a lot about false negatives
  • i.e., fraud detection

Of all the values that are actually positive, how many are correctly identified as positive?

44
Q

F1

A


Harmonic mean of precision and sensitivity

When you care about precision AND recall

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
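A worked example of the formula (the confusion-matrix counts are assumptions):

tp, fp, fn = 80, 20, 40                                  # hypothetical counts
precision = tp / (tp + fp)                               # 0.8
recall = tp / (tp + fn)                                  # ~0.667
f1 = 2 * (precision * recall) / (precision + recall)     # ~0.727
print(f1)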

45
Q

Precision or PPV (positive predicted value)

A
  • AKA Correct Positives
  • Precision = TP / (TP + FP)
  • Percent of relevant results
  • Good choice of metric when you care a lot about false positives
  • i.e., medical screening, drug testing

Of all the values that are predicted positive, how many are actually positive?

46
Q

Ensemble Method

A

Use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

48
Q

Bagging vs Boosting

A

  • XGBoost is the latest hotness
  • Boosting generally yields better accuracy
  • But bagging avoids overfitting
  • Bagging is easier to parallelize
  • So, it depends on your goal
49
Q

Bagging

A

Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. After several data samples are generated, these weak models are then trained independently, and depending on the type of task—regression or classification, for example—the average or majority of those predictions yield a more accurate estimate.

As a note, the random forest algorithm is considered an extension of the bagging method, using both bagging and feature randomness to create an uncorrelated forest of decision trees.

50
Q

Boosting

A

Observations are weighted

  • Some will take part in new training sets more often
  • Training is sequential; each classifier takes into account the previous one’s success.
51
Q

Sagemaker

A

SageMaker is built to handle the entire machine learning workflow.

SageMaker Notebooks and the SageMaker console can direct the process.

52
Q

File Mode

A

File mode copies all your data over as a single file, all at once.

Pipe mode streams it in as needed.

If copying from S3 is taking too long, use Pipe mode.

53
Q

Data prep on SageMaker

A

Data usually comes from S3

  • Ideal format varies with algorithm – often it is RecordIO / Protobuf for pre-built models

Can also ingest from Athena, EMR, Redshift, and Amazon Keyspaces

Apache Spark integrates with SageMaker

scikit-learn, numpy, and pandas are all at your disposal within a notebook

54
Q

Training on SageMaker

A

Create a training job
  • URL of S3 bucket with training data
  • ML compute resources
  • URL of S3 bucket for output
  • ECR path to training code

Training options
  • Built-in training algorithms
  • Spark MLLib
  • Custom Python Tensorflow / MXNet code
  • Your own Docker image
  • Algorithm purchased from AWS Marketplace
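A hedged sketch of creating a training job with boto3; the job name, image URI, role ARN, bucket paths, and instance settings are all placeholder assumptions:

import boto3

sm = boto3.client("sagemaker")

sm.create_training_job(
    TrainingJobName="my-training-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # ECR path
        "TrainingInputMode": "File",  # or "Pipe"
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",   # URL of S3 bucket with training data
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},  # URL of S3 bucket for output
    ResourceConfig={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 10},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)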
55
Q

Linear Learner: What’s it for?

A

Linear regression

  • Fit a line to your training data
  • Predictions based on that line

Can handle both regression (numeric) predictions and classification predictions

  • For classification, a linear threshold function is used.
  • Can do binary or multi-class

Not a neural network, but it can still work for things like handwriting recognition

56
Q

Linear Learner: What training input does

it expect?

A
  • RecordIO-wrapped protobuf
    • Float32 data only!
  • CSV
    • First column assumed to be the label
  • File or Pipe mode both supported
57
Q

Linear Learner: How is it used?

A

Preprocessing

  • Training data must be normalized (so all features are weighted the same)
  • Linear Learner can do this for you automatically
  • Input data should be shuffled

Training

  • Uses stochastic gradient descent (SGD)
  • Choose an optimization algorithm (Adam, AdaGrad, SGD, etc)
  • Multiple models are optimized in parallel
  • Tune L1 (feature selection), L2 regularization (feature weighting)

Validation

  • Most optimal model is selected
58
Q

XG Boost

A

eXtreme Gradient Boosting

  • Boosted group of decision trees
  • New trees made to correct the errors of previous trees
  • Uses gradient descent to minimize loss as new trees are added

It’s been winning a lot of Kaggle competitions

  • And it’s fast, too

Can be used for classification

And also for regression
  • Using regression trees

59
Q

XGBoost: What training input does it expect?

A
  • XGBoost is weird, since it’s not made for SageMaker. It’s just open source XGBoost
  • So, it takes CSV or libsvm input.
  • AWS recently extended it to accept recordIO-protobuf and Parquet as well.
60
Q

XGBoost: How is it used?

A

Models are serialized/deserialized with Pickle

Can use as a framework within notebooks
  • sagemaker.xgboost

Or as a built-in SageMaker algorithm

Pickle is a useful Python tool that allows you to save your models, to minimise lengthy re-training and allow you to share, commit, and re-load pre-trained machine learning models. Pickle is a generic object serialization module that can be used for serializing and deserializing objects.
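A minimal sketch of training open-source XGBoost and pickling the model (the toy data and parameters are assumptions):

import pickle
import numpy as np
import xgboost as xgb

# Toy data: 100 rows, 5 features, binary labels.
X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

model = xgb.train({"objective": "binary:logistic", "max_depth": 3}, dtrain, num_boost_round=10)

# Serialize / deserialize the model with pickle.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)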

61
Q

XGBoost: Important Hyperparameters

A

There are a lot of them. A few:

  • Subsample
    • Prevents overfitting
  • Eta
    • Step size shrinkage, prevents overfitting
  • Gamma
    • Minimum loss reduction to create a partition; larger = more conservative
  • Alpha
    • L1 regularization term; larger = more conservative
  • Lambda
    • L2 regularization term; larger = more conservative
62
Q

XGBoost: Important Hyperparameters

A

eval_metric: allows you to set the metric you are optimizing on

  • Optimize on AUC, error, rmse…
  • For example, if you care about false positives more than accuracy, you might use AUC here

scale_pos_weight

  • Adjusts balance of positive and negative weights
  • Helpful for unbalanced classes
  • Might set to sum(negative cases) / sum(positive cases)

max_depth

  • Max depth of the tree
  • Too high and you may overfit
63
Q

XGBoost: Instance Types

A

Uses CPUs only for multiple-instance training

Is memory-bound, not compute-bound

So, M5 is a good choice

As of XGBoost 1.2, single-instance GPU training is available

  • For example P3
  • Must set tree_method hyperparameter to gpu_hist
  • Trains more quickly and can be more cost effective.
64
Q

Seq2Seq: What’s it for?

A

Input is a sequence of tokens, output is a sequence of tokens

Machine Translation

Text summarization

Speech to text

Implemented with RNN’s and CNN’s with attention

65
Q

Seq2Seq: What training input does it expect?

A

RecordIO-Protobuf

  • Tokens must be integers (this is unusual, since most algorithms want floating point data.)

Start with tokenized text files

Convert to protobuf using sample code

  • Packs into integer tensors with vocabulary files
  • A lot like the TF/IDF lab we did earlier.

Must provide training data, validation data, and vocabulary files.

66
Q

Seq2Seq: How is it used?

A

Training for machine translation can take days, even on SageMaker

Pre-trained models are available

  • See the example notebook

Public training datasets are available for specific translation tasks

67
Q

Seq2Seq: Important Hyperparameters

A
  • Batch_size
  • Optimizer_type (adam, sgd, rmsprop)
  • Learning_rate
  • Num_layers_encoder
  • Num_layers_decoder
  • Can optimize on:
    • Accuracy
      • Vs. provided validation dataset
    • BLEU score
      • Compares against multiple reference translations
    • Perplexity
      • Cross-entropy
68
Q

Seq2Seq: Instance Types

A

Can only use GPU instance types (P3 for example)

Can only use a single machine for training, so it cannot be parallelized

  • But can use multiple GPUs on one machine
69
Q

protobuf recordIO

A

The protobuf recordIO format, used for training data, is the optimal way to load data into your model for training.
When you use the protobuf recordIO format you can also take advantage of pipe mode when training your model. Pipe mode, used together with the protobuf recordIO format, gives you the best data load performance by streaming your data directly from S3 to your EBS volumes used by your training instance.

70
Q

DeepAR: What's it for? Time series!

A
  • Forecasting one-dimensional time series data, e.g. stock prices
  • Uses RNN’s
  • Allows you to train the same model over several related time series
  • Finds frequencies and seasonality
71
Q

DeepAR: What training input does it expect?

A

JSON lines format

  • Gzip or Parquet

Each record must contain:

  • Start: the starting time stamp
  • Target: the time series values

Each record can contain:

  • Dynamic_feat: dynamic features (such as, was a promotion applied to a product in a time series of product purchases)
  • Cat: categorical features

{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}

72
Q

DeepAR: How is it used?

A

Always include entire time series for training, testing, and inference

Use entire dataset as training set, remove last time points for testing. Evaluate on withheld values.

Don’t use very large values for prediction length (> 400)

Train on many time series and not just one when possible

73
Q

DeepAR: Important Hyperparameters

A

Context_length
  • Number of time points the model sees before making a prediction
  • Can be smaller than seasonalities; the model will lag one year anyhow

Other hyperparameters:
  • Epochs
  • mini_batch_size
  • Learning_rate
  • Num_cells
74
Q

DeepAR: Instance Types

A

Can use CPU or GPU

Single or multi machine

Start with CPU (C4.2xlarge, C4.4xlarge)

Move up to GPU if necessary
  • Only helps with larger models

CPU-only for inference

May need larger instances for tuning

75
Q

BlazingText: What’s it for?

A

Text classification

  • Predict labels for a sentence (NOT an entire document), provided it was trained for it
  • Useful in web searches, information retrieval
  • Supervised text classification, OR:

Word2vec

  • Creates a vector representation of words
  • Semantically similar words are represented by vectors close to each other
  • This is called a word embedding
  • It is useful for NLP, but is not an NLP algorithm in itself!
    • Used in machine translation, sentiment analysis
  • Remember it only works on individual words, not sentences or documents
  • gives embedding layer that puts similar words close to each other
76
Q

BlazingText: What training input does it expect?

A

For supervised mode (text classification):

  • One sentence per line
  • First “word” in the sentence is the string __label__ followed by the label

Also, “augmented manifest text format”

Word2vec just wants a text file with one training sentence per line.

__label__4 linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2 bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .

{"source": "linux ready for prime time , intel says , despite all the linux hype", "label": 1}
{"source": "bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label": 2}

77
Q

BlazingText: How is it used?

A

Word2vec has multiple modes

  • Cbow (Continuous Bag of Words)
  • Skip-gram (this is an N-gram and order does matter)
  • Batch skip-gram
    • Distributed computation over many CPU nodes
78
Q

BlazingText: Important Hyperparameters

A

Word2vec (i.e., the embedding mode):

  • Mode (required): the Word2vec architecture used for training
    • Valid values: batch_skipgram, skipgram, cbow
  • Learning_rate
  • Window_size
  • Vector_dim
  • Negative_samples

Text classification:

  • Mode: training mode, supervised, Required
  • buckets, early stopping
  • Epochs
  • Learning_rate
  • Word_ngrams: (how many words we are putting together)
  • Vector_dim
79
Q

BlazingText: Instance Types

A

For cbow and skipgram, recommend a single ml.p3.2xlarge

  • Any single CPU or single GPU instance will work

For batch_skipgram, can use single or multiple CPU instances

For text classification, C5 recommended if less than 2GB training data. For larger data sets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)

80
Q

Object Detection: What’s it for?

A
  • Identify all objects in an image with bounding boxes
  • Detects and classifies objects with a single deep neural network
  • Classes are accompanied by confidence scores
  • Can train from scratch, or use pre-trained models based on ImageNet
81
Q

Object2Vec: What’s it for?

A

Remember word2vec from Blazing Text? It’s like that, but arbitrary objects

It creates low-dimensional dense embeddings of high-dimensional objects

It is basically word2vec, generalized to handle things other than words.

Compute nearest neighbors of objects

Visualize clusters

Genre prediction

Recommendations (similar items or users)

82
Q

Object2Vec: What training input does it expect?

A
  • Data must be tokenized into integers
  • Training data consists of pairs of tokens and/or sequences of tokens
    • Sentence – sentence
    • Labels – sequence (genre to description?)
    • Customer – customer
    • Product – product
    • User – item

{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}

83
Q

Object2Vec: How is it used?

A
  • Process data into JSON Lines and shuffle it
  • Train with two input channels, two encoders, and a comparator
  • Encoder choices:
    • Average-pooled embeddings
    • CNNs
    • Bidirectional LSTM
  • Comparator is followed by a feed-forward neural network

84
Q

Object2Vec: Important Hyperparameters

A
  • The usual deep learning ones…
  • Dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
  • Enc1_network, enc2_network
  • Choose hcnn, bilstm, pooled_embedding
85
Q

Object2Vec: Instance Types

A

• Can only train on a single machine (CPU or GPU, multi-GPU OK)

ml.m5.2xlarge

ml.p2.xlarge

If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge

  • Inference: use ml.p2.2xlarge
  • Use INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression.
86
Q

Object Detection: What training input does it expect?

A
  • RecordIO or image format (jpg or png)
  • With image format, supply a JSON file for annotation data for each image

{
  "file": "your_image_directory/sample_image1.jpg",
  "image_size": [
    { "width": 500, "height": 400, "depth": 3 }
  ],
  "annotations": [
    { "class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128 }
  ],
  "categories": [
    { "class_id": 0, "name": "dog" }
  ]
}

87
Q

Object Detection: How is it used?

A

• Takes an image as input, outputs all instances of objects in the image with categories and confidence scores

Uses a CNN with the Single Shot multibox Detector (SSD) algorithm

• The base CNN can be VGG-16 or ResNet-50

Transfer learning mode / incremental training

  • Use a pre-trained model for the base network weights, instead of random initial weights
  • Uses flip, rescale, and jitter internally to avoid overfitting
88
Q

Object Detection: Important Hyperparameters

A
  • Mini_batch_size
  • Learning_rate
  • Optimizer
  • Sgd, adam, rmsprop, adadelta
89
Q

Semantic Segmentation: What’s it for?

A

Pixel-level object classification

Different from image classification – that assigns labels to whole images

Different from object detection – that assigns labels to bounding boxes

Useful for self-driving vehicles, medical imaging diagnostics, robot sensing

Produces a segmentation mask

90
Q

Semantic Segmentation: What training input does it expect?

A

JPG Images and PNG annotations

For both training and validation

Label maps to describe annotations

Augmented manifest image format supported for Pipe mode.

JPG images accepted for inference

91
Q

Semantic Segmentation: How is it used?

A

Built on MXNet Gluon and Gluon CV

Choice of 3 algorithms:
  • Fully-Convolutional Network (FCN)
  • Pyramid Scene Parsing (PSP)
  • DeepLabV3

Choice of backbones:
  • ResNet-50
  • ResNet-101
  • Both trained on ImageNet

Incremental training, or training from scratch, supported too

92
Q

Semantic Segmentation: Important Hyperparameters

A
  • Epochs, learning rate, batch size, optimizer, etc.
  • Algorithm
  • Backbone
93
Q

Semantic Segmentation: Instance Types

A

Only GPU supported for training (P2 or P3) on a single machine only

  • Specifically ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, or ml.p3.16xlarge
  • Inference on CPU (C5 or M5) or GPU (P2 or P3)
94
Q

Random Cut Forest: What’s it for?

A

Anomaly detection

Unsupervised

Detect unexpected spikes in time series data

Breaks in periodicity

Unclassifiable data points

Assigns an anomaly score to each data point

Based on an algorithm developed by Amazon that they seem to be very proud of!

95
Q

Random Cut Forest: What training input does it expect?

A

RecordIO-protobuf or CSV

Can use File or Pipe mode on either

Optional test channel for computing accuracy, precision, recall, and F1 on labeled data (anomaly or not)

96
Q

Random Cut Forest: How is it used?

A
  • Creates a forest of trees where each tree is a partition of the training data; looks at expected change in complexity of the tree as a result of adding a point into it
  • Data is sampled randomly
  • Then trained
  • RCF shows up in Kinesis Analytics as well; it can work on streaming data too.
97
Q

Random Cut Forest: Important Hyperparameters

A
  • Num_trees
  • Increasing reduces noise
  • Num_samples_per_tree
  • Should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
98
Q

Random Cut Forest: Instance Types

A

Does not take advantage of GPUs
  • Use M4, C4, or C5 for training
  • ml.c5.xl for inference

99
Q

PCA

A

Dimensionality reduction technique

Reduces data to a smaller number of features / attributes

In Amazon SageMaker, PCA operates in two modes, depending on the scenario:

regular: For datasets with sparse data and a moderate number of observations and features.
randomized: For datasets with both a large number of observations and features. This mode uses an approximation algorithm.

100
Q

Factorization Machines

A

Good algorithm for recommender systems

good for sparse data,

Classification or regression

Factorization Machines cannot be used for finding similar population segments (groupings).

Training data needs to be in recordIO-protobuf with Float32
  • Sparse data means CSV isn't practical

101
Q

IP Insights

A

Identify anomalous behavior from IP addresses

CSV files only!

102
Q

Sagemaker Autopilot

A

Can add in human guidance

With or without code, in SageMaker Studio or via the AWS SDKs

Problem types:
  • Binary classification
  • Multiclass classification
  • Regression

Algorithm types:
  • Linear Learner
  • XGBoost
  • Deep Learning (MLPs)

Data must be tabular CSV

103
Q

Amazon Comprehend

A

• Natural Language Processing and Text Analytics

Input social media, emails, web pages, documents, transcripts, medical records (Comprehend Medical)

Extract key phrases, entities, sentiment, language, syntax, topics, and document classifications

Can train on your own data

104
Q

IP Insights: What’s it for?

A
  • Unsupervised learning of IP address usage patterns
  • Identifies suspicious behavior from IP addresses
  • Identify logins from anomalous IP’s
  • Identify accounts creating resources from anomalous IP’s
105
Q

IP Insights: What training input does it expect?

A
  • User names and account IDs can be fed in directly; no need to pre-process
  • Training channel, optional validation (computes AUC score)
  • CSV only
    • Entity, IP
106
Q

IP Insights: How is it used?

A

  • Uses a neural network to learn latent vector representations of entities and IP addresses
  • Entities are hashed and embedded
    • Need a sufficiently large hash size
  • Automatically generates negative samples during training by randomly pairing entities and IPs
107
Q

IP Insights: Important Hyperparameters

A

Num_entity_vectors
  • Hash size
  • Set to twice the number of unique entity identifiers

Vector_dim
  • Size of embedding vectors
  • Scales model size
  • Too large results in overfitting

Epochs, learning rate, batch size, etc.


108
Q

IP Insights: Instance Types

A

CPU or GPU

GPU recommended
  • ml.p3.2xlarge or higher
  • Can use multiple GPUs

Size of CPU instance depends on vector_dim and num_entity_vectors

109
Q

Reinforcement Learning

A
  • You have some sort of agent that "explores" some space
  • As it goes, it learns the value of different state changes in different conditions
  • Those values inform subsequent behavior of the agent

Examples: Pac-Man, Cat & Mouse game (game AI)

Supply chain management

HVAC systems

Industrial robotics

Dialog systems

Autonomous vehicles

• Yields fast on-line performance once the space has been explored

110
Q

MXNet

A

Apache MXNet (MXNet) is an open source deep learning framework that allows you to define, train, and deploy deep neural networks on a wide array of platforms, from cloud infrastructure to mobile devices. It is highly scalable, which allows for fast model training, and it supports a flexible programming model and multiple languages.

111
Q

Caffe

A

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors.

112
Q

Q-Learning

A

A specific implementation of reinforcement learning

You have:
  • A set of environmental states s
  • A set of possible actions in those states a
  • A value of each state/action Q

Start off with Q values of 0
  • Explore the space
  • As bad things happen after a given state/action, reduce its Q
  • As rewards happen after a given state/action, increase its Q
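A minimal sketch of the Q-value update described above (the state/action counts, rewards, and constants are toy assumptions):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # start off with Q values of 0
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def update(s, a, reward, s_next):
    # Rewards raise Q for that state/action; bad outcomes (negative reward) lower it.
    Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])

update(s=0, a=1, reward=-1.0, s_next=2)  # a bad thing happened: Q[0, 1] goes down
update(s=0, a=0, reward=+1.0, s_next=3)  # a reward happened: Q[0, 0] goes up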
113
Q

Impact of Batch Size

A

Large batch sizes train faster, but are in danger of getting stuck in a local minimum

114
Q

Impact of learning rate

A

Increasing the learning rate has a tendency to overshoot the solution

115
Q

CloudWatch

A

CloudWatch is a repository of performance metrics associated with your endpoints, which SageMaker can use to determine whether you have the right number of them.

116
Q

Elastic Inference

A

EI accelerators can be attached to CPU inference instances to accelerate deep learning inference at a fraction of the cost of using a GPU inference node.

Accelerates deep learning inference, only works with Deep Learning

  • At fraction of cost of using a GPU instance for inference

EI accelerators may be added alongside a CPU instance

  • ml.eia1.medium / large / xlarge

EI accelerators may also be applied to notebooks

Works with Tensorflow and MXNet pre-built containers

  • ONNX may be used to export models to MXNet

Works with custom containers built with EI-enabled Tensorflow or MXNet

Works with Image Classification and Object Detection built-in algorithms

117
Q

SVM

A

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVM can solve linear and non-linear problems and work well for many practical problems. SVM creates a line or a hyperplane which separates the data into classes.

SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification.

119
Q

Latent Dirichlet Allocation

A
  • Another topic modeling algorithm
  • Not deep learning
  • Unsupervised
  • The topics themselves are unlabeled; they are just groupings of documents with a shared subset of words
  • Can be used for things other than words
    • Cluster customers based on purchases
    • Harmonic analysis in music


120
Q

LDA: What training input does it expect?

A

Train channel, optional test channel

recordIO-protobuf or CSV

Each document has counts for every word in vocabulary (in CSV format)

Pipe mode only supported with recordIO

121
Q

LDA: How is it used?

A
  • Unsupervised; generates however many topics you specify
  • Optional test channel can be used for scoring results
    • Per-word log likelihood
  • Functionally similar to NTM, but CPU-based
    • Therefore maybe cheaper / more efficient
122
Q

LDA: Important Hyperparameters

A
  • Num_topics
  • Alpha0
  • Initial guess for concentration parameter
  • Smaller values generate sparse topic mixtures
  • Larger values (>1.0) produce uniform mixtures
123
Q

Poisson Distribution

A

The Poisson distribution lets you do interesting things like find the probability of a number of events in a time period, or the probability of waiting some time until the next event.

The Poisson probability mass function gives the probability of observing k events in a time period, given the average number of events per period λ:

P(k; λ) = λ^k · e^(−λ) / k!
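A worked example of the PMF (the rate λ = 3 events per period is an arbitrary assumption):

from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    # P(k; λ) = λ^k · e^(−λ) / k!
    return lam ** k * exp(-lam) / factorial(k)

print(poisson_pmf(5, 3.0))  # probability of exactly 5 events when 3 are expected: ~0.1008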

124
Q

S3

A

Backbone for many AWS ML services (example: SageMaker)

Create a “Data Lake”

  • Infinite size, no provisioning
  • 99.999999999% durability
  • Decoupling of storage (S3) from compute (EC2, Amazon Athena, Amazon Redshift Spectrum, Amazon Rekognition, and AWS Glue)

Centralized Architecture

Object storage => supports any file format

Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf

125
Q

AWS S3 Data Partitioning

A

Pattern for speeding up range queries (ex: AWS Athena)

By Date: s3://bucket/my-data-set/year/month/day/hour/data_00.csv

By Product: s3://bucket/my-data-set/product-id/data_32.csv

You can define whatever partitioning strategy you like!

Data partitioning will be handled by some tools we use (e.g. AWS Glue)

126
Q

S3 Durability

A

Measures how unlikely it is that your data will be lost. Example: S3 offers 99.999999999% (11 nines) durability.
127
Q

S3 Availability

A

Measures how available a service is.

Example: S3 Standard offers 99.99% availability – unavailable up to ~53 minutes per year.

128
Q

S3 Infrequent Access

A

S3 Storage Classes – Infrequent Access

For data that is less frequently accessed, but requires rapid access when needed
  • Lower cost than S3 Standard

Amazon S3 Standard-Infrequent Access (S3 Standard-IA)
  • 99.9% Availability
  • Use cases: disaster recovery, backups

Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA)
  • High durability (99.999999999%) in a single AZ; data lost when the AZ is destroyed
  • 99.5% Availability
  • Use cases: storing secondary backup copies of on-premises data, or data you can recreate
129
Q

S3 Storage

A

S3 Storage Classes

Amazon S3 Standard - General Purpose

Amazon S3 Standard-Infrequent Access (IA)

Amazon S3 One Zone-Infrequent Access

Amazon S3 Glacier Instant Retrieval

Amazon S3 Glacier Flexible Retrieval

Amazon S3 Glacier Deep Archive

Amazon S3 Intelligent Tiering

Can move between classes manually or using S3 Lifecycle configurations

130
Q

S3 Security - Other

A

Networking - VPC Endpoint Gateway:

  • Allow traffic to stay within your VPC (instead of going through public web)
  • Make sure your private services (AWS SageMaker) can access S3
  • Very important for AWS ML Exam

Logging and Audit:

  • S3 access logs can be stored in other S3 bucket
  • API calls can be logged in AWS CloudTrail

Tag-based (combined with IAM policies and bucket policies)

  • Example: Add tag Classification=PHI to your objects
131
Q

Apache Kafka

A

Apache Kafka is a distributed event store and stream-processing platform.

It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a “message set” abstraction that naturally groups messages together to reduce the overhead of the network roundtrip.

This “leads to larger network packets, larger sequential disk operations, contiguous memory blocks […] which allows Kafka to turn a bursty stream of random message writes into linear writes.”[4]

132
Q

Amazon Kinesis

A

Kinesis is a managed alternative to Apache Kafka

  • Great for application logs, metrics, IoT, clickstreams
  • Great for “real-time” big data
  • Great for streaming processing frameworks (Spark, NiFi, etc…)
  • Data is automatically replicated synchronously to 3 AZ
  • Kinesis Streams: low latency streaming ingest at scale
  • Kinesis Analytics: perform real-time analytics on streams using SQL
  • Kinesis Firehose: load streams into S3, Redshift, ElasticSearch & Splunk
  • Kinesis Video Streams: meant for streaming video in real-time
133
Q

Amazon Redshift

A

Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale.

Use Redshift spectrum to query data directly in S3 (no loading)

OLAP: online analytical processing

134
Q

Firehose vs Streams

A

Firehose
  • Fully managed; sends to S3, Splunk, Redshift, ElasticSearch
  • Serverless data transformations with Lambda
  • Near real time (lowest buffer time is 1 minute)
  • Automated scaling
  • No data storage

————————

Streams
  • You write custom code (producer / consumer)
  • Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
  • Automatic scaling with On-demand Mode
  • Data storage for 1 to 365 days; replay capability; multiple consumers

135
Q

AWS Glue

A

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

136
Q

Amazon Athena

A

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

137
Q

AWS Data Pipeline vs Glue

A

Glue:

  • Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
  • Glue ETL - Do not worry about configuring or managing the resources
  • Data Catalog to make the data available to Athena or Redshift Spectrum

Data Pipeline:

  • Orchestration service: does not perform the tasks for you
  • More control over the environment, compute resources that run code, & code
  • Allows access to EC2 or EMR instances (creates resources in your own account) vs all Glue resources belong to AWS
138
Q

Spark

A

Spark is a general-purpose distributed processing system used for big data workloads. It has been deployed in every type of big data use case to detect patterns, and provide real-time insight.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance

Can be included in SageMaker

139
Q

Pipe Mode

A

Amazon SageMaker built-in algorithms support Pipe mode

  • for fetching datasets in CSV format from Amazon Simple Storage Service (S3)
  • into Amazon SageMaker while training machine learning (ML) models.

With Pipe input mode, the data is streamed directly to the algorithm container while model training is in progress.

Using Pipe mode your training jobs start faster, use significantly less disk space and finish sooner. This reduces your overall cost to train machine learning models.

140
Q

Impact of batch size on model

A

Using a larger batch size, there can be a degradation in the quality of the model, as measured by its ability to generalize. Hence:

  • reduce batch size, and
  • reduce learning rate to avoid getting stuck in local minima

Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.

141
Q

Dealing with highly imbalanced datasets

A

Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class (fraud, in this case).

142
Q

Kinesis Overview

A

Kinesis Data Stream: create real-time machine learning applications

Kinesis Data Firehose: ingest massive data near-real time

Kinesis Data Analytics: real-time ETL / ML algorithms on streams

Kinesis Video Stream: real-time video stream to create ML applications

143
Q

DynamoDB:

A
  • NoSQL data store, serverless, provision read/write capacity
  • Useful to store a machine learning model served by your application
144
Q

S3:

A

  • Object storage
  • Serverless, infinite storage
  • Integration with most AWS services

145
Q

EMR Cluster

A

Master node: manages the cluster
  • Single EC2 instance

Core node: hosts HDFS data and runs tasks
  • Can be scaled up & down, but with some risk

Task node: runs tasks, does not host data
  • No risk of data loss when removing
  • Good use of spot instances

146
Q

EMR Usage

A
  • Transient vs long-running clusters
  • Can spin up task nodes using Spot instances for temporary capacity
  • Can use reserved instances on long-running clusters to save $
  • Connect directly to master to run jobs
  • Submit ordered steps via the console
  • EMR Serverless lets AWS scale your nodes automatically
147
Q

Linear Learner: automatic model tuning

A

When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically.

This sets the number of parallel models, num_models, to 1. The algorithm ignores any value that you set for num_models.

148
Q

Amazon Rekognition

A

Computer vision

  • Object and scene detection
  • Can use your own face collection
  • Image moderation
  • Facial analysis
  • Celebrity recognition
  • Face comparison
  • Text in image
  • Video analysis
  • Objects / people / celebrities marked on timeline
  • People Pathing


Rekognition: The Nitty Gritty

  • Images come from S3, or provide image bytes as part of the request
    • S3 will be faster if the image is already there
  • Facial recognition depends on good lighting, angle, visibility of eyes, resolution
  • Video must come from Kinesis Video Streams
    • H.264 encoded
    • 5-30 FPS
    • Favor resolution over framerate
  • Can use with Lambda to trigger image analysis upon upload

149
Q

Why would you want L1?

A
  • Feature selection can reduce dimensionality
    • Out of 100 features, maybe only 10 end up with non-zero coefficients!
    • The resulting sparsity can make up for its computational inefficiency
  • But if you think all of your features are important, L2 is probably a better choice

150
Q

Rekognition

A

Rekognition cannot be used to create labels for training data.

151
Q

Specificity

A

Specificity = TN / (TN + FP)

If the model has a high specificity, it implies that false positives (think of them as false alarms) have been weeded out. In other words, the specificity of a test refers to how well the test identifies those who have not indulged in substance abuse.

153
Q

Deploying Trained Models

A

  • Save your trained model to S3
  • Can deploy two ways:
    • Persistent endpoint for making individual predictions on demand
    • SageMaker Batch Transform to get predictions for an entire dataset

Lots of cool options

  • Inference Pipelines for more complex processing
  • SageMaker Neo for deploying to edge devices
  • Elastic Inference for accelerating deep learning models
  • Automatic scaling (increase # of endpoints as needed)
154
Q

Linear Learner Hyperparameters

A

Balance_multiclass_weights

  • Gives each class equal importance in loss functions

Learning_rate, mini_batch_size

L1

  • Regularization

Wd

  • Weight decay (L2 regularization)