AWS Machine Learning_part 1 A Flashcards

1
Q

Glue ETL Transformation

A

Machine Learning Transformations
FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
• Format conversions: CSV, JSON, Avro, Parquet, ORC, XML
• Apache Spark transformations (example: K-Means)

Bundled Transformations:
• DropFields, DropNullFields – remove (null) fields
• Filter – specify a function to filter records
• Join – to enrich data
• Map - add fields, delete fields, perform external lookups


2
Q

AWS Data Stores for Machine Learning

A

Redshift: Data Warehousing

RDS / Aurora: Relational store, SQL (OLTP – Online Transaction Processing)

DynamoDB: NoSQL data store, serverless

S3: Object Storage

OpenSearch (previously Elasticsearch): indexing of data, search amongst data points, clickstream analytics

3
Q

Amazon SageMaker NTM

A

Unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution.

4
Q

Apache Parquet

A

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It is more efficient than CSV.

You cannot open a Parquet file just by using a text editor.

CSV files, by contrast, are row-based.

5
Q

Hadoop

A

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model

6
Q

Apache Zeppelin

A

Apache Zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R, and more.

7
Q

Spark MLLib

A

Spark Machine learning library

Classification: logistic regression, naïve Bayes

Regression

Decision trees

Recommendation engine (ALS)

Clustering (K-Means)

LDA (topic modeling)

ML workflow utilities (pipelines, feature transformation, persistence)

SVD, PCA, statistics
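
As a quick illustration of MLlib usage, here is a minimal PySpark sketch that fits a K-Means model; the S3 path and the column names (x1, x2) are placeholders, not real data.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-kmeans-demo").getOrCreate()

# Hypothetical CSV with two numeric feature columns, x1 and x2
df = spark.read.csv("s3://my-bucket/features.csv", header=True, inferSchema=True)

# MLlib estimators expect a single vector column of features
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

# Fit K-Means with 3 clusters and print the cluster centers
model = KMeans(k=3, seed=42).fit(features)
print(model.clusterCenters())
```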

8
Q

EMR Notebooks

A

Similar concept to Zeppelin, with more AWS integration

Notebooks backed up to S3

Provision clusters from the notebook!

Hosted inside a VPC

Accessed only via AWS console

9
Q

LDA Latent Dirichlet

A

is a “bag-of-words” model, which means that the order of words does not matter. LDA is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α).

A statistical model for discovering the abstract topics aka topic modeling.

unsupervised classification

10
Q

Amazon Redshift Spectrum

A

Amazon Redshift Spectrum is a feature within Amazon Web Services’ Redshift data warehousing service that lets a data analyst conduct fast, complex analysis on objects stored on the AWS cloud. With Redshift Spectrum, an analyst can perform SQL queries on data stored in Amazon S3 buckets.

11
Q

Regression metrics

A

Regression models have continuous output. So, we need a metric based on calculating some sort of distance between predicted and ground truth.

In order to evaluate Regression models, we’ll discuss these metrics in detail:

Mean Absolute Error (MAE),

Mean Squared Error (MSE),

Root Mean Squared Error (RMSE),

R² (R-Squared).
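
A minimal sketch of computing these metrics with scikit-learn and NumPy; the ground-truth and predicted values are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy ground truth and predictions, purely for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean(|y - y_hat|)
mse = mean_squared_error(y_true, y_pred)    # mean((y - y_hat)^2)
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```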

12
Q

Optional Hyperparameters for BlazingText

A

buckets - This represents the number of hash buckets to use for subwords.

epochs - This represents the number of complete passes through the training data.

learning_rate - This represents the step size used for parameter updates.

13
Q

A plumbing company wants to better predict the sales of its flagship copper tubing for the next year. The sales data has copper tubing sizes captured as XS, S, M, L, XL and the retail price of the copper tubing varies with the size.

Which of the following data preparation steps need to be followed for the copper tubing size before it goes into the regression model for prediction?

A

Categorical Encoding

Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance varies based on how categorical variables are encoded.

Categorical variables can be divided into two categories: Nominal (no particular order) and Ordinal (follow some order).

As pricing varies with the size, we need to carry out categorical encoding that is representative of the size of the copper tubing. An example could be like so:

XS -> 2

S -> 4

M -> 7

L -> 10

XL -> 12

Here is a good reference for understanding categorical encoding:

https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
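
A small pandas sketch of this ordinal encoding; the DataFrame contents and the exact numeric mapping are illustrative assumptions, not real sales data.

```python
import pandas as pd

# Hypothetical sales data with the ordinal size column from the question
df = pd.DataFrame({"size": ["XS", "S", "M", "L", "XL"],
                   "units_sold": [120, 340, 560, 410, 150]})

# Map each size to a number that reflects the tubing dimensions,
# preserving the order (and rough spacing) of the categories
size_map = {"XS": 2, "S": 4, "M": 7, "L": 10, "XL": 12}
df["size_encoded"] = df["size"].map(size_map)

print(df)
```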

14
Q

One-hot Encoding

A
  • A one hot encoding is a representation of categorical variables as binary vectors. Initially, categorical values are mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.

One-hot encoding would not capture the price variance with respect to size, so this option is incorrect.
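
A minimal pandas sketch of one-hot encoding; the color column is a made-up nominal feature used only for illustration.

```python
import pandas as pd

# Hypothetical nominal feature with no natural ordering
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)
```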

15
Q

Quantile Binning

Interval Binning

A

Binning: Binning is the process of converting numeric data into categorical data. It is one of the methods used in feature engineering. Binning comes in very handy for numeric features, especially when it is one with wide range.

Quantile Binning: Quantiles are values that divide the data into equal portions. For example, the median divides the data in halves; half the data are smaller, and half larger than the median. The quartiles divide the data into quarters, the deciles into tenths, etc.

Interval Binning: Each bin contains a specific numeric range. For example, we can group a person’s age into decades: 0–9 years old will be in bin 1, 10–19 years will be in bin 2.

Binning (of any type) is not relevant in this case, so both these options are incorrect.

Please refer to the link below for more information on binning:

https://medium.com/hacktive-devs/feature-engineering-in-machine-learning-part-1-a3904769cd93
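
For reference, a small pandas sketch contrasting interval and quantile binning on made-up ages.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 31, 46, 58, 62, 79, 85])

# Interval binning: fixed-width numeric ranges (decades)
interval_bins = pd.cut(ages, bins=range(0, 100, 10), right=False)

# Quantile binning: each bin holds roughly the same number of observations
quantile_bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"age": ages, "interval": interval_bins, "quantile": quantile_bins}))
```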

16
Q

XGBoost

A

XGBoost algorithm cannot be used to build a click prediction system.

XGBoost is an extreme gradient boosting algorithm that is optimized for boosted decision trees.

17
Q

Kinesis Data Analytics

A

Kinesis Data Analytics enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The four most common use cases are

  • streaming extract-transform-load (ETL),
  • continuous metric generation,
  • responsive real-time analytics, and
  • interactive querying of data streams.

Kinesis Data Analytics cannot directly consume incoming video stream data.

Can be used ONLY on streaming data.

18
Q

PCA

K-means

A

Unsupervised learning techniques: PCA is used for dimensionality reduction, K-Means for clustering.

19
Q

Imputing Missing Data: Machine Learning

A

KNN: find the K “nearest” (most similar) rows and average their values. Assumes numerical data, not categorical.

There are ways to handle categorical data (Hamming distance), but categorical data is probably better served by…

Deep Learning: build a machine learning model to impute data for your machine learning model!

Works well for categorical data. Really well. But it’s complicated.

Regression: find linear or non-linear relationships between the missing feature and other features.

Most advanced technique: MICE (Multiple Imputation by Chained Equations)
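
A minimal scikit-learn sketch of KNN-based and MICE-style (iterative) imputation on a toy matrix; the data values are illustrative only.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric matrix with missing values
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 4.0]])

# KNN imputation: fill each gap from the 2 most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style imputation: iteratively regress each feature on the others
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(X_knn)
print(X_mice)
```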

20
Q

Linear Discriminant Analysis (LDA)

A

Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine learning to solve more than two-class classification problems

21
Q

Residuals

A

A residual for an observation in the evaluation data is the difference between the true target and the predicted target.

Residuals represent the portion of the target that the model is unable to predict. A positive residual indicates that the model is underestimating the target (the actual target is larger than the predicted target). A negative residual indicates an overestimation (the actual target is smaller than the predicted target).

The residuals plot would indicate any trend of underestimation or overestimation.
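
A tiny worked example of residuals on made-up values.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 16.0, 18.0])

# Residual = actual target - predicted target
residuals = y_true - y_pred   # positive -> underestimation, negative -> overestimation
print(residuals)              # [-1.   0.5 -1.   2. ]
```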

22
Q

Latent Dirchlet Allocation (LDA)

A

The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics.

You can use LDA to figure out the right categories for each product.

23
Q

Amazon Algos

A
24
Q

Other ways to generate training labels

A

Rekognition
• AWS service for image recognition
• Automatically classify images

Comprehend
• AWS service for text analysis and topic modeling
• Automatically classify text by topics, sentiment

• Any pre-trained model or unsupervised technique that may be helpful

25
Q

MICE

A

Multiple Imputation by Chained Equations finds relationships between features and is one of the most advanced imputation methods available. Using machine learning techniques such as KNN and deep learning are also good approaches.

26
Q

Deep Learning Frameworks

A

Apache MXNet

  • is an open-source deep learning software framework, used to train and deploy deep neural networks.
  • It is scalable, allowing for fast model training and supports a flexible programming model and multiple programming languages

TensorFlow / Keras: from Google

Both are supported on AWS.

27
Q

Activation Function

A

Define the output of a node/neuron given its input signals

28
Q

RMSE

A

For regression tasks, Amazon ML uses the industry standard root mean square error (RMSE) metric. It is a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the value of the RMSE, the better is the predictive accuracy of the model. A model with perfectly correct predictions would have an RMSE of 0.

RMSE only gives the magnitude of the error, not its direction (over- or under-prediction).

29
Q

AUC

A

Amazon ML provides an industry-standard accuracy metric for binary classification models called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold.

The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality (‘0’s are predicted as ‘1’s and vice versa).

AUC metric is used for classification models, so this option is not the right fit for the given use-case.

Reference:

https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html
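
For the classification case, a minimal scikit-learn sketch of computing AUC on toy labels and scores.

```python
from sklearn.metrics import roc_auc_score

# Toy binary labels and predicted scores (probabilities of the positive class)
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

# AUC is threshold-independent: it only measures how well
# positive examples are ranked above negative ones
print(roc_auc_score(y_true, y_score))
```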

30
Q

Amazon FSx

A

Amazon FSx makes it easy and cost effective to launch, run, and scale feature-rich, high-performance file systems in the cloud. It supports a wide range of workloads with its reliability, security, scalability, and broad set of capabilities. Amazon FSx is built on the latest AWS compute, networking, and disk technologies to provide high performance and lower TCO. And as a fully managed service, it handles hardware provisioning, patching, and backups – freeing you up to focus on your applications, your end users, and your business.

31
Q

Amazon FSx for Lustre

A

Amazon FSx for Lustre is a file system service. FSx for Lustre speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds. The first time you run a training job, FSx for Lustre automatically copies data from Amazon S3 and makes it available to Amazon SageMaker. You can use the same Amazon FSx file system for subsequent iterations of training jobs, preventing repeated downloads of common Amazon S3 objects.

32
Q

Amazon Elastic File System (Amazon EFS),

A

If your training data is already in Amazon Elastic File System (Amazon EFS), it is recommended to use that as your training data source. Amazon EFS has the benefit of directly launching your training jobs from the service without the need for data movement, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with including different fields or labels in their dataset.

33
Q

Amazon Data Lake Formation

A

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake lets you break down data silos and combine different types of analytics to gain insights and guide better business decisions.

34
Q

Amazon EBS

A

Amazon Elastic Block Store (Amazon EBS) is an easy-to-use, scalable, high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2).

EBS volumes are replicated only within a single Availability Zone, so EBS data is not highly available across AZs.

35
Q

Amazon Kinesis

A

is a platform for streaming data on AWS

36
Q

Amazon Kinesis Data Firehose

A

As data is ingested in real time, you can use Firehose to batch and compress the data to generate incremental views.

Firehose also allows you to apply custom transformation logic using AWS Lambda before delivering the incremental view to S3.

The most common pattern: Firehose reads records from Kinesis Data Streams (up to 1 MB at a time), optionally transforms them with Lambda, and then writes the data in batches to a target such as a database, hence near real time.

Firehose can also deliver to custom destinations as long as there is a valid HTTP endpoint.
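
A hedged boto3 sketch of batching records into a Firehose delivery stream; the stream name and record payloads are placeholders.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical clickstream records, newline-delimited JSON
records = [{"Data": (json.dumps({"user": i, "clicks": i * 3}) + "\n").encode()}
           for i in range(10)]

# Firehose buffers the records and delivers them in batches to the
# configured destination (e.g. S3), hence "near real time"
response = firehose.put_record_batch(
    DeliveryStreamName="my-delivery-stream",   # placeholder stream name
    Records=records,
)
print(response["FailedPutCount"])
```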

37
Q

Apache Spark on Amazon EMR

A

Apache Spark on Amazon EMR

provides a managed framework that can process massive quantities of data. Amazon EMR supports many instance types that have proportionally high CPU with increased network performance, which is well suited for HPC (high-performance computing) applications.

38
Q

ETL processing services

A

Amazon Athena

AWS Glue

Amazon Redshift Spectrum

functionally complementary and can be built to preprocess datasets stored in or targeted to Amazon S3. In addition to transforming data with services like Athena and Amazon Redshift Spectrum, you can use services like AWS Glue to provide metadata discovery and management features. The choice of ETL processing tool is also largely dictated by the type of data you have. For example, tabular data processing with Athena lets you manipulate your data files in Amazon S3 using SQL.

39
Q

Kinesis Client Library

A

KCL consumer application – an application that is custom-built using KCL and designed to read and process records from data streams.

40
Q

Dimensionality reduction techniques

A
  • principal component analysis (PCA) or
  • t-distributed stochastic neighbor embedding (t-SNE).
41
Q

Scaling/Normalizing transformations

A

Mean/variance standardization

MinMax scaling

Maxabs scaling

Robust scaling

Normalizer
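
A minimal scikit-learn sketch applying each of these scalers to a toy matrix with an outlier.

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, MaxAbsScaler,
                                   RobustScaler, Normalizer)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # second column has an outlier

print(StandardScaler().fit_transform(X))   # mean/variance standardization (z-scores)
print(MinMaxScaler().fit_transform(X))     # rescale each feature to [0, 1]
print(MaxAbsScaler().fit_transform(X))     # divide by the max absolute value per feature
print(RobustScaler().fit_transform(X))     # use median/IQR, less sensitive to outliers
print(Normalizer().fit_transform(X))       # scale each *row* to unit norm
```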

42
Q

Classification problems

A

Amazon SageMaker provides a few built-in algorithms that work for these situations:

  • Linear Learner,
  • XGBoost, and
  • K-Nearest Neighbors.

XGBoost, for instance, is an open-source implementation of the gradient-boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models.

For Regression: In terms of the built-in Amazon SageMaker algorithms you could choose, it’s pretty similar. Again, you could choose Linear Learner and XGBoost. The difference is that you set the hyperparameters to direct these algorithms to produce quantitative results.

43
Q

Amazon SageMaker built-in algorithms for NLP

A

BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms.

Sequence2sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.

Object2Vec generalizes the well-known Word2Vec embedding technique for words that are optimized in the Amazon SageMaker BlazingText algorithm.

44
Q

Amazon SageMaker built-in algorithms for computer vision:

A

Image classification is a supervised learning algorithm used to classify images.

Object detection algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. The object is categorized into one of the classes in a specified collection with a confidence score that it belongs to the class. Its location and scale in the image are indicated by a rectangular bounding box.

Semantic segmentation algorithm tags every pixel in an image with a class label from a predefined set of classes.

45
Q

Other options for training algorithms

A

Up to now, the focus has been exclusively on Amazon SageMaker built-in algorithms, but there are other options for training algorithms:

  • Apache Spark with Amazon SageMaker
  • Submit custom code to train a model with a deep learning framework like TensorFlow or Apache MXNet
  • Use your own custom algorithm and put the code together as a Docker image
  • Subscribe to an algorithm from AWS Marketplace
46
Q

cross-validation

A

Use cross-validation methods to compare the performance of multiple models. The goal behind cross-validation is to help you choose the model that will eventually perform the best in production.

47
Q

K-fold cross-validation

A

K-fold cross-validation is a common validation method. In k-fold cross-validation, you split the input data into k subsets of data (also known as folds). You train your models on all but one (k-1) of the subsets, and then evaluate them on the subset that was not used for training. This process is repeated k times, with a different subset reserved for evaluation (and excluded from training) each time.

For instance, performing a 5-fold cross-validation generates five models, five datasets to train the models, five datasets to evaluate the models, and five evaluations, one for each model. In a 5-fold cross-validation for a binary classification problem, each of the evaluations reports an area under curve (AUC) metric. You can get the overall performance measure by computing the average of the five AUC metrics.
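
A minimal scikit-learn sketch of 5-fold cross-validation reporting an AUC per fold; the bundled dataset and simple pipeline are used purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 5 folds -> 5 train/evaluate rounds, each on a different held-out fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print(scores)          # one AUC per fold
print(scores.mean())   # overall performance estimate
```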

48
Q

Steps for creating a training job in SageMaker

A

Creating a training job in Amazon SageMaker typically requires the following steps:

1) Specify the URL of the S3 bucket containing the training data and the S3 path for the output
2) Specify the compute resources (ML instances) to use for training
3) Specify the Amazon Elastic Container Registry (ECR) path where the training code is stored
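
A hedged boto3 sketch of these three steps; the job name, ECR image URI, IAM role ARN, and S3 paths are all placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# All names, ARNs, and the image URI below are placeholders
sm.create_training_job(
    TrainingJobName="xgboost-demo-2024-01-01",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",  # ECR path (step 3)
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[{                              # S3 URL of the training data (step 1)
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},  # S3 URL for output (step 1)
    ResourceConfig={                                # compute resources (step 2)
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```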

49
Q

Hyper parameters

A

Model Hyperparameters: define the model itself, e.g. attributes of a neural network architecture such as filter size, pooling, stride, and padding.

Optimizer Hyperparameters: related to how the model learns patterns from the data; used for neural network models. They include optimizers like gradient descent, or optimizers using momentum like Adam.

Data Hyperparameters: related to the attributes of the data, often used when you do not have enough data or enough variation in the data.

50
Q

Automated hyperparameter tuning

A

Automated hyperparameter tuning uses methods like:

• Gradient descent
• Bayesian optimization
• Evolutionary algorithms

to conduct a guided search for the best hyperparameter settings.

51
Q

Accuracy

A

Ratio of correct predictions over total predictions: Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)

52
Q

Specificity

A

Specificity = (True Negative)/(True Negative + False Positive)

53
Q

Kinesis Data Streams vs Firehose

A

Streams:
• Going to write custom code (producer / consumer)
• Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
• Automatic scaling with On-Demand Mode
• Data storage for 1 to 365 days, replay capability, multiple consumers

Build real-time applications that can be replayed.

Firehose: data delivery service
• Fully managed, send to S3, Splunk, Redshift, Elasticsearch
• Serverless data transformations with Lambda
• Near real time (lowest buffer time is 1 minute)
• Automated scaling
• No data storage

Process and deliver data.

54
Q

Sagemaker

A

cannot be used with streamed data

cannot split data automatically into train, validate, and test sets

55
Q

QuickSight

A

can only be used with structured datasets that need to be stored in Amazon S3 or a database

reads directly from S3

QuickSight’s ML Insights feature allows forecasting using QuickSight itself. This is a serverless solution that contains the least number of components.

56
Q

Standard Scaler

A

performs scaling and shifting/centering

only works with numerical data

57
Q

Bayesian Hyperparameter Optimization

A

obtains better results in fewer evaluations compared to grid search and random search

58
Q

SGD (stochastic gradient descent) model is taking too long to converge: which optimizer?

A

Adam: adaptive momentum, helps the model converge faster and get out of being stuck in local minima.

Adagrad: algorithm for gradient based optimisation that adapts the learning rate to the parameters by performing smaller updates and in turn, helps with convergence.

RMSProp: moving average of squared gradients to normalize the gradient itself

Mini-batch gradient descent has the same problems as SGD.

Full (batch) gradient descent takes longer than SGD.
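
A minimal Keras sketch of swapping in the Adam optimizer; the toy model and random data are illustrative assumptions (keras.optimizers.Adagrad or RMSprop could be substituted the same way).

```python
import numpy as np
from tensorflow import keras

# Toy binary-classification data, purely illustrative
X = np.random.rand(256, 10)
y = (X.sum(axis=1) > 5).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Swap plain SGD for Adam (adaptive momentum) to help the model converge faster
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```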

59
Q

RFE Recursive Feature Elimination

A

RFE is effective at selecting those features (columns) in a training dataset that are more or most relevant in predicting the target variable. Hence it allows you to choose the best features without modifying the existing features, which is a key requirement of the given scenario, as the Sales department wants to interpret the model and then determine the direct effect of significant characteristics on the model’s output.

60
Q

Instance vs semantic segmentation

A

Semantic segmentation associates every pixel of an image with a class label such as a person, flower, car and so on. It treats multiple objects of the same class as a single entity. It will not detect each distinct object in the image

In contrast, instance segmentation treats multiple objects of the same class as distinct individual instances.

61
Q

how to improve convergence

A

Some of the techniques to make the neural network converge faster are as follows:

  • Learning type.
  • Input normalization.
  • Activation function.
  • Learning rate.
62
Q

Momentum

A

Momentum involves adding an additional hyperparameter that controls the amount of history (momentum) to include in the update equation, i.e. the step to a new point in the search space. The value for the hyperparameter is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99.
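
A tiny sketch of a classical momentum update, assuming a hypothetical gradient value.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, learning_rate=0.01, momentum=0.9):
    """One parameter update with classical momentum.

    velocity accumulates an exponentially decaying history of past gradients,
    so consistent gradient directions build up speed while noisy ones cancel.
    """
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity
    return w, velocity

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.5, -0.3])        # gradient of the loss w.r.t. w (illustrative)
w, v = sgd_momentum_step(w, grad, v)
print(w, v)
```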

63
Q

Differences between HTTP and HTTPS

A

HTTP stands for HyperText Transfer Protocol and HTTPS stands for HyperText Transfer Protocol Secure.

In HTTP, the URL begins with “http://”, whereas in HTTPS the URL starts with “https://”

HTTP uses port number 80 for communication and HTTPS uses 443

HTTP is considered to be insecure and HTTPS is secure

HTTP works at the Application Layer and HTTPS works at the Transport Layer

In HTTP, encryption is absent; in HTTPS, encryption is present as discussed above

HTTP does not require any certificates and HTTPS needs SSL certificates

HTTP is faster than HTTPS, since HTTPS adds encryption and handshake overhead

HTTP does not improve search ranking while HTTPS improves search ranking.

HTTP does not hash data to secure it, while HTTPS hashes the data before sending it and restores it to its original state on the receiver side.

64
Q

cost function

A

The cost is the penalty that is incurred when the estimate of the target provided by the ML model does not equal the target exactly. A cost function quantifies this penalty as a single value. An optimization technique seeks to minimize the loss.

65
Q

SageMaker and Docker

A

All models in SageMaker are hosted in Docker containers

  • Pre-built deep learning
  • Pre-built scikit-learn and Spark ML
  • Pre-built Tensorflow, MXNet, Chainer, PyTorch
    • Distributed training via Horovod or Parameter Servers
  • Your own training and inference code! Or extend a pre-built image.
  • This allows you to use any script or algorithm within SageMaker, regardless of runtime or language
  • Containers are isolated, and contain all dependencies and resources needed to run

Containers need to be registered in Amazon ECR.

66
Q

Horovod

A

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Horovod was originally developed by Uber to make distributed deep learning fast and easy to use, bringing model training time down from days and weeks to hours and minutes. With Horovod, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of Python code

67
Q

Parameter Server

A

Parameter servers are a core part of many machine learning applications. Their role is to store the parameters of a machine learning model (e.g., the weights of a neural network) and to serve them to clients

68
Q

Amazon SageMaker Containers

A

Library for making containers compatible with SageMaker

• RUN pip install sagemaker-containers in your Dockerfile

69
Q

AWS Security

A

Use Identity and Access Management (IAM):

  • Set up user accounts with only the permissions they need

Use MFA

Use SSL/TLS when connecting to anything

Use CloudTrail to log API and user activity

Use encryption

Be careful with PII

70
Q

SSL/TLS

A

TLS, short for Transport Layer Security, and

SSL, short for Secure Socket Layers

are both cryptographic protocols that encrypt data and authenticate a connection when moving data on the Internet.

71
Q

Protecting Data in Transit in SageMaker

A

All traffic supports TLS / SSL

IAM roles are assigned to SageMaker to give it permissions to access resources

Inter-node training communication may be optionally encrypted

  • Can increase training time and cost with deep learning
  • AKA inter-container traffic encryption
  • Enabled via console or API when setting up a training or tuning job
72
Q

Protecting your Data at Rest in SageMaker

A

AWS Key Management Service (KMS)

  • Accepted by notebooks and all SageMaker jobs
  • Training, tuning, batch transform, endpoints
  • Notebooks and everything under /opt/ml/ and /tmp can be encrypted with a KMS key

S3

  • Can use encrypted S3 buckets for training data and hosting models
  • S3 can also use KMS
73
Q

SageMaker + VPC

A

Training jobs run in a Virtual Private Cloud (VPC)

You can use a private VPC for even more security

  • You’ll need to set up S3 VPC endpoints
  • Custom endpoint policies and S3 bucket policies can keep this secure

Notebooks are Internet-enabled by default

  • This can be a security hole
  • If disabled, your VPC needs an interface endpoint (PrivateLink) or NAT Gateway, and allow outbound connections, for training and hosting to work

Training and Inference Containers are also Internet-enabled by default

  • Network isolation is an option, but this also prevents S3 access
74
Q

Serverless Inference

A

Fully managed serverless endpoints

Specify your container, memory requirement, concurrency requirements

Underlying capacity is automatically provisioned and scaled

Good for infrequent or unpredictable traffic; will scale down to zero when there are no requests.

Charged based on usage

Monitor via CloudWatch
• ModelSetupTime, Invocations, MemoryUtilization

75
Q

Inference Pipelines

A

Linear sequence of 2-15 containers

Any combination of pre-trained built-in algorithms or your own algorithms in Docker containers

Combine pre-processing, predictions, post-processing

Spark ML and scikit-learn containers OK
• Spark ML can be run with Glue or EMR
• Serialized into MLeap format

Can handle both real-time inference and batch transforms

76
Q

Logarithm Transformation

A

The Log Transform decreases the effect of the outliers, due to the normalization of magnitude differences, and the model becomes more robust.

The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2 of x, is a strong transformation with a major effect on distribution shape. It is commonly used for reducing right skewness and is often appropriate for measured variables. It can not be applied to zero or negative values. One unit on a logarithmic scale means a multiplication by the base of logarithms being used.
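
A small NumPy sketch of log transforms on made-up right-skewed values.

```python
import numpy as np

# Right-skewed, strictly positive values (e.g. prices)
x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

x_log10 = np.log10(x)   # log base 10
x_ln = np.log(x)        # natural log (ln)
# np.log1p(x) computes log(1 + x) and is a common choice when zeros are present,
# since a plain log cannot be applied to zero or negative values
print(x_log10, x_ln, np.log1p(x))
```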

77
Q

Robust Standardization

A

Standardization (or z-score normalization) scales the values while taking into account standard deviation. If the standard deviation of features is different, their range also would differ from each other. This reduces the effect of the outliers in the features.

A better approach to standardizing input variables in the presence of outliers is to ignore the outliers from the calculation of the mean and standard deviation, then use the calculated values to scale the variable.

This is called robust standardization or robust data scaling.

78
Q

Amazon EC2 vs Amazon EMR

A

Amazon EC2 can be classified as a tool in the “Cloud Hosting” category,

Amazon EMR is grouped under “Big Data as a Service”.

Amazon EC2: Scalable, pay-as-you-go compute capacity in the cloud. Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers;

Amazon EMR: distribute your data and processing across Amazon EC2 instances using Hadoop. Amazon EMR is used in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Customers launch millions of Amazon EMR clusters every year.

Amazon EMR is a web service that makes it easy for you to process and analyze vast amounts of data using applications in the Hadoop ecosystem, including Hive, Pig, HBase, Presto, Impala, and others.

79
Q

Amazon ECR

A

Amazon ECR is a fully managed container registry offering high-performance hosting, so you can reliably deploy application images and artifacts anywhere.

80
Q

Kinesis Streams Overview

A

Kinesis Streams Overview

  • Streams are divided in ordered Shards / Partitions
  • Data retention is 24 hours by default, can go up to 365 days
  • Ability to reprocess / replay data
  • Multiple applications can consume the same stream
  • Once data is inserted in Kinesis, it can’t be deleted (immutability)
  • Records can be up to 1 MB in size: great for small amounts of data moving fast through the stream, but not for terabyte-scale batch analysis

Great for real-time streaming applications. Needs to be provisioned in advance.
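
A hedged boto3 sketch of writing a record to a Kinesis data stream; the stream name and payload are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Stream name and payload are placeholders; the partition key decides which shard
# the record lands on, and ordering is only guaranteed within a shard
kinesis.put_record(
    StreamName="my-clickstream",
    Data=json.dumps({"user": "u-123", "event": "click"}).encode(),
    PartitionKey="u-123",
)
```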