AWS Machine Learning_part 1 A Flashcards
Glue ETL Transformation
Machine Learning Transformations
FindMatches ML: identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
• Format conversions: CSV, JSON, Avro, Parquet, ORC, XML • Apache Spark transformations (example: K-Means)
Bundled Transformations:
• DropFields, DropNullFields – remove specified fields / remove fields whose values are null
• Filter – specify a function to filter records
• Join – to enrich data
• Map - add fields, delete fields, perform external lookups
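A rough sketch of how a few of these bundled transformations might look inside a Glue ETL job (the catalog database, table, S3 bucket, and field names are placeholders, and the script assumes it runs in a Glue job environment):

```python
from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields, Filter, Map
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog table registered by a Glue crawler
dyf = glue_ctx.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")

dyf = DropNullFields.apply(frame=dyf)                       # drop fields that are all null
dyf = Filter.apply(frame=dyf, f=lambda r: r["amount"] > 0)  # keep only valid orders

def add_usd(rec):
    # Map: add a derived field to every record (conversion rate is made up)
    rec["amount_usd"] = rec["amount"] * 1.1
    return rec

dyf = Map.apply(frame=dyf, f=add_usd)

# Format conversion: write the cleaned data back to S3 as Parquet
glue_ctx.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/orders-clean/"},
    format="parquet",
)
```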
AWS Data Stores for Machine Learning
Redshift: Data Warehousing
RDS Aurora: Relational Store, SQL (OLTP - Online Transaction Processing)
DynamoDB: NoSQL data store, serverless
S3: Object Storage
OpenSearch (previously Elasticsearch): Indexing of data, search amongst data points • Clickstream analytics
Amazon SageMaker NTM (Neural Topic Model)
Unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution
Apache Parquet
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, is compatible with most of the data processing frameworks around Hadoop, and is more efficient than CSV.
Parquet is a binary format, so you cannot open it just by using a text editor.
CSV files are row based
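As a small illustration, here is one way to convert a row-based CSV into Parquet with pandas (file names and column names are hypothetical; pandas needs the pyarrow or fastparquet engine installed):

```python
import pandas as pd

# Read a row-based CSV file (hypothetical path)
df = pd.read_csv("sales.csv")

# Write it out as column-oriented Parquet
df.to_parquet("sales.parquet", engine="pyarrow")

# Reading back only the columns you need is where Parquet shines
subset = pd.read_parquet("sales.parquet", columns=["region", "revenue"])
```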
Hadoop
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model
Apache Zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R and more.
Spark MLlib
Spark's machine learning library, which includes:
Classification: logistic regression, naïve Bayes
Regression
Decision trees
Recommendation engine (ALS)
Clustering (K-Means)
LDA (topic modeling)
ML workflow utilities (pipelines, feature transformation, persistence)
SVD, PCA, statistics
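A minimal sketch of a Spark MLlib classification pipeline in PySpark, using a tiny made-up dataset, just to show the pipeline and feature-transformation utilities mentioned above:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy dataset: age, income, and a binary label
df = spark.createDataFrame(
    [(25, 40000.0, 0), (43, 72000.0, 1), (31, 55000.0, 0), (52, 90000.0, 1)],
    ["age", "income", "label"],
)

# Feature transformation: assemble raw columns into a feature vector
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()
```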
EMR Notebooks
Similar concept to Zeppelin, with more AWS integration
Notebooks backed up to S3
Provision clusters from the notebook!
Hosted inside a VPC
Accessed only via AWS console
LDA (Latent Dirichlet Allocation)
LDA is a "bag-of-words" model, which means that the order of words does not matter. It is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α).
A statistical model for discovering abstract topics in a collection of documents, aka topic modeling.
unsupervised classification
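This is not the SageMaker implementation, but a quick scikit-learn sketch of the topic-modeling idea, with made-up toy documents and an arbitrary topic count:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats around the yard",
    "stocks and bonds moved on market news",
    "investors watched the stock market closely",
]

# Bag-of-words counts: word order is ignored, only occurrence counts matter
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Ask LDA for 2 topics; each document becomes a mixture over those topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics)  # one row per document, one column per topic
```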
Amazon Redshift Spectrum
Amazon Redshift Spectrum is a feature within Amazon Web Services’ Redshift data warehousing service that lets a data analyst conduct fast, complex analysis on objects stored on the AWS cloud. With Redshift Spectrum, an analyst can perform SQL queries on data stored in Amazon S3 buckets.
Regression metrics
Regression models have continuous output. So, we need a metric based on calculating some sort of distance between predicted and ground truth.
In order to evaluate Regression models, we’ll discuss these metrics in detail:
Mean Absolute Error (MAE),
Mean Squared Error (MSE),
Root Mean Squared Error (RMSE),
R² (R-Squared).
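A short example computing these four metrics with scikit-learn on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical ground truth
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```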
Optional Hyperparameters for BlazingText
buckets - This represents the number of hash buckets to use for subwords.
epochs - This represents the number of complete passes through the training data.
learning_rate - This represents the step size used for parameter updates.
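A hedged sketch of how these hyperparameters might be set on the SageMaker BlazingText container (the IAM role ARN, S3 paths, instance type, and hyperparameter values are placeholders):

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
image_uri = sagemaker.image_uris.retrieve("blazingtext", region)

bt = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/blazingtext/output",
    sagemaker_session=session,
)

# Optional hyperparameters discussed above, plus the required training mode
bt.set_hyperparameters(mode="skipgram", epochs=10, learning_rate=0.05, buckets=2000000)
bt.fit({"train": "s3://my-bucket/blazingtext/train"})
```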
A plumbing company wants to better predict the sales of its flagship copper tubing for the next year. The sales data has copper tubing sizes captured as XS, S, M, L, XL and the retail price of the copper tubing varies with the size.
Which of the following data preparation steps need to be followed for the copper tubing size before it goes into the regression model for prediction?
Categorical Encoding
Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance varies based on how categorical variables are encoded.
Categorical variables can be divided into two categories: Nominal (no particular order) and Ordinal (follow some order).
As pricing varies with the size, we need to carry out categorical encoding that is representative of the size of the copper tubing. An example could be like so:
XS -> 2
S -> 4
M -> 7
L -> 10
XL -> 12
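A small pandas sketch of this ordinal encoding (the sales figures and the exact numeric mapping are made up for illustration):

```python
import pandas as pd

# Hypothetical sales data with an ordinal size feature
df = pd.DataFrame({"size": ["XS", "S", "M", "L", "XL"],
                   "units_sold": [120, 340, 560, 410, 150]})

# Ordinal encoding that preserves the size ordering
size_map = {"XS": 2, "S": 4, "M": 7, "L": 10, "XL": 12}
df["size_encoded"] = df["size"].map(size_map)
print(df)
```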
Here is a good reference for understanding categorical encoding:
https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
One-hot Encoding
- A one hot encoding is a representation of categorical variables as binary vectors. Initially, categorical values are mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.
One-hot encoding would not capture the price variance with respect to size, so this option is incorrect.
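For contrast, a quick look at what one-hot encoding of the same feature would produce: each size becomes its own 0/1 column, with no notion of ordering or relative magnitude:

```python
import pandas as pd

sizes = pd.Series(["XS", "S", "M", "L", "XL"], name="size")
print(pd.get_dummies(sizes))  # one binary column per category
```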
Quantile Binning
Interval Binning
Binning: Binning is the process of converting numeric data into categorical data. It is one of the methods used in feature engineering. Binning comes in very handy for numeric features, especially when it is one with wide range.
Quantile Binning: Quantiles are values that divide the data into equal portions. For example, the median divides the data in halves; half the data are smaller, and half larger than the median. The quartiles divide the data into quarters, the deciles into tenths, etc.
Interval Binning: Each bin contains a specific numeric range. For example, we can group a person’s age into decades: 0–9 years old will be in bin 1, 10–19 years fall will be in bin 2.
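A brief pandas illustration of the two binning styles on a made-up age column (pd.cut for interval binning, pd.qcut for quantile binning):

```python
import pandas as pd

ages = pd.Series([3, 7, 15, 22, 34, 45, 58, 61, 72, 88])

# Interval binning: fixed-width numeric ranges (decades)
interval_bins = pd.cut(ages, bins=range(0, 100, 10), right=False)

# Quantile binning: each bin holds roughly the same number of observations
quantile_bins = pd.qcut(ages, q=4)

print(pd.DataFrame({"age": ages, "interval": interval_bins, "quantile": quantile_bins}))
```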
Binning (of any type) is not relevant in this case, so both these options are incorrect.
Please refer to the link below for more information on binning:
https://medium.com/hacktive-devs/feature-engineering-in-machine-learning-part-1-a3904769cd93
XGBoost
XGBoost algorithm cannot be used to build a click prediction system.
XGBoost is an extreme gradient boosting algorithm that is optimized for boosted decision trees.
Kinesis Data Analytics
Kinesis Data Analytics enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The four most common use cases are
- streaming extract-transform-load (ETL),
- continuous metric generation,
- responsive real-time analytics, and
- interactive querying of data streams.
- Kinesis Data Analytics cannot directly consume the incoming video stream data.
can be used ONLY on streaming data
PCA
K-means
unsupervised dimensionality reduction techniques
Imputing Missing Data: Machine Learning
KNN: Find K "nearest" (most similar) rows and average their values. Assumes numerical data, not categorical
There are ways to handle categorical data (Hamming distance), but categorical data is probably better served by…
Deep Learning: Build a machine learning model to impute data for your machine learning model!
Works well for categorical data. Really well. But it’s complicated.
Regression
Find linear or non-linear relationships between the missing feature and other features
Most advanced technique: MICE (Multiple Imputation by Chained Equations)
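A short scikit-learn sketch of KNN imputation and a MICE-style iterative imputation on a tiny made-up matrix (IterativeImputer is scikit-learn's experimental MICE-like implementation, not AWS-specific):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required before the next import)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# KNN imputation: average each missing value over the k most similar rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# MICE-style imputation: model each feature with missing values as a
# regression on the other features, iterating until estimates stabilize
mice_filled = IterativeImputer(random_state=0).fit_transform(X)

print(knn_filled)
print(mice_filled)
```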
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine learning for solving classification problems with more than two classes
Residuals
A residual for an observation in the evaluation data is the difference between the true target and the predicted target.
Residuals represent the portion of the target that the model is unable to predict. A positive residual indicates that the model is underestimating the target (the actual target is larger than the predicted target). A negative residual indicates an overestimation (the actual target is smaller than the predicted target).
The residuals plot would indicate any trend of underestimation or overestimation.
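A minimal matplotlib sketch of such a residuals plot, using made-up actual and predicted values:

```python
import numpy as np
import matplotlib.pyplot as plt

y_true = np.array([10.0, 12.5, 9.0, 15.0, 11.0])
y_pred = np.array([9.5, 13.0, 10.0, 14.0, 11.5])

# Positive residual => underestimation, negative residual => overestimation
residuals = y_true - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.show()
```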
Latent Dirichlet Allocation (LDA)
The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics.
You can use LDA to figure out the right categories for each product.
Amazon Algos
Other ways to generate training labels
Rekognition
• AWS service for image recognition • Automatically classify images
Comprehend
• AWS service for text analysis and topic modeling • Automatically classify text by topics, sentiment
• Any pre-trained model or unsupervised technique that may be helpful
MICE
Multiple Imputation by Chained Equations finds relationships between features and is one of the most advanced imputation methods available. Using machine learning techniques such as KNN and deep learning are also good approaches.
Deep Learning Frameworks
Apache MXNet
- is an open-source deep learning software framework, used to train and deploy deep neural networks.
- It is scalable, allowing for fast model training and supports a flexible programming model and multiple programming languages
TensorFlow / Keras: Google
- Supported on AWS
Activation Function
Defines the output of a node/neuron given its input signals
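A few common activation functions sketched in NumPy, applied to an example vector of input signals:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example input signals to a neuron

relu = np.maximum(0.0, z)           # ReLU: 0 for negative inputs, identity otherwise
sigmoid = 1.0 / (1.0 + np.exp(-z))  # Sigmoid: squashes input into (0, 1)
tanh = np.tanh(z)                   # Tanh: squashes input into (-1, 1)

print(relu, sigmoid, tanh, sep="\n")
```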
RMSE
For regression tasks, Amazon ML uses the industry standard root mean square error (RMSE) metric. It is a distance measure between the predicted numeric target and the actual numeric answer (ground truth). The smaller the value of the RMSE, the better is the predictive accuracy of the model. A model with perfectly correct predictions would have an RMSE of 0.
Only gives the magnitude of the error, not its direction (whether the model over- or under-predicts)
AUC
Amazon ML provides an industry-standard accuracy metric for binary classification models called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold.
The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality (‘0’s are predicted as ‘1’s and vice versa).
AUC metric is used for classification models, so this option is not the right fit for the given use-case.
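A quick scikit-learn example of computing AUC from made-up binary labels and model scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]               # hypothetical binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]   # hypothetical model scores / probabilities

# AUC is threshold-independent: it only cares whether positives score higher than negatives
print(roc_auc_score(y_true, y_score))
```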
Reference:
https://docs.aws.amazon.com/machine-learning/latest/dg/regression-model-insights.html
Amazon FSx
Amazon FSx makes it easy and cost effective to launch, run, and scale feature-rich, high-performance file systems in the cloud. It supports a wide range of workloads with its reliability, security, scalability, and broad set of capabilities. Amazon FSx is built on the latest AWS compute, networking, and disk technologies to provide high performance and lower TCO. And as a fully managed service, it handles hardware provisioning, patching, and backups – freeing you up to focus on your applications, your end users, and your business.
Amazon FSx for Lustre
Amazon FSx for Lustre is a file system service. FSx for Lustre speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds. The first time you run a training job, FSx for Lustre automatically copies data from Amazon S3 and makes it available to Amazon SageMaker. You can use the same Amazon FSx file system for subsequent iterations of training jobs, preventing repeated downloads of common Amazon S3 objects.
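A hedged sketch of pointing a SageMaker training job at an FSx for Lustre file system instead of S3 (the file system ID and directory path are placeholders; for Lustre the path typically needs to start with the file system's mount name):

```python
from sagemaker.inputs import FileSystemInput

# Placeholder FSx for Lustre file system that is linked to the S3 training data
train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/fsx/train",       # placeholder; include the Lustre mount name
    file_system_access_mode="ro",
)

# estimator.fit({"train": train_input})  # pass to any SageMaker Estimator
```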
Amazon Elastic File System (Amazon EFS)
If your training data is already in Amazon Elastic File System (Amazon EFS), we recommend using that as your training data source. Amazon EFS has the benefit of directly launching your training jobs from the service without the need for data movement, resulting in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with including different fields or labels in their dataset.