AWS Machine Learning_ part 1 B Flashcards
AWS Glue
Glue cannot write the output in RecordIO-Protobuf format.
Python Shell supports Glue jobs relying on libraries such as numpy, pandas and sklearn.
Kinesis Data Firehose
Kinesis Data Firehose can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics. It is not meant for batch processing use cases, and it cannot write data in RecordIO-Protobuf format.
Amazon Redshift
The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
Precision-Recall Area-Under-Curve (PR AUC)
Use PR AUC when the dataset is imbalanced with few instances of the positive class, for example a fraud dataset with only a small number of actual fraud records. When we care more about the positive class, PR AUC is the better choice because it is more sensitive to improvements on the positive class.
The PR curve combines precision (PPV) and recall (TPR) in a single visualization: for every threshold, you calculate PPV and TPR and plot the point. PR AUC is the area under that curve. The higher the curve sits on the y-axis, the better the model performs.
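A minimal sketch of the "for every threshold, calculate PPV and TPR" step, in plain Python. The labels and scores are made-up illustration data, and the step-wise sum used here (often called average precision) is only one common way to estimate the area.

```python
def pr_curve(y_true, scores):
    """Return (recall, precision) points, one per distinct threshold, highest first."""
    total_pos = sum(y_true)
    points = []
    for thr in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted positive.
        tp = sum(1 for s, y in zip(scores, y_true) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= thr and y == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because thr is one of the scores
        recall = tp / total_pos
        points.append((recall, precision))
    return points


def average_precision(points):
    """Step-wise area under the PR curve: precision weighted by the gain in recall."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Hypothetical imbalanced data: 2 positives (fraud) among 10 records.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

points = pr_curve(y_true, scores)
ap = average_precision(points)
print(round(ap, 4))
```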
SageMaker security
SageMaker supports specific actions, resources, and condition keys.
Administrators can use AWS JSON policies to specify who has access to what. That is, which principal can perform actions on what resources, and under what conditions. The Action element of a JSON policy describes the actions that you can use to allow or deny access in a policy.
SageMaker does not support resource-based policies.
SageMaker does not support service-linked roles.
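As an illustration of the Action element described above, here is a sketch that builds an identity-based policy as a Python dict and serializes it to JSON. The helper name and the wildcard resource are assumptions for the example; a real policy would normally scope the resource more tightly.

```python
import json


def sagemaker_training_policy(resource_arn="*"):
    """Build a policy document allowing two SageMaker training actions.

    resource_arn is a placeholder; "*" is used only to keep the example short.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The Action element lists the operations being allowed.
                "Action": [
                    "sagemaker:CreateTrainingJob",
                    "sagemaker:DescribeTrainingJob",
                ],
                "Resource": resource_arn,
            }
        ],
    }


policy_json = json.dumps(sagemaker_training_policy(), indent=2)
print(policy_json)
```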
XGBoost required input
XGBoost actually requires LibSVM or CSV input, not RecordIO.
Amazon Elasticsearch Service
Amazon Elasticsearch Service is a managed service that makes it easy to deploy, operate, and scale Elasticsearch in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics.
It is NOT a recommendation engine.
Elbow Method
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
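A small sketch of the elbow method on one-dimensional data, in plain Python. It tracks the within-cluster sum of squares (the unexplained variation, so the curve falls rather than rises), and it picks the elbow as the first k where adding a cluster improves the fit by less than a tolerance. The data, the deterministic initialisation, and the 10% tolerance are all assumptions made for the example.

```python
def kmeans_1d(points, k, iters=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, inertia)."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic initialisation: spread the k centroids over the sorted data.
    centroids = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in pts:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Move each centroid to its cluster mean; keep it in place if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
    inertia = sum(min((x - c) ** 2 for c in centroids) for x in pts)
    return centroids, inertia


def pick_elbow(inertias, tol=0.1):
    """Return the first k where going to the next k gains less than tol of the k=min inertia."""
    ks = sorted(inertias)
    total = inertias[ks[0]]
    for k, k_next in zip(ks, ks[1:]):
        if (inertias[k] - inertias[k_next]) / total < tol:
            return k
    return ks[-1]


# Hypothetical data with three well-separated groups.
DATA = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]

inertias = {k: kmeans_1d(DATA, k)[1] for k in range(1, 6)}
best_k = pick_elbow(inertias)
print(best_k)
```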
ROC curve (receiver operating characteristic curve)
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate
False Positive Rate
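The two rates can be computed per threshold as in the sketch below, which also estimates the area under the ROC curve with the trapezoid rule. The labels and scores are made-up illustration data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, starting at (0, 0), one per distinct threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, y_true) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, y_true) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores: 2 positives among 10 records.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

roc = roc_points(y_true, scores)
roc_auc = auc(roc)
print(roc_auc)
```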
EMR Cluster
Master node: manages the cluster
- Single EC2 instance
Core node: Hosts HDFS data and runs tasks
- Can be scaled up & down, but with some risk
Task node: Runs tasks, does not host data
- No risk of data loss when removing
- Good use of spot instances
Data prep on SageMaker
Data usually comes from S3
- Ideal format varies with algorithm – often it is RecordIO / Protobuf for pre-built models
Can also ingest from Athena, EMR, Redshift and Amazon Keyspaces DB
- Apache Spark integrates with SageMaker
- scikit-learn, numpy, pandas all at your disposal within a notebook
Training on SageMaker
Create a training job
- URL of S3 bucket with training data
- ML compute resources
- URL of S3 bucket for output
- ECR path to training code
Training options
- Built-in training algorithms
- Spark MLlib
- Custom Python TensorFlow / MXNet code
- Your own Docker image
- Algorithm purchased from AWS marketplace
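The four inputs of a training job listed above map onto the request for the SageMaker CreateTrainingJob API roughly as in this sketch. It only builds the request dict; every name, ARN, URI, and instance setting passed in is a hypothetical placeholder.

```python
def build_training_job_request(
    job_name,
    image_uri,
    role_arn,
    train_s3_uri,
    output_s3_uri,
    instance_type="ml.m5.xlarge",
    instance_count=1,
):
    """Assemble a CreateTrainingJob request from the pieces a training job needs."""
    return {
        "TrainingJobName": job_name,
        # ECR path to the training code (a Docker image).
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # URL of the S3 bucket with training data.
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        # URL of the S3 bucket for output.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # ML compute resources.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# All values below are placeholders, not real resources.
request = build_training_job_request(
    job_name="example-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-image:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("sagemaker").create_training_job(**request)`.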
Gradient descent (GD)
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g. in a linear regression).
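A minimal sketch of gradient descent minimising the mean squared error of a one-variable linear regression. The data, learning rate, and step count are assumptions chosen so the example converges.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```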
LIBSVM format
MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 …
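To make the line format concrete, here is a sketch that parses one LIBSVM line and writes a dense vector back out as one. The helper names are hypothetical, and the writer assumes one-based indices with zero-valued features omitted.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def to_libsvm_line(label, dense):
    """Write a dense feature list as a sparse line, skipping zeros (indices start at 1)."""
    pairs = [f"{i}:{v}" for i, v in enumerate(dense, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


line = to_libsvm_line(1, [0.0, 0.5, 0.0, 2.0])
parsed = parse_libsvm_line(line)
print(line)
```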
How Containers Serve Requests
Containers need to implement a web server that:
- responds to /invocations and /ping on port 8080.
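A sketch of such a web server using only the Python standard library. The JSON input shape and the "model" (it just sums the input numbers) are placeholders, and real containers typically use a production-grade server instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle(method, path, body):
    """Route a request; returns (status_code, response_dict)."""
    if method == "GET" and path == "/ping":
        # Health check: a 200 response signals the container is ready.
        return 200, {"status": "ok"}
    if method == "POST" and path == "/invocations":
        features = json.loads(body)
        # Placeholder "model": sum of the input numbers.
        return 200, {"prediction": sum(features)}
    return 404, {"error": "not found"}


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._reply(*handle("GET", self.path, b""))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._reply(*handle("POST", self.path, self.rfile.read(length)))


def make_server(port=8080):
    """Create (but do not start) a server listening on the given port."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` would start listening on port 8080.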
Converting to LibSVM format
Neither Glue ETL nor Kinesis Analytics can convert to LibSVM format.
MLlib
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as: ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.
Apache NiFi
Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination.
Kinesis Analytics
Takes input from either Kinesis Data Streams or Kinesis Data Firehose. You perform analytics in SQL, defining how to change or modify the stream (e.g. aggregation or windowing), and can join the stream with reference data from S3.
The results are fed to analytics tools or output destinations.
Use cases
- Streaming ETL: select columns, make simple transformations, on streaming data
- Continuous metric generation: live leaderboard for a mobile game
- Responsive analytics: look for certain criteria and build alerting (filtering)
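The windowing idea behind continuous metric generation can be sketched in plain Python as a tumbling-window aggregation. This is a stand-in for illustration, not Kinesis SQL; the event tuples and the 60-second window are assumptions.

```python
from collections import defaultdict


def tumbling_window_totals(events, window_seconds=60):
    """Sum points per player within fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, player, points) tuples.
    Returns {window_start: {player: total_points}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for ts, player, points in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = (ts // window_seconds) * window_seconds
        totals[window_start][player] += points
    return {w: dict(players) for w, players in totals.items()}


# Hypothetical score events from a mobile game.
events = [
    (5, "ana", 10),
    (42, "bo", 7),
    (59, "ana", 3),
    (61, "bo", 20),
    (119, "ana", 1),
]

leaderboard = tumbling_window_totals(events)
print(leaderboard)
```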
Machine Learning on Kinesis Data Analytics
Random Cut Forest
- SQL function used for anomaly detection on numeric columns in a stream
- Example: detect anomalous subway ridership during the NYC marathon
- Uses recent history to compute model
HOTSPOTS
- locate and return information about relatively dense regions in your data
- Example: a collection of overheated servers in a data center
Estimator
The SageMaker Estimator fit(inputs) method executes the training script. The Estimator hyperparameters and the fit method inputs are provided to the script as command line arguments. The training script saves the model artifacts in /opt/ml/model once training is completed.
The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models including XGBoost, and offer the simplest solution.
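A sketch of what a training script on the receiving end of fit() might look like, assuming hyperparameters arrive as `--name value` arguments. The hyperparameter names, the `--model-dir` argument, and the JSON "model" file are hypothetical; no real training happens here.

```python
import argparse
import json
import os


def parse_args(argv=None):
    """Read hyperparameters from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # Directory where model artifacts are saved once training is completed.
    parser.add_argument("--model-dir", default="/opt/ml/model")
    return parser.parse_args(argv)


def train(args):
    """Placeholder training step: records the hyperparameters as the 'model'."""
    os.makedirs(args.model_dir, exist_ok=True)
    model_path = os.path.join(args.model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump({"epochs": args.epochs, "learning_rate": args.learning_rate}, f)
    return model_path
```

A script like this would end with `train(parse_args())` so that it runs when the training job starts.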
Kinesis Data Analytics
Kinesis Data Analytics cannot directly ingest incoming stream data; it reads from Kinesis Data Streams or Kinesis Data Firehose.
Kinesis Data Analytics cannot write data directly into S3.