AWS Machine Learning: Part 1B Flashcards

1
Q

AWS Glue

A

Glue cannot write the output in RecordIO-Protobuf format.

Glue Python Shell jobs support libraries such as numpy, pandas, and scikit-learn.

2
Q

Kinesis Data Firehose

A

Kinesis Data Firehose can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics. It is not meant to be used for batch processing use cases and it cannot write data in RecordIO-Protobuf format.

3
Q

Amazon Redshift

A

The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.

4
Q

Precision-Recall Area-Under-Curve (PR AUC)

A

This applies when the dataset is imbalanced, with few instances of the positive class (for example, few actual fraud records). In such scenarios, where we care more about the positive class, PR AUC is the better choice because it is more sensitive to improvements for the positive class.

The PR curve combines precision (PPV) and recall (TPR) in a single visualization: for every threshold, you calculate PPV and TPR and plot the point. The higher the curve sits on the y-axis, the better the model's performance; PR AUC is the area under this curve.
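A minimal sketch of computing PR AUC with scikit-learn; the labels and scores below are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1])                   # hypothetical ground-truth labels
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])    # hypothetical predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)                          # area under the precision-recall curve
```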

5
Q

SageMaker security

A

SageMaker supports specific actions, resources, and condition keys.

Administrators can use AWS JSON policies to specify who has access to what. That is, which principal can perform actions on what resources, and under what conditions. The Action element of a JSON policy describes the actions that you can use to allow or deny access in a policy.

SageMaker does not support resource-based policies.

SageMaker does not support service-linked roles.

6
Q

XGBoost required input

A

XGBoost actually requires LibSVM or CSV input, not RecordIO.

7
Q

Amazon Elasticsearch Service

A

Amazon Elasticsearch Service is a managed service that makes it easy to deploy, operate, and scale Elasticsearch in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics.

NOT a recommendation engine

8
Q

Elbow Method

A

In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
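As a rough illustration, assuming scikit-learn's KMeans and a synthetic dataset, the elbow method plots inertia (within-cluster sum of squares) against the number of clusters k:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                       # synthetic data for illustration
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                 # proxy for explained variation
# plot k vs. inertias and choose the k at the "elbow" of the curve
```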

9
Q

ROC curve (receiver operating characteristic curve)

A

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

True Positive Rate

False Positive Rate
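A short scikit-learn sketch (hypothetical labels and scores) that produces the two plotted parameters:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])               # hypothetical labels
y_scores = np.array([0.2, 0.7, 0.6, 0.3, 0.9])   # hypothetical model scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # FPR vs. TPR at every threshold
auc_value = roc_auc_score(y_true, y_scores)           # area under the ROC curve
```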

10
Q

EMR Cluster

A

Master node: manages the cluster

  • Single EC2 instance

Core node: Hosts HDFS data and runs tasks

  • Can be scaled up & down, but with some risk

Task node: Runs tasks, does not host data

  • No risk of data loss when removing
  • Good use of spot instances
11
Q

Data prep on SageMaker

A

Data usually comes from S3

  • Ideal format varies with algorithm – often it is RecordIO / Protobuf for pre-built models

Can also ingest from Athena, EMR, Redshift and Amazon Keyspaces DB

  • Apache Spark integrates with SageMaker
  • scikit-learn, numpy, pandas all at your disposal within a notebook
12
Q

Training on SageMaker

A

Create a training job

  • URL of S3 bucket with training data
  • ML compute resources
  • URL of S3 bucket for output
  • ECR path to training code

Training options

  • Built-in training algorithms
  • Spark MLLib
  • Custom Python TensorFlow / MXNet code
  • Your own Docker image
  • Algorithm purchased from AWS marketplace
13
Q

Gradient descent (GD)

A

Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. It is commonly used in machine learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g., in linear regression).
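A toy illustration in numpy, fitting a simple linear regression by gradient descent on synthetic data (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

X = np.linspace(0, 1, 50)
y = 3 * X + 2 + 0.1 * np.random.randn(50)        # synthetic targets

w, b, lr = 0.0, 0.0, 0.1                         # parameters and learning rate
for _ in range(1000):
    y_hat = w * X + b
    grad_w = (2 / len(X)) * np.sum((y_hat - y) * X)   # dMSE/dw
    grad_b = (2 / len(X)) * np.sum(y_hat - y)         # dMSE/db
    w -= lr * grad_w                             # step against the gradient
    b -= lr * grad_b
```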

14
Q

LIBSVM format

A

MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format: label index1:value1 index2:value2 …
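A small sketch using scikit-learn's helpers to write and read this format (the file name is a hypothetical placeholder):

```python
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 3.5]])
y = np.array([0, 1])

# writes lines like "1 1:2.0 3:3.5" (only non-zero features are stored)
dump_svmlight_file(X, y, "train.libsvm", zero_based=False)
X_sparse, y_loaded = load_svmlight_file("train.libsvm", zero_based=False)
```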

16
Q

How Containers Serve Requests

A

Containers need to implement a web server that:

  • responds to /invocations and /ping on port 8080.
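A minimal sketch of such a server using Flask; the model-loading and prediction logic are placeholders:

```python
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # health check: return 200 when the container is ready to serve
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.data          # e.g. CSV bytes sent by SageMaker
    prediction = "0.5"              # placeholder for real model inference
    return Response(prediction, status=200, mimetype="text/csv")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```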
17
Q

Converting to LibSVM format

A

Neither Glue ETL nor Kinesis Analytics can convert to LibSVM format.

18
Q

MLlib

A

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as: ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering.

19
Q

Apache NiFi

A

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination.

20
Q

Kinesis Data Analytics

A

Takes input from either Kinesis Data Streams or Kinesis Data Firehose; performs analytics in SQL; lets you define how to change or modify the stream (e.g., aggregations or windowing) and join it with reference data from S3.

Feeds results to analytics tools or output destinations.

Use cases

  • Streaming ETL: select columns, make simple transformations, on streaming data
  • Continuous metric generation: live leaderboard for a mobile game
  • Responsive analytics: look for certain criteria and build alerting (filtering)
21
Q

Machine Learning on Kinesis Data Analytics

A

Random Cut Forest

  • SQL function used for anomaly detection on numeric columns in a stream
  • Example: detect anomalous subway ridership during the NYC marathon
  • Uses recent history to compute model

HOTSPOTS

  • locate and return information about relatively dense regions in your data
  • Example: a collection of overheated servers in a data center
22
Q

Estimator

A

The SageMaker Estimator fit(inputs) method executes the training script. Estimator hyperparameters and fit method inputs are provided to the script as its command-line arguments. The training script saves the model artifacts in /opt/ml/model once training is complete.

The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models, including XGBoost, and offer the simplest solution.
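A hedged sketch using the SageMaker Python SDK; the image URI, role ARN, S3 paths, and hyperparameters are hypothetical placeholders:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # ECR path
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",
    hyperparameters={"max_depth": "5"},
)
# fit() starts the training job; each dict key becomes a data channel
estimator.fit({"train": "s3://my-bucket/train"})
```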

23
Q

Kinesis Data Analytics

A

Kinesis Data Analytics cannot directly ingest incoming stream data; it needs Kinesis Data Streams or Kinesis Data Firehose in front of it as the source.

Kinesis Data Analytics cannot directly write data into S3; output has to go through Kinesis Data Firehose (or a data stream).

24
Q

Kinesis Summary – Machine Learning

A

Kinesis Data Stream: create real-time machine learning applications

Kinesis Data Firehose: ingest massive data near-real time

Kinesis Data Analytics: real-time ETL / ML algorithms on streams

Kinesis Video Stream: real-time video stream to create ML applications

25
Q

Glue Data Catalog

A

Metadata repository for all your tables

  • Automated Schema Inference
  • Schemas are versioned

Integrates with Athena or Redshift Spectrum (schema & data discovery)

Glue Crawlers can help build the Glue Data Catalog

26
Q

Glue Crawlers

A
  • Crawlers go through your data to infer schemas and partitions
  • Works with JSON, Parquet, CSV, and relational stores
  • Crawlers work for: S3, Amazon Redshift, Amazon RDS
  • Run the Crawler on a Schedule or On Demand
  • Need an IAM role / credentials to access the data stores
27
Q

Glue ETL

A

Transform data, Clean Data, Enrich Data (before doing analysis)

  • Generate ETL code in Python or Scala, you can modify the code
  • Can provide your own Spark or PySpark scripts
  • Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog

Fully managed, cost effective, pay only for the resources consumed

  • Jobs are run on a serverless Spark platform

Glue Scheduler to schedule the jobs
Glue Triggers to automate job runs based on “events”

28
Q

Amazon Machine Image (AMI)

A

An Amazon Machine Image (AMI) is a

  • supported and maintained image provided by AWS that provides the information required to launch an instance.
  • You must specify an AMI when you launch an instance.
  • You can launch multiple instances from a single AMI when you require multiple instances with the same configuration.
  • You can use different AMIs to launch instances when you require instances with different configurations.
29
Q

Parquet vs JSON

A

JSON stores data as key-value records, whereas Parquet stores data by column. JSON is therefore a natural fit for storing configuration, while Parquet is useful for storing data in tabular format for analytics.
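For illustration, pandas can write the same DataFrame either way (to_parquet assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2], "score": [0.9, 0.4]})
df.to_json("data.json", orient="records")   # row-oriented key-value records
df.to_parquet("data.parquet")               # columnar storage, better for tabular analytics
```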

30
Q

SageMaker Endpoints

A

A single Amazon SageMaker endpoint cannot run two models

31
Q

Availability and fault tolerance

A

Highly available solution:

  • the system will continue to function even when any component of the architecture stops working
  • a key aspect of high availability is fault tolerance

Fault tolerance

  • when built into an architecture, ensures that applications will continue to function without degradation in performance, despite the complete failure of any component of the architecture
32
Q

Loose coupling

A

With a loosely coupled, distributed system, the failure of one component can be managed in between your application tiers so that the faults do not spread beyond that single point.

Loose coupling is often achieved by

  • making sure application components are independent of each other.
  • For example, you should always decouple your storage layer from your compute layer, because a training job only runs for a limited time while stored data is permanent. Decoupling helps you turn off the compute resources when they are not needed.
33
Q

Loose Coupling

A

Queues are used in loose coupling to pass messages between components

In a general architecture, you can use a queue service like Amazon SQS or workflow service like AWS Step Functions to create a workflow between various components.
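A minimal boto3 sketch of queue-based decoupling; the queue URL and message body are hypothetical:

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

# producer side: hand work off without calling the consumer directly
sqs.send_message(QueueUrl=queue_url, MessageBody='{"job": "train"}')

# consumer side: poll independently, so a failure here does not affect the producer
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=10)
for msg in resp.get("Messages", []):
    # ... process the message, then delete it so it is not redelivered ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```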

34
Q

Boto

A

A Python interface to Amazon Web Services

35
Q

Amazon SNS

A

Amazon Simple Notification Service (Amazon SNS) is a notification service provided as part of Amazon Web Services since 2010. It provides a low-cost infrastructure for mass delivery of messages, predominantly to mobile users.

36
Q

Leveraging AWS services to design for failure

A

You should decouple your ETL process from the ML pipeline. The compute power needed for ML isn’t the same as what you’d need for an ETL process—they have very different requirements.

An ETL process needs to read in files from multiple formats, transform them as needed, and then write them back to a persistent storage. Keep in mind that reading and writing takes a lot of memory and disk I/O, so when you decouple your ETL process, use a framework like Apache Spark, which can handle large amounts of data easily for ETL.

37
Q

Designing highly available and fault-tolerant ML architectures.

A
  • You can deploy your Amazon SageMaker-built models to an Amazon SageMaker endpoint.
  • Once created, you need to invoke the endpoint outside the Amazon SageMaker notebook instance with appropriate input (the model signature). These input parameters can be in a file format such as CSV and LIBSVM, as well as an audio, image, or video file.
  • You can use AWS Lambda and Amazon API Gateway to format the input request and invoke the endpoint from the web, as in the sketch below.
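A rough sketch of the Lambda side, assuming a hypothetical endpoint name and a CSV payload forwarded by API Gateway:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",     # hypothetical endpoint name
        ContentType="text/csv",
        Body=event["body"],             # request body forwarded by API Gateway
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": prediction}
```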
38
Q

Sagemaker Endpoints

A
  • If the endpoint has only a moderate load, you can run it on a single instance and still get good performance.
  • Use automatic scaling to ensure high availability during traffic fluctuations without having to constantly provision for peak traffic.
  • For production workloads, use at least two instances. Because Amazon SageMaker automatically spreads the endpoint instances across multiple Availability Zones, a minimum of two instances ensures high availability and provides individual fault tolerance.
39
Q

P3

A

Amazon EC2 P3 instances: GPU

  • deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs
  • and up to 100 Gbps of networking throughput for machine learning and HPC applications.
40
Q

HPC

A

High Performance Computing

41
Q

AWS Deep Learning Containers (AWS DL Containers)

A

AWS Deep Learning Containers (AWS DL Containers)

  • Docker images pre-installed with deep learning frameworks to make it easy to deploy custom machine learning (ML) environments quickly by letting you skip the complicated process of building and optimizing your environments from scratch.
  • AWS DL Containers support TensorFlow, PyTorch, Apache MXNet.
  • You can deploy AWS DL Containers on Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes on Amazon EC2, Amazon Elastic Container Service (Amazon ECS).
  • The containers are available through Amazon Elastic Container Registry (Amazon ECR) and AWS Marketplace at no cost–you pay only for the resources that you use.
42
Q

Amazon SageMaker Spark containers

A

Amazon SageMaker now supports Apache Spark as a pre-built big data processing container. You can now use this container with Amazon SageMaker Processing and take advantage of a fully managed Spark environment for data processing or feature engineering workloads.
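A hedged sketch with the SageMaker Python SDK's PySparkProcessor; the role ARN, script name, S3 paths, and framework version are assumptions to adapt:

```python
from sagemaker.spark.processing import PySparkProcessor

processor = PySparkProcessor(
    base_job_name="spark-feature-eng",
    framework_version="3.1",            # Spark version; check what the SDK supports
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical
    instance_count=2,
    instance_type="ml.m5.xlarge",
)
processor.run(
    submit_app="preprocess.py",         # your PySpark script
    arguments=["--input", "s3://my-bucket/raw", "--output", "s3://my-bucket/features"],
)
```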

43
Q

Security in Amazon SageMaker

A

Amazon SageMaker supports

  • IAM role-based access to secure your artifacts in Amazon S3, where you can set different roles for different parts of the process.
  • For instance, a certain data scientist can have access to PII information in the raw data bucket, but the DevOps engineer only has access to the trained model itself.
  • This approach helps you restrict access to the user(s) who need it.
  • For the data scientist, you can use a notebook execution role for creating and deleting notebooks, and a training job execution role to run the training jobs.
44
Q

SageMaker Security

A

Amazon SageMaker also encrypts data at rest

Along with IAM roles to prevent unwanted access, Amazon SageMaker also encrypts data at rest with either

  • AWS Key Management Service (AWS KMS)
  • or a transient key if the key isn’t provided
  • and encrypts data in transit with TLS 1.2 for all other communication. Users connect to notebook instances using AWS SigV4 authentication so that any connection remains secure, and any API call you make is executed over an SSL connection.
45
Q

Create a model in Amazon SageMaker

A

You need:

  • The Amazon S3 path where the model artifacts are stored
  • The Docker registry path for the image that contains the inference code
  • A name that you can use for subsequent deployment steps
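A boto3 sketch that supplies those three pieces; the image URI, S3 path, and role ARN are hypothetical:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_model(
    ModelName="my-model",   # name reused in the endpoint configuration
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
        "ModelDataUrl": "s3://my-bucket/output/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)
```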
46
Q

Create an endpoint configuration for an HTTPS endpoint

A

You need:

  • The name of one or more models in production variants
  • The ML compute instances that you want Amazon SageMaker to launch to host each production variant.
  • When hosting models in production, you can configure the endpoint to elastically scale the deployed ML compute instances.
  • For each production variant, you specify the number of ML compute instances that you want to deploy. When you specify two or more instances, Amazon SageMaker launches them in multiple Availability Zones. This ensures continuous availability. Amazon SageMaker manages deploying the instances.
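A boto3 sketch of an endpoint configuration with a single production variant (names and instance sizes are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 2,   # two or more spreads instances across Availability Zones
    }],
)
```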
47
Q

Create an HTTPS endpoint

A

You need to provide the endpoint configuration to Amazon SageMaker. The service launches the ML compute instances and deploys the model or models as specified in the configuration.
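The corresponding boto3 call, reusing the hypothetical names from the previous cards:

```python
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint(EndpointName="my-endpoint", EndpointConfigName="my-endpoint-config")
# SageMaker now launches the ML compute instances and deploys the model(s)
```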

48
Q

Define and apply a scaling policy that uses Amazon CloudWatch metrics

A
  • Automatic scaling uses the policy to adjust the number of instances up or down in response to actual workloads
  • You can use the AWS Management Console to apply a scaling policy based on a predefined metric
  • A predefined metric is defined in an enumeration so you can specify it by name in code or use it in the console
  • Always load-test your automatic scaling configuration to ensure that it works correctly before using it to manage production traffic
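A hedged boto3 sketch of registering the variant and attaching a target-tracking policy on the predefined SageMakerVariantInvocationsPerInstance metric; the endpoint/variant names and capacity limits are placeholders:

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"   # hypothetical endpoint and variant

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)
aas.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,   # placeholder; derive the real value from load testing
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```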
49
Q

SageMaker Model deployment

A

An Amazon SageMaker model cannot be called directly from API Gateway; it needs a compute resource such as Lambda in between to invoke the endpoint.

50
Q

Standard Scaler

A

Performs scaling and shifting/centering.

Used to scale numerical data.
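A one-line sketch with scikit-learn on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])   # made-up numerical features
X_scaled = StandardScaler().fit_transform(X)               # each column: mean 0, unit variance
```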

51
Q

Xavier

A

An initialization technique (for neural network weights), not an optimization technique.

52
Q

AWS Batch

A

AWS Batch

  • Run batch jobs as Docker images
  • Dynamic provisioning of the instances (EC2 & Spot Instances)
  • Optimal quantity and type based on volume and requirements
  • No need to manage clusters, fully serverless
  • You just pay for the underlying EC2 instances
  • Schedule Batch Jobs using CloudWatch Events
  • Orchestrate Batch Jobs using AWS Step Functions
53
Q

AWS Batch vs Glue

A

Glue

  • Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
  • Glue ETL - Do not worry about configuring or managing the resources
  • Data Catalog to make the data available to Athena or Redshift Spectrum

Batch:

  • For any computing job regardless of the job (must provide Docker image)
  • Resources are created in your account, managed by Batch
  • For any non-ETL related work, Batch is probably better
54
Q

DMS - Database Migration Service

A

Quickly and securely migrate databases to AWS; resilient and self-healing.

The source database remains available during the migration

Supports:

  • Homogeneous migrations: ex Oracle to Oracle
  • Heterogeneous migrations: ex Microsoft SQL Server to Aurora

Continuous Data Replication using CDC

You must create an EC2 instance to perform the replication tasks

Change data capture (CDC) is the process of recognising when data has been changed in a source system so a downstream process or system can action that change

55
Q

AWS DMS vs Glue

A

DMS: continuous data replication (Change Data Capture) without data transformation; once the data is in AWS, you can use Glue to transform it.

Glue: Spark-based batch ETL with data transformation; not a continuous replication tool.
56
Q

EMRFS

A

The EMR File System (EMRFS) is an

  • implementation of the Hadoop file system that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3.
  • EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like data encryption.
57
Q

Bootstrapping

A

Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of that population, using replacement during the sampling process.
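A small numpy illustration on a synthetic sample, estimating a confidence interval for the mean by resampling with replacement:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)    # the observed sample (synthetic)

boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()   # resample WITH replacement
    for _ in range(1000)
]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])        # bootstrap 95% CI for the mean
```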

58
Q

Interface vs Gateway Endpoints

A

An interface endpoint is an elastic network interface with a private IP address from the IP address range of your subnet. It serves as an entry point for traffic destined to a service that is owned by AWS or owned by an AWS customer or partner.

A gateway endpoint is a gateway that is a target for a route in your route table used for traffic destined to either Amazon S3 or DynamoDB.

59
Q

Kinesis Client Library (KCL)

A

KCL helps you consume and process data from a Kinesis data stream by taking care of many of the complex tasks associated with distributed computing. These include load balancing across multiple consumer application instances, responding to consumer application instance failures, checkpointing processed records, and reacting to resharding. The KCL takes care of all of these subtasks so that you can focus your efforts on writing your custom record-processing logic.