AWS ML Certification: Test Questions Flashcards

1
Q

You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?

A

The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models, including XGBoost, and offer the simplest solution.

You cannot deploy SageMaker to an EMR cluster, and XGBoost actually requires LibSVM or CSV input, not RecordIO.
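A minimal sketch of that integration, assuming the sagemaker_pyspark library, a pre-processed Spark DataFrame with "label" and "features" columns, and hypothetical S3 paths and IAM role ARN (setter names reflect the library and may differ by version):

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Make the SageMaker Spark JARs available to the Spark session
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

# Pre-process at scale with Spark; the resulting DataFrame needs "label" and "features" columns
df = spark.read.parquet("s3://my-bucket/preprocessed/")  # hypothetical path

estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # hypothetical role
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
)
estimator.setNumRound(100)

# fit() launches a SageMaker training job directly from the Spark DataFrame
model = estimator.fit(df)
```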

2
Q

Your automatic hyperparameter tuning job in SageMaker is consuming more resources than you would like, and coming at a high cost. What are TWO techniques that might reduce this cost?

A

Use less concurrency while searching, and use logarithmic scales for your hyperparameter ranges. Since the tuning process learns from each incremental step, too much concurrency can actually hinder that learning. Logarithmic ranges tend to find optimal values more quickly than linear ranges. Inference pipelines exist, but have nothing to do with this problem.
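A minimal sketch of both techniques with the SageMaker Python SDK, assuming an already-configured estimator named xgb (metric and range names are illustrative):

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    # Logarithmic scaling explores orders of magnitude efficiently
    "eta": ContinuousParameter(0.001, 0.5, scaling_type="Logarithmic"),
    "alpha": ContinuousParameter(0.001, 2.0, scaling_type="Logarithmic"),
    "max_depth": IntegerParameter(3, 10),
}

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=2,  # low concurrency lets the Bayesian search learn from earlier jobs
)
```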

3
Q

Your company wishes to monitor social media, and perform sentiment analysis on Tweets to classify them as positive or negative sentiment. You are able to obtain a data set of past Tweets about your company to use as training data for a machine learning system, but they are not classified as positive or negative. How would you build such a system?

A

A machine learning system needs labeled data to train itself with; there’s no getting around that. Only the Ground Truth answer produces the positive or negative labels we need, by using humans to create that training data initially. Another solution would be to use natural language processing through a service such as Amazon Comprehend.

4
Q

A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably. Which system would provide the most cost-effective and simplest solution?

A

The use of spot instances in response to anticipated surges in usage is the most cost-effective approach for scaling up an EMR cluster. Kinesis Streams would be over-engineering because we do not have a real-time streaming requirement. Elasticsearch doesn't make sense because it is not a recommender engine.

5
Q

You are developing an autonomous vehicle that must classify images of street signs with extremely low latency, processing thousands of images per second. What AWS-based architecture would best meet this need?

A

SageMaker Neo is designed for compiling models using TensorFlow and other frameworks to edge devices such as Nvidia Jetson. The low latency requirement requires an edge solution, where the classification is being done within the vehicle itself and not over the air. Rekognition (which doesn’t have an “edge mode,” but does integrate with DeepLens) can’t handle the very specific classification task of identifying different street signs and what they mean.

6
Q

You are developing a machine learning model to predict house sale prices based on features of a house. 10% of the houses in your training data are missing the number of square feet in the home. Your training data set is not very large. Which technique would allow you to train your model while achieving the highest accuracy?

A

Impute the missing values with kNN. Deep learning is better suited to imputing categorical data; square footage is numerical, which is better served by kNN. While simply dropping rows with missing data or substituting the mean is a lot easier, those approaches won't produce the best results.

7
Q

A ride-hailing company needs to ingest and store certain attributes of real-time automobile health data which is in JSON format. The company does not want to manage the underlying infrastructure and it wants the data to be available for visualization on a near real time basis.
As an ML specialist, what is your recommendation so that the solution requires the least development time and infrastructure management?

A

Ingest the data using Kinesis Firehose that uses a Lambda function to write the selected attributes from the input data stream into an S3 location. Further pipe this processed data into QuickSight for visualizations
Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. Kinesis Data Firehose manages all underlying infrastructure, storage, networking, and configuration needed to capture and load your data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, or Splunk. You do not have to worry about provisioning, deployment, ongoing maintenance of the hardware, software, or write any other application to manage this process.

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.

This is the correct option as it can be used to process the streaming JSON data via Kinesis Firehose that uses a Lambda to write the selected attributes as JSON data into an S3 location. You should note that Firehose offers built-in integration with intermediary lambda functions to handle any transformations. This transformed data is then consumed in QuickSight for visualizations.
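A minimal sketch of the intermediary transformation Lambda, assuming JSON records and a hypothetical list of attributes to keep; the recordId / result / base64 data envelope is the standard Firehose transformation contract:

```python
import base64
import json

KEEP = ["vehicle_id", "engine_temp", "oil_pressure", "timestamp"]  # hypothetical attributes

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        filtered = {k: payload[k] for k in KEEP if k in payload}
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(filtered) + "\n").encode()).decode(),
        })
    return {"records": output}
```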

8
Q

AWS Lambda

A

Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring and logging. With Lambda, you can run code for virtually any type of application or backend service. All you need to do is supply your code in one of the languages that Lambda supports.

Lambda functions are not meant to handle long-running, large-scale ETL workloads, so this option is also ruled out.

9
Q

Incremental Training

A

Over time, you might find that a model generates inferences that are not as good as they were in the past. With incremental training, you can use the artifacts from an existing model and use an expanded dataset to train a new model. Incremental training saves both time and resources.

You can use incremental training to:

Train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance.

Use the model artifacts or a portion of the model artifacts from a popular publicly available model in a training job. You don’t need to train a new model from scratch.

Resume a training job that was stopped.

Train several variants of a model, either with different hyperparameter settings or using different datasets.

You can read more on this reference link -

https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html

10
Q

The data science team at a leading Questions and Answers website wants to improve the user experience and therefore would like to identify duplicate questions based on similarity of the text found in a given question.

As an ML Specialist, which SageMaker algorithm would you recommend to help solve this problem?

A

Object2Vec

Object2Vec can be used to find semantically similar objects such as questions. BlazingText Word2Vec can only find semantically similar words. Factorization Machines and XGBoost are not fit for this use-case.

Object2Vec : The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.

A good reference read for Object2Vec:

https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/

11
Q

Factorization Machines

A

Factorization Machines - The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

12
Q

Kinesis Data Analytics

A

Kinesis Data Analytics cannot directly write data into S3; its output must be sent to a destination such as Kinesis Data Streams, Kinesis Data Firehose, or a Lambda function.

13
Q

Authorisations for Amazon SageMaker

A

Amazon SageMaker supports authorization based on resource tags - You can attach tags to SageMaker resources or pass tags in a request to SageMaker. To control access based on tags, you provide tag information in the condition element of a policy using the sagemaker:ResourceTag/key-name, aws:RequestTag/key-name, or aws:TagKeys condition keys.

SageMaker does not support resource-based policies, so this option is incorrect.

SageMaker doesn’t support service-linked roles, so this option is incorrect.

14
Q

IAM features for Sagemaker

A

With IAM (Identity and Access Management) identity-based policies, you can specify allowed or denied actions and resources as well as the conditions under which actions are allowed or denied for Amazon SageMaker. SageMaker supports specific actions, resources, and condition keys. Administrators can use AWS JSON policies to specify who has access to what - that is, which principal can perform actions on what resources, and under what conditions. The Action element of a JSON policy describes the actions that you can use to allow or deny access in a policy.

15
Q

Shard

A

A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards. If your data rate increases, you can increase or decrease the number of shards allocated to your stream. For more information, see Resharding a Stream.

A Kinesis data stream is a set of shards. Each shard has a sequence of data records. Each data record has a sequence number that is assigned by Kinesis Data Streams

16
Q

How many shards would a Kinesis Data Streams application need if the average record size is 500KB and 2 records per second are being written into this application that has 7 consumers?

A

4

number_of_shards = max (incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)

where

incoming_write_bandwidth_in_KB = average_data_size_in_KB multiplied by the number_of_records_per_seconds. = 500 * 2 = 1000

outgoing_read_bandwidth_in_KB = incoming_write_bandwidth_in_KB multiplied by the number_of_consumers = 1000 * 7 = 7000

So, number_of_shards = max(1000/1000, 7000/2000) = max(1, 3.5) = 4

So, 4 shards are needed to address this use case.
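The same arithmetic, as a quick Python check:

```python
import math

avg_record_size_kb = 500
records_per_second = 2
consumers = 7

incoming_write_kb = avg_record_size_kb * records_per_second  # 1000 KB/s
outgoing_read_kb = incoming_write_kb * consumers              # 7000 KB/s

# one shard supports 1000 KB/s of writes and 2000 KB/s of reads
number_of_shards = math.ceil(max(incoming_write_kb / 1000, outgoing_read_kb / 2000))
print(number_of_shards)  # 4
```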

17
Q

An e-commerce company is looking for a solution that detects anomalies in order to identify fraudulent transactions. The company utilizes Amazon Kinesis to transfer JSON-formatted transaction records from its on-premises database to Amazon S3. The existing dataset comprises 200-column wide records for each transaction. To identify fraudulent transactions, the solution needs to analyze just 20 of these columns.

Which of the following would you suggest as the lowest-cost solution that needs the least development work and offers out-of-the-box anomaly detection functionality?

A

Transform the data from JSON format to Apache Parquet format using an AWS Glue job. Configure AWS Glue crawlers to discover the schema and build the AWS Glue Data Catalog. Leverage Amazon Athena to create a table with a subset of columns. Set up Amazon QuickSight for visual analysis of the data and identify fraudulent transactions using QuickSight’s built-in machine learning-powered anomaly detection

For the given use case, you can use an AWS Glue job to extract, transform, and load (ETL) data from the data source (in JSON format) to the data target (in Parquet format). You can then use an AWS Glue crawler, which is a program that connects to a data store (source or target) such as Amazon S3, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.

18
Q

Principal Component Analysis

A

Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. There are various methods for dimensionality reduction such as - Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), etc.

PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. PCA can help in identifying the most relevant derived features for the given use case.

19
Q

K means

A

K-means is used for clustering and it cannot be used to identify the most relevant derived features for the given use case.

K-means is the right algorithm to uncover discrete groupings within the given dataset.

20
Q

Latent Dirichlet Allocation

A

Latent Dirichlet Allocation is used for topic modeling and it cannot be used to identify the most relevant derived features for the given use case.

21
Q

Neural Topic Model

A

Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution.

22
Q

Neural Topic Model

A

SageMaker NTM is used for topic modeling and it cannot be used to identify the most relevant derived features for the given use case.

23
Q

SageMaker Ground Truth

A

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Amazon SageMaker Ground Truth makes it easy for you to efficiently and accurately label the datasets required for training machine learning systems. SageMaker Ground Truth can automatically label a portion of the dataset based on the labels done manually by human labelers.

24
Q

AWS Glue ML Transforms for de-duplication

A

The data science team at an email marketing company has created a data lake with raw and refined zones. The raw zone has the data as it arrives from the source, however, the team wants to de-duplicate the data before it is written into the refined zone.

What is the best way to accomplish this with the least amount of development time and infrastructure maintenance effort?

You can create machine learning transforms to cleanse your data using AWS Glue ML Transforms. You can call these transforms from your ETL script. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.

You can create machine learning transforms to cleanse your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its schema to process your data.

25
Q

AWS Glue Input types not supported

A

AWS Glue jobs can be used to create serverless ETL jobs.

AWS Glue does NOT support Amazon Timestream as a source input type.

AWS Glue cannot import a Docker container with TensorFlow.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development

Glue cannot write the output in RecordIO-Protobuf format.

26
Q

Apache Spark

A

Apache Spark is a distributed processing framework and programming model that helps you do machine learning, stream processing, or graph analytics using Amazon EMR clusters. Amazon EMR is a web service that makes it easy for you to process and analyze vast amounts of data using applications in the Hadoop ecosystem, including Hive, Pig, HBase, Presto, Impala, and others.

Apache Spark (running on the EMR cluster in this use-case) can write the output in RecordIO-Protobuf format.

27
Q

AWS Step Functions

A
  • Use to design workflows
  • Easy visualizations
  • Advanced Error Handling and Retry mechanism outside the code
  • Audit of the history of workflows
  • Ability to “Wait” for an arbitrary amount of time
  • Max execution time of a State Machine is 1 year

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Lambda is not suited for long-running processes such as the task of transforming 1TB data into RecordIO-Protobuf format.

28
Q

Kinesis Data Firehose

A

Kinesis Data Firehose is not meant to be used for batch processing use cases, and it cannot write data in RecordIO-Protobuf format.

29
Q

CSV format for SageMaker

A

Many Amazon SageMaker algorithms support training with data in CSV format. To use data in CSV format for training, in the input data channel specification, specify text/csv as the ContentType. Amazon SageMaker requires that a CSV file does not have a header record and that the target variable is in the first column.
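A minimal sketch with the SageMaker Python SDK, assuming a headerless CSV (target in the first column) already uploaded to a hypothetical S3 location:

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/train.csv",  # hypothetical location
    content_type="text/csv",                   # ContentType for the input data channel
)

# estimator is a built-in algorithm estimator (e.g. XGBoost) configured elsewhere
# estimator.fit({"train": train_input})
```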

30
Q

AWS Batch

A

AWS Batch

AWS Batch is a set of batch management capabilities that dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized compute resources) based on the volume and specific resource requirements of the batch jobs submitted.

With AWS Batch, there is no need to install and manage batch computing software or server clusters, allowing you to instead focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 (available with Spot Instances) and AWS compute resources with AWS Fargate or Fargate Spot.

AWS Batch can be used to configure and schedule resources for the given use case.

31
Q

AWS Glue

A

AWS Glue - AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue consists of a Data Catalog which is a central metadata repository; an ETL engine that can automatically generate Scala or Python code; a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue cannot be used to configure and schedule resources for the given use case.

32
Q

Amazon SageMaker -

A

Amazon SageMaker is a fully managed service that provides the ability to build, train, and deploy machine learning (ML) models quickly.

33
Q

Amazon SageMaker authorisation

A

Amazon SageMaker supports authorization based on resource tags - You can attach tags to SageMaker resources or pass tags in a request to SageMaker. To control access based on tags, you provide tag information in the condition element of a policy using the sagemaker:ResourceTag/key-name, aws:RequestTag/key-name, or aws:TagKeys condition keys.

With IAM identity-based policies, you can specify allowed or denied actions and resources as well as the conditions under which actions are allowed or denied for Amazon SageMaker - With IAM identity-based policies, you can specify allowed or denied actions and resources as well as the conditions under which actions are allowed or denied. SageMaker supports specific actions, resources, and condition keys. Administrators can use AWS JSON policies to specify who has access to what. That is, which principal can perform actions on what resources, and under what conditions. The Action element of a JSON policy describes the actions that you can use to allow or deny access in a policy.

Incorrect options:

Amazon SageMaker supports resource-based policies - SageMaker does not support resource-based policies, so this option is incorrect.

Amazon SageMaker supports service linked roles - SageMaker doesn’t support service-linked roles

34
Q

Inference Pipeline

A

Inference Pipeline can be considered as an Amazon SageMaker model that you can use to make either real-time predictions or to process batch transforms directly without any external preprocessing.

35
Q

Sagemaker Logging and monitoring

A

CloudWatch keeps the SageMaker monitoring statistics for 15 months. However, the Amazon CloudWatch console limits the search to metrics that were updated in the last 2 weeks

SageMaker monitoring metrics are available on CloudWatch at a 1-minute frequency

CloudTrail does not monitor calls to InvokeEndpoint

AWS CloudTrail provides a record of actions taken by a user, role, or an AWS service in Amazon SageMaker. CloudTrail keeps this record for a period of 90 days

Amazon SageMaker is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in SageMaker. CloudTrail captures all API calls for SageMaker, with the exception of InvokeEndpoint, as events. The calls captured include calls from the SageMaker console and code calls to the SageMaker API operations.

You can troubleshoot operational and security incidents over the past 90 days in the CloudTrail console by viewing Event history.

36
Q

First step of setting up a dataset in SageMaker

A

Shuffling: The first step would be to shuffle the dataset. Shuffling helps the training converge fast, prevents any bias during the training and prevents the model from learning the order of the training.

37
Q

data warehouse

A

A data warehouse is a large collection of business data used to help an organization make decisions. A data warehouse periodically pulls data from the business apps and systems; then, the data goes through formatting and import processes to match the data already in the warehouse.

38
Q

data lake

A

A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

39
Q

Image Classification

A

Image Classification cannot be used to create labels for training data.

40
Q

After training a SageMaker XGBoost based model over a huge training dataset, the data science team observed that it has low accuracy on the training data as well as low accuracy on the test data.

A

Use more features in the model

Remove regularization from the model

Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y). Your model is overfitting your training data when you see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.

For the given use case, as the model has low accuracy on the training data as well as low accuracy on the test data, it suggests that the model has a bias, or in other words, the model is underfitting. When a model is underfitting, then adding more features to the model or removing regularization can help in addressing the underlying problem. In case of an underfitting model, adding more training data may or may not help.

41
Q

Precision-Recall Area-Under-Curve (PR AUC)

A

This is an example where the dataset is imbalanced with fewer instances of positive class because of a fewer number of actual fraud records in the dataset. In such scenarios where we care more about the positive class, using PR AUC is a better choice, which is more sensitive to the improvements for the positive class.

PR AUC is a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher your curve sits on the y-axis, the better your model's performance.

Please review these excellent resources for a deep-dive into PR AUC:

https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc
https://machinelearningmastery.com/imbalanced-classification-with-the-fraudulent-credit-card-transactions-dataset/

42
Q

SageMaker Deep AR

A

Based on historical data for a behavior, you can predict future behavior using the DeepAR algorithm. For example, you can predict sales of a new product based on previous sales data. The SageMaker DeepAR algorithm specializes in forecasting new product performance.

43
Q

Factorization Machines

A
  • The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation. Factorization Machines cannot be used to forecast new product sales.
44
Q

Amazon Lex

A

Lex

Amazon Lex is a service for building conversational interfaces using voice and text. Powered by the same conversational engine as Alexa, Amazon Lex provides high quality speech recognition and language understanding capabilities, enabling addition of sophisticated, natural language ‘chatbots’ to new and existing applications.

45
Q

Amazon Polly

A

Polly

Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly’s Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural sounding human speech.

46
Q

Amazon Comprehend

A

Comprehend

Amazon Comprehend is a natural-language processing (NLP) service that uses machine learning to uncover information in unstructured data.

For the given use case, Lex can be used to convert speech to text via automatic speech recognition (ASR) and then further pipe this text to recognize the intent of the text via natural language understanding (NLU) by using the pre-configured Intents and Entities to come back with the most relevant text response. Comprehend can be used to uncover the insights and relationships in your input text. In the end, this text response would be converted to speech via Polly.

47
Q

Specificity

A

Specificity

Specificity = (True Negatives / (True Negatives + False Positives))
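A small worked example with made-up counts:

```python
true_negatives = 90
false_positives = 10

specificity = true_negatives / (true_negatives + false_positives)
print(specificity)  # 0.9
```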

48
Q

AWS Glue ETL

A

A Glue ETL job can transform the source data to Parquet format, but it is best suited for batch ETL use cases and is not meant to process near real-time data.

49
Q

Amazon SageMaker Ground Truth

A

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Amazon SageMaker Ground Truth makes it easy for you to efficiently and accurately label the datasets required for training machine learning systems. SageMaker Ground Truth can automatically label a portion of the dataset based on the labels done manually by human labelers.

So, Ground Truth is the correct service for this use-case.

50
Q

Pipe Mode

A

With Pipe input mode, your dataset is streamed directly to your training instances instead of being downloaded first. This means that your training jobs start sooner, finish quicker, and need less disk space. Amazon SageMaker algorithms have been engineered to be fast and highly scalable.

51
Q

Elastic Compute Cloud (Amazon EC2)

A

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.

52
Q

Amazon EC2 P3

A

Amazon EC2 P3 instances are the next generation of Amazon EC2 GPU compute instances

53
Q

Neo

A

Neo is a capability of Amazon SageMaker that enables machine learning models to train once and run anywhere in the cloud and at the edge.

Since the robot should be able to classify the images of marine life forms in an autonomous way with low latency, Neo is the right fit for the given use case.

54
Q

An inference pipeline

A

An inference pipeline is a Amazon SageMaker model that is composed of a linear sequence of two to fifteen containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pretrained SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.

You can add SageMaker Spark ML Serving and scikit-learn containers that reuse the data transformers developed for training models. The entire assembled inference pipeline can be considered as a SageMaker model that you can use to make either real-time predictions or to process batch transforms directly without any external preprocessing.

55
Q

The Amazon SageMaker Latent Dirichlet Allocation (LDA)

A
  • The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.

LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus.

Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

LDA is used for topic modeling, so it is not the right fit for the given use case.

56
Q

Softmax

A

Softmax is an extension of the Sigmoid activation function. Softmax function adds non-linearity to the output.

The softmax function is a more generalized sigmoid activation function which is used for multiclass classification
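A tiny NumPy illustration of softmax turning multiclass scores into probabilities:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```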

57
Q

Factorization Machine

A

Factorization Machines - The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

Factorization Machines cannot be used with a framework such as TensorFlow.

58
Q

You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?

A

The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models, including XGBoost, and offer the simplest solution. You can't deploy SageMaker to an EMR cluster, and XGBoost actually requires LibSVM or CSV input, not RecordIO.

60
Q

SageMaker: replicating only a subset of the dataset

A

When you create a training job with the API, SageMaker replicates the entire dataset on ML compute instances by default. To make SageMaker replicate a subset of the data on each ML compute instance, you must set the S3DataDistributionType field to:

ShardedByS3Key.

You can set this field using the low-level SDK. For more information, see S3DataDistributionType in S3DataSource.
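A minimal sketch with the SageMaker Python SDK, assuming a hypothetical S3 prefix; ShardedByS3Key sends a different subset of the S3 objects to each training instance instead of a full copy:

```python
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",   # hypothetical prefix containing many objects
    distribution="ShardedByS3Key",     # default is "FullyReplicated"
    content_type="text/csv",
)

# estimator.fit({"train": train_input})
```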

61
Q

Incremental Training using SageMaker

A

You can train incrementally using the SageMaker console or the Amazon SageMaker Python SDK. Only three built-in algorithms currently support incremental training:

  • Object Detection Algorithm,
  • Image Classification Algorithm, and
  • Semantic Segmentation Algorithm.
62
Q

What is the criteria on which early stopping works in Amazon SageMaker?

A

To optimize your hyperparameter tuning job, you can stop the training jobs that it launches early when they are not improving significantly as measured by the objective metric. Stopping training jobs early can help reduce compute time and helps you avoid overfitting your model.

When you enable early stopping for a hyperparameter tuning job, SageMaker uses the following criteria:

If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.

63
Q

You are working on a fraud detection model based on SageMaker IP Insights algorithm with a training data set of 1TB in CSV format. Your Sagemaker Notebook instance has only 5GB of space.

A

The correct option is to shuffle the training data and create a 5GB slice of this shuffled data. Then you need to build your model on the Jupyter Notebook using this slice of training data. Once the evaluation metric looks good, you need to create a training job on SageMaker infrastructure with the appropriate instance types and instance counts to handle the entire training data.

64
Q

can AWS glue write output in recordIO protobuf format?

A

AWS Glue job cannot write output in recordIO-protobuf format, so this option is ruled out

65
Q

what file type can IP insights use?

A

IP Insights algorithm supports only CSV file type as training data, so both these options using parquet or recordIO-protobuf are ruled out.

66
Q

N-grams -

A

N-grams - An N-gram is simply a sequence of N words. N-grams cannot be used to build a context aware model.

67
Q

Object2Vec

A

Object2Vec can be used to find semantically similar objects such as questions.
BlazingText Word2Vec can only find semantically similar words. Factorization Machines and XGBoost are not fit for this use-case.

68
Q

XGBoost -

A

The XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the variety of hyperparameters that you can fine-tune.

You can use XGBoost for regression, classification (binary and multiclass), and ranking problems.

69
Q

Object2Vec

A

Object2Vec can be used to find semantically similar objects such as questions. BlazingText Word2Vec can only find semantically similar words. Factorization Machines and XGBoost are not fit for this use-case.

Object2Vec : The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.

A good reference read for Object2Vec:

https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/


70
Q

Factorization Machines

A

The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks.

It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

71
Q

BlazingText Word2Vec mode

A

BlazingText Word2Vec mode - The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms.

The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.

The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a word embedding. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words.

72
Q

AWS Glue

A

AWS Glue job cannot write output in recordIO-protobuf format,

73
Q

One-hot Encoding

A
  • One-hot encoding is a representation of categorical variables as binary vectors. Initially, categorical values are mapped to integer values. Then, each integer value is represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.

One-hot encoding would not capture the price variance with respect to size, so this option is incorrect.
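A small one-hot encoding example with pandas (hypothetical categorical column):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)  # one binary indicator column per category
```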

74
Q

L1 regularization

A

Leverage L1 regularization for the classifier

Regularization helps prevent linear models from overfitting training data examples (that is, memorizing patterns instead of generalizing them) by penalizing extreme weight values. L1 regularization has the effect of reducing the number of features used in the model by pushing to zero the weights of features that would otherwise have small weights. As a result, L1 regularization results in sparse models and reduces the amount of noise in the model.

The given use case states that there is a significant difference in the accuracy of the training and validation datasets, so the model is overfitting. Therefore, you can use L1 regularization as a solution for the given scenario.
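A minimal scikit-learn sketch of the idea; the SageMaker linear learner exposes a comparable "l1" hyperparameter (names here are illustrative):

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    penalty="l1",
    solver="liblinear",  # a solver that supports L1
    C=0.1,               # smaller C -> stronger regularization -> sparser weights
)
# clf.fit(X_train, y_train); many coefficients end up exactly zero, reducing noise
```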

75
Q

Recursive Feature Elimination

A

Recursive Feature Elimination, or RFE is effective at selecting those features (columns) in a training dataset that are more or most relevant in predicting the target variable. Hence it allows you to choose the best features without modifying the existing features, which is a key requirement of the given scenario, as the Sales department wants to interpret the model and then determine the direct effect of significant characteristics on the model’s output.
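A minimal RFE illustration with scikit-learn on synthetic data, selecting original (untransformed) features:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

selector = RFE(estimator=LinearRegression(), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)  # boolean mask over the original, interpretable features
```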

76
Q

t-Distributed Stochastic Neighbor Embedding (t-SNE)

A

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration, dimensionality reduction and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition of how the data is arranged in a high-dimensional space. You won’t be able to see the direct impact of relevant features on the model outcome. Note that although you could use t-SNE for dimensionality reduction, it would result in dimensions that are not directly interpretable. Remember the key requirement of the given scenario - the Sales department wants to interpret the model and then determine the direct effect of significant characteristics on the model’s output. So this option is incorrect.

77
Q

Identify cyclical sales patterns

A

The best way to uncover any cyclical sales patterns is to engineer the cyclical features by representing these as (x,y) coordinates on a circle using sin and cos functions.

Highly recommend this deep-dive on feature engineering for cyclical features -

http://blog.davidkaleko.com/feature-engineering-cyclical-features.html
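A minimal sketch of the sin/cos encoding for a hypothetical month-of-year column, so that December and January end up close together:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"month": [1, 3, 6, 12]})  # hypothetical sales months
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
print(df)
```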

78
Q

one-hot encoding

A

You cannot use one-hot encoding to uncover any cyclical sales patterns.

79
Q

Latent Dirichlet Allocation (LDA)

A

Latent Dirichlet Allocation (LDA)

The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. You can use LDA to figure out the right categories for each product.

80
Q

AWS Rekognition

A

Amazon Rekognition Video has a feature called Pathing that can create a track of the path people take in videos and provide information such as:

The location of the person in the video frame at the time their path is tracked.

Facial landmarks such as the position of the left eye, when detected.

So, AWS Rekognition can be used to build a solution for this use-case.

81
Q

Chainer

A

Chainer is an open source deep learning framework written purely in Python on top of NumPy and CuPy Python libraries. The development is led by Japanese venture company Preferred Networks in partnership with IBM, Intel, Microsoft, and Nvidia.

82
Q

Amazon Personalize

A

Amazon Personalize is a machine learning service that makes it easy for developers to create individualized recommendations for customers using their applications.

Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing.

There is no need for development, training and testing of custom models when using Personalize.

83
Q

Amazon Kinesis Video Streams

A

Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, machine learning (ML), playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices. It durably stores, encrypts, and indexes video data in your streams, and allows you to access your data through easy-to-use APIs.

Kinesis Video Streams enables you to playback video for live and on-demand viewing, and quickly build applications that take advantage of computer vision and video analytics through integration with Amazon Rekognition Video, and libraries for ML frameworks such as Apache MxNet, TensorFlow, and OpenCV. Kinesis Video Streams also supports WebRTC, an open-source project that enables real-time media streaming and interaction between web browsers, mobile applications, and connected devices via simple APIs. Typical uses include video chat and peer-to-peer media streaming.

84
Q

Kinesis Data Streams

A

Kinesis Data Streams cannot directly consume the incoming video stream data. You will need to develop custom code to process the incoming video streams, so this option is incorrect.

85
Q

Kinesis Data Analytics

A

Kinesis Data Analytics cannot directly consume the incoming video stream data. You will need to develop custom code to process the incoming video streams via Kinesis Data Streams and then direct the resultant feed into Kinesis Data Analytics, so this option is incorrect.

86
Q

Kinesis Producer Library

A

An Amazon Kinesis Data Streams producer is an application that puts user data records into a Kinesis data stream (also called data ingestion). The Kinesis Producer Library (KPL) simplifies producer application development, allowing developers to achieve high write throughput to a Kinesis data stream.

87
Q

Amazon Elastic Inference

A

Amazon Elastic Inference accelerators are network attached devices that work along with SageMaker instances in your endpoint to accelerate your inference calls. Elastic Inference accelerates inference by allowing you to attach fractional GPUs to any SageMaker instance.

88
Q

N-grams

A

An N-gram is simply a sequence of N words. N-grams cannot be used to build a context aware model.

89
Q

Word2Vec

A

Word2vec is a two-layer neural net that processes text by “vectorizing” words. Its input is a text corpus and its output is a set of vectors: feature vectors that represent words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can understand. Word2vec cannot be used to build a context aware model.

90
Q

What measures would you take to make sure that the data is protected in-transit even for inter-node training communications?

A

There are no inter-node communications for batch processing, so inter-node traffic encryption is not required

91
Q

K-fold validation

A

In this validation approach, you split the example dataset into k parts. You treat each of these parts as a holdout set for k training runs, and use the other k-1 parts as the training set for that run. You produce k models using a similar process, and aggregate the models to generate your final model. The value k is typically in the range of 5-10.

So, you can use validation via a holdout set or K-fold validation to tune the hyperparameters for the Sagemaker XGBoost algorithm.
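A minimal K-fold illustration with scikit-learn (k = 5), independent of SageMaker:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy feature matrix
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, holdout_idx) in enumerate(kf.split(X)):
    # train on k-1 parts, validate on the held-out part
    print(f"fold {fold}: train={len(train_idx)} holdout={len(holdout_idx)}")
```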

92
Q

Mandatory hyperparameters for the SageMaker K-means algorithm?

A

feature_dim - This represents the number of features in the input data.

k - This represents the number of required clusters.
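A minimal sketch, assuming the SageMaker Python SDK generic Estimator for the built-in k-means image and a hypothetical IAM role:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
kmeans_image = image_uris.retrieve("kmeans", session.boto_region_name)

kmeans = Estimator(
    image_uri=kmeans_image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
kmeans.set_hyperparameters(feature_dim=50, k=10)  # the two mandatory hyperparameters
```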

93
Q

Only three of the built-in SageMaker algorithms support incremental training. Can you identify those three algorithms?

A

Object Detection

Image Classification

Semantic Segmentation

94
Q

inference pipeline

A

Within an inference pipeline model, Amazon SageMaker handles invocations as a sequence of HTTP requests.

95
Q

What is the format in which Amazon Sagemaker models are stored?

A

model.tar.gz

96
Q

CloudWatch

A

CloudWatch lets you monitor your system while storing all the logs and operational metrics separately from the actual implementation and code for training and testing your ML models. In this example, Amazon CloudWatch is used to keep a history of the model metrics for a specific amount of time, visualize model performance metrics, and create a CloudWatch dashboard.

Amazon SageMaker provides out-of-the-box integration with Amazon CloudWatch, which collects near-real-time utilization metrics for the training job instance, such as CPU, memory, and GPU utilization of the training job container.

97
Q

AWS CloudTrail

A

AWS CloudTrail: auditing, leaving a trail of activity.

CloudTrail captures API calls and related events made by or on behalf of your AWS account and delivers the log files to an Amazon S3 bucket that you specify. You can identify which users and accounts called AWS, the source IP address from which the calls were made, and when the calls occurred.

98
Q

Amazon Machine Learning Stack

A

Amazon provides machine learning resources in three “layers of the AI stack”:

  • the first is framework tools (e.g. TensorFlow, MXNet, PyTorch), interfaces (Gluon, Keras), and infrastructure (ECS, Elastic Inference, …),
  • the second is API-driven services, and
  • the third is machine learning platforms.
99
Q

IAM Roles in Sagemaker

A

To keep this process secure, Amazon SageMaker supports IAM role-based access to secure your artifacts in Amazon S3, where you can set different roles for different parts of the process. For instance, a certain data scientist can have access to PII information in the raw data bucket, but the DevOps engineer only has access to the trained model itself

If you disable direct internet access, the notebook instance won’t be able to train or host models unless your VPC has an interface endpoint (PrivateLink) or a NAT gateway and your security groups allow outbound connections.

100
Q

SageMaker encryption

A

Amazon SageMaker also encrypts data at rest

Along with IAM roles to prevent unwanted access, Amazon SageMaker also encrypts data at rest with either AWS Key Management Service (AWS KMS) or a transient key if no key is provided, and encrypts data in transit with TLS 1.2 for all other communication. Users can connect to the notebook instances using AWS SigV4 authentication so that any connection remains secure. Any API call you make is executed over an SSL connection.

101
Q

Linear learner automatic hyperparameter tuning

A

The linear learner algorithm also has an internal mechanism for tuning hyperparameters separate from the automatic model tuning feature described here. By default, the linear learner algorithm tunes hyperparameters by training multiple models in parallel. When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel models, num_models, to 1. The algorithm ignores any value that you set for num_models.

102
Q

AWS Rekognition

A

Rekognition cannot be used to create labels for the training videos, so this option is not relevant for the given use case.

103
Q

What measures would you take to make sure that the data is protected in-transit even for inter-node training communications?

A

There are no inter-node communications for batch processing, so inter-node traffic encryption is not required

104
Q

What is an estimator?

A

In machine learning, an estimator is an equation for picking the “best,” or most likely accurate, data model based upon observations in reality.

105
Q

Gluon

A

Gluon is an open source deep learning library jointly created by AWS and Microsoft that helps developers build, train and deploy machine learning models in the cloud.

106
Q

S3 Encryption for Objects

A

There are 4 methods of encrypting objects in S3:

  • SSE-S3: encrypts S3 objects using keys handled & managed by AWS
  • SSE-KMS: use AWS Key Management Service to manage encryption keys
    • Additional security (user must have access to the KMS key)
    • Audit trail for KMS key usage
  • SSE-C: when you want to manage your own encryption keys
  • Client-Side Encryption: encrypt data outside of AWS before sending it to AWS

From an ML perspective, SSE-S3 and SSE-KMS will be most likely used.

SSE: Server-Side Encryption

CSE: Client-Side Encryption

107
Q

S3 Security

A

User based

  • IAM (Identity and Access Management) policies - which API calls should be allowed for a specific user

Resource Based

  • Bucket Policies - bucket wide rules from the S3 console - allows cross account access
  • Object Access Control List (ACL) – finer grain
  • Bucket Access Control List (ACL) – less common
108
Q

S3 Bucket Policies

A

JSON based policies

  • Resources: buckets and objects
  • Actions: Set of API to Allow or Deny (put object or get object)
  • Effect: Allow / Deny
  • Principal: The account or user to apply the policy to

Use S3 bucket for policy to:

  • Grant public access to the bucket
  • Force objects to be encrypted at upload
  • Grant access to another account (Cross Account)

bucket policies are evaluated BEFORE encryption
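A minimal boto3 sketch of a bucket policy that forces objects to be encrypted at upload (bucket name is hypothetical):

```python
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-ml-data-bucket/*",
        # deny any PutObject that does not include a server-side encryption header
        "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-ml-data-bucket",
    Policy=json.dumps(policy),
)
```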

109
Q

Kinesis Firehose

A

Firehose offers built-in integration with intermediary lambda functions to handle any transformations. This transformed data is then consumed in QuickSight for visualizations.

110
Q

A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably. Which system would provide the most cost-effective and reliable solution?

A

Using Glue in the architecture involves custom code development for the Glue script to handle the streaming data. Also, you need to set up both Kinesis Data Streams and Glue Streaming for this option, so it turns out to be more complex than just directly using Kinesis Firehose.