AWS ML Certification: Test Questions Flashcards
You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?
The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models, including XGBoost, and offer the simplest solution.
You cannot deploy SageMaker to an EMR cluster, and XGBoost actually requires LibSVM or CSV input, not RecordIO.
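For illustration, a minimal PySpark sketch of that pattern, assuming the sagemaker_pyspark library is installed on the Spark/EMR cluster; the role ARN, instance types, hyperparameter setter, and S3 path are placeholders:

```python
# Sketch only: train SageMaker XGBoost directly from a Spark DataFrame via the
# sagemaker_pyspark integration (SageMakerEstimator family).
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=2,
    endpointInstanceType="ml.m5.large",
    endpointInitialInstanceCount=1)
estimator.setNumRound(100)  # illustrative XGBoost hyperparameter

df = spark.read.parquet("s3://my-bucket/preprocessed/")  # output of the Spark pre-processing
model = estimator.fit(df)            # trains on SageMaker and deploys an endpoint
predictions = model.transform(df)    # inference against that endpoint
```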
Your automatic hyperparameter tuning job in SageMaker is consuming more resources than you would like, and coming at a high cost. What are TWO techniques that might reduce this cost?
Since the tuning process learns from each incremental step, too much concurrency can actually hinder that learning, so running fewer training jobs in parallel helps. Logarithmic ranges also tend to find optimal values more quickly than linear ranges. Inference pipelines are a real feature, but have nothing to do with this problem.
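For example, a hedged sketch with the SageMaker Python SDK that combines a logarithmic range with low concurrency; the estimator object, metric name, and range bounds are placeholders:

```python
# Sketch: cost-conscious tuning -- logarithmic scaling for the learning rate and
# limited concurrency so later training jobs can learn from earlier results.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    "eta": ContinuousParameter(0.001, 0.5, scaling_type="Logarithmic"),
    "max_depth": IntegerParameter(3, 10),
}

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                 # an already-configured Estimator (placeholder)
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=2,                     # low concurrency keeps cost down and helps the search learn
)
tuner.fit({"train": train_input, "validation": validation_input})
```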
Your company wishes to monitor social media, and perform sentiment analysis on Tweets to classify them as positive or negative sentiment. You are able to obtain a data set of past Tweets about your company to use as training data for a machine learning system, but they are not classified as positive or negative. How would you build such a system?
A machine learning system needs labeled data to train itself with; there’s no getting around that. Only the Ground Truth answer produces the positive or negative labels we need, by using humans to create that training data initially. Another solution would be to use natural language processing through a service such as Amazon Comprehend.
A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably. Which system would provide the most cost-effective and simplest solution?
The use of Spot Instances in response to anticipated surges in usage is the most cost-effective approach for scaling up an EMR cluster. Kinesis Data Streams would be over-engineering because we do not have a real-time streaming requirement. Elasticsearch doesn’t make sense because Elasticsearch is not a recommender engine.
You are developing an autonomous vehicle that must classify images of street signs with extremely low latency, processing thousands of images per second. What AWS-based architecture would best meet this need?
SageMaker Neo is designed for compiling models using TensorFlow and other frameworks to edge devices such as Nvidia Jetson. The low latency requirement requires an edge solution, where the classification is being done within the vehicle itself and not over the air. Rekognition (which doesn’t have an “edge mode,” but does integrate with DeepLens) can’t handle the very specific classification task of identifying different street signs and what they mean.
You are developing a machine learning model to predict house sale prices based on features of a house. 10% of the houses in your training data are missing the number of square feet in the home. Your training data set is not very large. Which technique would allow you to train your model while achieving the highest accuracy?
Deep learning is better suited to the imputation of categorical data. Square footage is numerical, which is better served by kNN. While simply dropping rows with missing data or imputing the mean values is a lot easier, neither yields the best accuracy.
A ride-hailing company needs to ingest and store certain attributes of real-time automobile health data which is in JSON format. The company does not want to manage the underlying infrastructure and it wants the data to be available for visualization on a near real time basis.
As an ML specialist, what is your recommendation so that the solution requires the least development time and infrastructure management?
Ingest the data using Kinesis Data Firehose with a Lambda function that writes the selected attributes from the input data stream into an S3 location. Then pipe this processed data into QuickSight for visualizations.
Amazon Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. Kinesis Data Firehose manages all underlying infrastructure, storage, networking, and configuration needed to capture and load your data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, or Splunk. You do not have to worry about provisioning, deployment, ongoing maintenance of the hardware, software, or write any other application to manage this process.
Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.
This is the correct option as it can be used to process the streaming JSON data via Kinesis Firehose that uses a Lambda to write the selected attributes as JSON data into an S3 location. You should note that Firehose offers built-in integration with intermediary lambda functions to handle any transformations. This transformed data is then consumed in QuickSight for visualizations.
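For illustration, a minimal sketch of such a Firehose transformation Lambda; the attribute names are hypothetical:

```python
# Sketch of a Kinesis Data Firehose transformation Lambda: decode each record,
# keep only the attributes of interest, and return the records to Firehose.
import base64
import json

SELECTED_ATTRIBUTES = ["vehicle_id", "timestamp", "engine_temp", "oil_pressure"]  # hypothetical fields

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        trimmed = {key: payload.get(key) for key in SELECTED_ATTRIBUTES}
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(trimmed) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```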
AWS Lambda
Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring and logging. With Lambda, you can run code for virtually any type of application or backend service. All you need to do is supply your code in one of the languages that Lambda supports.
Lambda functions are not meant to handle ETL workloads, so this option is also ruled out.
Incremental Training
Over time, you might find that a model generates inferences that are not as good as they were in the past. With incremental training, you can use the artifacts from an existing model and use an expanded dataset to train a new model. Incremental training saves both time and resources.
You can use incremental training to:
Train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance.
Use the model artifacts or a portion of the model artifacts from a popular publicly available model in a training job. You don’t need to train a new model from scratch.
Resume a training job that was stopped.
Train several variants of a model, either with different hyperparameter settings or using different datasets.
You can read more on this reference link -
https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html
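A hedged sketch of kicking off an incremental-training job with the SageMaker Python SDK, passing the previous model artifacts as a "model" channel; the algorithm chosen, role ARN, and S3 paths are placeholders:

```python
# Sketch: start a new training job from an existing model artifact by supplying
# it as the "model" channel (assumes a built-in algorithm that supports
# incremental training, e.g. image classification).
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("image-classification", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/incremental-output/",
)

inputs = {
    "train": TrainingInput("s3://my-bucket/expanded-train/"),
    "validation": TrainingInput("s3://my-bucket/validation/"),
    # Artifacts from the previous training job seed the new one
    "model": TrainingInput("s3://my-bucket/previous-job/output/model.tar.gz",
                           content_type="application/x-sagemaker-model"),
}
estimator.fit(inputs)
```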
The data science team at a leading Questions and Answers website wants to improve the user experience and therefore would like to identify duplicate questions based on similarity of the text found in a given question.
As an ML Specialist, which SageMaker algorithm would you recommend to help solve this problem?
Object2Vec
Object2Vec can be used to find semantically similar objects such as questions. BlazingText Word2Vec can only find semantically similar words. Factorization Machines and XGBoost are not fit for this use-case.
Object2Vec : The Amazon SageMaker Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression.
A good reference read for Object2Vec:
https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/
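For illustration, a hedged sketch of the Object2Vec setup: each training record pairs two tokenized questions with a duplicate/not-duplicate label. The token IDs, hyperparameter values, role ARN, and S3 path are placeholders:

```python
# Sketch: Object2Vec learns embeddings from labeled pairs of objects, here
# pairs of tokenized questions (1 = duplicate, 0 = not duplicate).
import json
import sagemaker
from sagemaker.estimator import Estimator

# One JSON Lines record per question pair (token IDs are illustrative)
record = {"label": 1, "in0": [103, 17, 508, 22], "in1": [103, 17, 997, 22]}
print(json.dumps(record))

session = sagemaker.Session()
image_uri = sagemaker.image_uris.retrieve("object2vec", session.boto_region_name)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)
estimator.set_hyperparameters(
    enc0_max_seq_len=50, enc0_vocab_size=30000,
    enc1_max_seq_len=50, enc1_vocab_size=30000,
    output_layer="softmax",   # classification: duplicate vs. not
)
estimator.fit({"train": "s3://my-bucket/question-pairs/train.jsonl"})
```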
Factorization Machines
Factorization Machines - The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.
Kinesis Data Analytics
Kinesis Data Analytics cannot directly write data into S3.
Authorization for Amazon SageMaker
Amazon SageMaker supports authorization based on resource tags - You can attach tags to SageMaker resources or pass tags in a request to SageMaker. To control access based on tags, you provide tag information in the condition element of a policy using the sagemaker:ResourceTag/key-name, aws:RequestTag/key-name, or aws:TagKeys condition keys.
SageMaker does not support resource-based policies, so this option is incorrect.
SageMaker doesn’t support service-linked roles, so this option is incorrect.
IAM features for SageMaker
With IAM (Identity and Access Management) identity-based policies, you can specify allowed or denied actions and resources, as well as the conditions under which actions are allowed or denied, for Amazon SageMaker. SageMaker supports specific actions, resources, and condition keys. Administrators can use AWS JSON policies to specify who has access to what: which principal can perform actions on what resources, and under what conditions. The Action element of a JSON policy describes the actions that you can use to allow or deny access in a policy.
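As an illustration, a hedged sketch of a tag-based identity policy using the sagemaker:ResourceTag condition key; the tag key, tag value, policy name, and the actions chosen are hypothetical:

```python
# Sketch: an identity-based policy that only allows actions on SageMaker
# resources carrying a specific tag (tag key/value are hypothetical).
import json
import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["sagemaker:DescribeTrainingJob", "sagemaker:StopTrainingJob"],
        "Resource": "*",
        "Condition": {
            "StringEquals": {"sagemaker:ResourceTag/team": "fraud-detection"}
        },
    }],
}

iam = boto3.client("iam")
iam.create_policy(PolicyName="SageMakerTeamTagAccess",
                  PolicyDocument=json.dumps(policy_document))
```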
Shard
A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity. Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacities of its shards. If your data rate increases, you can increase or decrease the number of shards allocated to your stream. For more information, see Resharding a Stream.
A Kinesis data stream is a set of shards. Each shard has a sequence of data records. Each data record has a sequence number that is assigned by Kinesis Data Streams.
How many shards would a Kinesis Data Streams application need if the average record size is 500KB and 2 records per second are being written into this application that has 7 consumers?
4
number_of_shards = max (incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)
where
incoming_write_bandwidth_in_KB = average_data_size_in_KB * number_of_records_per_second = 500 * 2 = 1000
outgoing_read_bandwidth_in_KB = incoming_write_bandwidth_in_KB * number_of_consumers = 1000 * 7 = 7000
So, number_of_shards = max(1000/1000, 7000/2000) = max(1, 3.5) = 3.5, which rounds up to 4.
So, 4 shards are needed to address this use case.
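The same arithmetic as a small Python helper:

```python
# Shard estimate for a Kinesis data stream, per the formula above.
import math

def number_of_shards(avg_record_kb, records_per_second, consumers):
    incoming_kb = avg_record_kb * records_per_second   # write bandwidth in KB/s
    outgoing_kb = incoming_kb * consumers               # read bandwidth in KB/s
    return math.ceil(max(incoming_kb / 1000, outgoing_kb / 2000))

print(number_of_shards(500, 2, 7))  # -> 4
```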
An e-commerce company is looking for a solution that detects anomalies in order to identify fraudulent transactions. The company utilizes Amazon Kinesis to transfer JSON-formatted transaction records from its on-premises database to Amazon S3. The existing dataset comprises 200-column wide records for each transaction. To identify fraudulent transactions, the solution needs to analyze just 20 of these columns.
Which of the following would you suggest as the lowest-cost solution that needs the least development work and offers out-of-the-box anomaly detection functionality?
Transform the data from JSON format to Apache Parquet format using an AWS Glue job. Configure AWS Glue crawlers to discover the schema and build the AWS Glue Data Catalog. Leverage Amazon Athena to create a table with a subset of columns. Set up Amazon QuickSight for visual analysis of the data and identify fraudulent transactions using QuickSight’s built-in machine learning-powered anomaly detection
For the given use case, you can use an AWS Glue job to extract, transform, and load (ETL) data from the data source (in JSON format) to the data target (in Parquet format). You can then use an AWS Glue crawler, which is a program that connects to a data store (source or target) such as Amazon S3, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
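For illustration, a hedged sketch of the Glue ETL step that converts the JSON records to Parquet and keeps only the columns needed for analysis; the database, table, column names, and output path are hypothetical:

```python
# Sketch of the Glue ETL job: read the crawled JSON table, select a subset of
# columns, and write Parquet to S3 for Athena/QuickSight.
from awsglue.context import GlueContext
from awsglue.transforms import SelectFields
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

transactions = glue_context.create_dynamic_frame.from_catalog(
    database="fraud_db", table_name="raw_transactions")

subset = SelectFields.apply(
    frame=transactions,
    paths=["transaction_id", "amount", "merchant_id", "card_country"])  # 20 columns in practice

glue_context.write_dynamic_frame.from_options(
    frame=subset,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/refined/transactions/"},
    format="parquet")
```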
Principal Component Analysis
Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. There are various methods for dimensionality reduction such as - Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), etc.
PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. PCA can help in identifying the most relevant derived features for the given use case.
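A hedged sketch of running the SageMaker PCA built-in algorithm with the Python SDK; the role ARN, instance type, component count, and the random matrix standing in for the real features are placeholders:

```python
# Sketch: reduce a wide numeric feature matrix with the SageMaker PCA
# built-in algorithm.
import numpy as np
from sagemaker import PCA

train_features = np.random.rand(1000, 200).astype("float32")  # stand-in for the real data

pca = PCA(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    num_components=20,
)
pca.fit(pca.record_set(train_features))
```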
K means
K-means is used for clustering and it cannot be used to identify the most relevant derived features for the given use case.
K-means is the right algorithm to uncover discrete groupings within the given dataset.
Latent Dirichlet Allocation
Latent Dirichlet Allocation is used for topic modeling and it cannot be used to identify the most relevant derived features for the given use case.
Neural Topic Model
Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution.
Neural Topic Model
SageMaker NTM is used for topic modeling and it cannot be used to identify the most relevant derived features for the given use case.
SageMaker Ground Truth
Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. Amazon SageMaker Ground Truth makes it easy for you to efficiently and accurately label the datasets required for training machine learning systems. SageMaker Ground Truth can automatically label a portion of the dataset based on the labels done manually by human labelers.
AWS Glue ML Transforms for de-duplication
The data science team at an email marketing company has created a data lake with raw and refined zones. The raw zone has the data as it arrives from the source, however, the team wants to de-duplicate the data before it is written into the refined zone.
What is the best way to accomplish this with the least amount of development time and infrastructure maintenance effort?
You can create machine learning transforms to cleanse your data using AWS Glue ML Transforms. You can call these transforms from your ETL script. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly.
You can create machine learning transforms to cleanse your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its schema to process your data.
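For illustration, a hedged sketch of applying a pre-trained FindMatches transform inside a Glue ETL script; the transform ID, database, and table names are placeholders:

```python
# Sketch: apply a FindMatches ML transform inside a Glue ETL job to flag
# duplicate records before writing to the refined zone.
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

raw = glue_context.create_dynamic_frame.from_catalog(
    database="marketing_db", table_name="raw_contacts")

matched = FindMatches.apply(frame=raw, transformId="tfm-0123456789abcdef")
```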
AWS Glue Input types not supported
AWS Glue jobs can be used to create serverless ETL jobs.
AWS Glue does NOT support Timestream as the source input type.
AWS Glue cannot import a Docker container with TensorFlow.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development
Glue cannot write the output in RecordIO-Protobuf format.
Apache Spark
Apache Spark is a distributed processing framework and programming model that helps you do machine learning, stream processing, or graph analytics using Amazon EMR clusters. Amazon EMR is a web service that makes it easy for you to process and analyze vast amounts of data using applications in the Hadoop ecosystem, including Hive, Pig, HBase, Presto, Impala, and others.
Apache Spark (running on the EMR cluster in this use case) can write the output in RecordIO-Protobuf format.
AWS Step Functions
- Use to design workflows
- Easy visualizations
- Advanced Error Handling and Retry mechanism outside the code
- Audit of the history of workflows
- Ability to “Wait” for an arbitrary amount of time
- Max execution time of a State Machine is 1 year
AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Lambda is not suited for long-running processes such as the task of transforming 1TB data into RecordIO-Protobuf format.
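As an illustration, a hedged sketch of a tiny state machine that waits and then runs a long transformation task with retries defined outside the code; the ARNs and names are placeholders:

```python
# Sketch: a minimal Step Functions state machine with a Wait state, a Task
# state, and a retry policy handled by the workflow rather than the code.
import json
import boto3

definition = {
    "StartAt": "WaitForData",
    "States": {
        "WaitForData": {"Type": "Wait", "Seconds": 3600, "Next": "TransformData"},
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:start-transform-job",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "IntervalSeconds": 60}],
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="transform-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole")
```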
Kinesis Data Firehose
It is not meant to be used for batch processing use cases and it cannot write data in RecordIO-Protobuf format.
CSV format for SageMaker
Many Amazon SageMaker algorithms support training with data in CSV format. To use data in CSV format for training, in the input data channel specification, specify text/csv as the ContentType. Amazon SageMaker requires that a CSV file does not have a header record and that the target variable is in the first column.
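For example, a brief sketch of pointing a training channel at headerless CSV data; the bucket and prefix are placeholders:

```python
# Sketch: training channel over CSV files with no header row and the target
# variable in the first column.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    "s3://my-bucket/train/",      # CSV files: no header, target in column 0
    content_type="text/csv")
```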
AWS Batch
AWS Batch is a set of batch management capabilities that dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized compute resources) based on the volume and specific resource requirements of the batch jobs submitted.
With AWS Batch, there is no need to install and manage batch computing software or server clusters, allowing you to instead focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 (available with Spot Instances) and AWS compute resources with AWS Fargate or Fargate Spot.
AWS Batch can be used to configure and schedule resources for the given use case.
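For illustration, a hedged sketch of submitting a job to AWS Batch with boto3; the job name, queue, and job definition are placeholders:

```python
# Sketch: submit a containerized batch job; AWS Batch provisions the compute.
import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="nightly-feature-build",
    jobQueue="ml-batch-queue",
    jobDefinition="feature-build:3")
```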
AWS Glue
AWS Glue - AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue consists of a Data Catalog which is a central metadata repository; an ETL engine that can automatically generate Scala or Python code; a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue cannot be used to configure and schedule resources for the given use case.
Amazon SageMaker
Amazon SageMaker is a fully managed service that provides the ability to build, train, and deploy machine learning (ML) models quickly.
Amazon SageMaker authorization
Amazon SageMaker supports authorization based on resource tags - You can attach tags to SageMaker resources or pass tags in a request to SageMaker. To control access based on tags, you provide tag information in the condition element of a policy using the sagemaker:ResourceTag/key-name, aws:RequestTag/key-name, or aws:TagKeys condition keys.
With IAM identity-based policies, you can specify allowed or denied actions and resources, as well as the conditions under which actions are allowed or denied, for Amazon SageMaker. SageMaker supports specific actions, resources, and condition keys. Administrators can use AWS JSON policies to specify who has access to what: which principal can perform actions on what resources, and under what conditions. The Action element of a JSON policy describes the actions that you can use to allow or deny access in a policy.
Incorrect options:
Amazon SageMaker supports resource-based policies - SageMaker does not support resource-based policies, so this option is incorrect.
Amazon SageMaker supports service-linked roles - SageMaker doesn’t support service-linked roles, so this option is incorrect.
Inference Pipeline
Inference Pipeline can be considered as an Amazon SageMaker model that you can use to make either real-time predictions or to process batch transforms directly without any external preprocessing.
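A hedged sketch of deploying an inference pipeline with the SageMaker Python SDK; the two Model objects, role ARN, and names are placeholders:

```python
# Sketch: chain a preprocessing model and an inference model into a single
# SageMaker inference pipeline endpoint.
from sagemaker.pipeline import PipelineModel

pipeline_model = PipelineModel(
    name="preprocess-then-predict",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    models=[preprocessing_model, xgboost_model])  # already-created Model objects

predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="fraud-inference-pipeline")
```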
SageMaker logging and monitoring
CloudWatch keeps the SageMaker monitoring statistics for 15 months. However, the Amazon CloudWatch console limits the search to metrics that were updated in the last 2 weeks
SageMaker monitoring metrics are available on CloudWatch at a 1-minute frequency
CloudTrail does not monitor calls to InvokeEndpoint
AWS CloudTrail provides a record of actions taken by a user, role, or an AWS service in Amazon SageMaker. CloudTrail keeps this record for a period of 90 days
Amazon SageMaker is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in SageMaker. CloudTrail captures all API calls for SageMaker, with the exception of InvokeEndpoint, as events. The calls captured include calls from the SageMaker console and code calls to the SageMaker API operations.
You can troubleshoot operational and security incidents over the past 90 days in the CloudTrail console by viewing Event history.
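For illustration, a hedged sketch of querying the 90-day CloudTrail event history for SageMaker API calls with boto3:

```python
# Sketch: review recent SageMaker control-plane calls from CloudTrail's
# 90-day event history (remember that InvokeEndpoint calls are not captured).
import boto3

cloudtrail = boto3.client("cloudtrail")
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventSource",
                       "AttributeValue": "sagemaker.amazonaws.com"}],
    MaxResults=50)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))
```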
First step of setting up a dataset in SageMaker
Shuffling: The first step would be to shuffle the dataset. Shuffling helps the training converge quickly, prevents any bias during training, and keeps the model from learning the order of the training data.
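For example, a brief sketch of shuffling with scikit-learn; the DataFrame df is a placeholder:

```python
# Sketch: shuffle the dataset before splitting so the model cannot learn the row order.
from sklearn.utils import shuffle

df_shuffled = shuffle(df, random_state=42).reset_index(drop=True)
```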
data warehouse
A data warehouse is a large collection of business data used to help an organization make decisions. A data warehouse periodically pulls data from the business apps and systems; then, the data goes through formatting and import processes to match the data already in the warehouse.
data lake
A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
Image Classification
Image Classification cannot be used to create labels for training data.
After training a SageMaker XGBoost-based model over a huge training dataset, the data science team observed that it has low accuracy on the training data as well as low accuracy on the test data.
Use more features in the model
Remove regularization from the model
Your model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y). Your model is overfitting your training data when you see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
For the given use case, as the model has low accuracy on the training data as well as low accuracy on the test data, it suggests that the model has high bias, or in other words, the model is underfitting. When a model is underfitting, adding more features to the model or removing regularization can help address the underlying problem. In the case of an underfitting model, adding more training data may or may not help.
Precision-Recall Area-Under-Curve (PR AUC)
This is an example where the dataset is imbalanced with fewer instances of positive class because of a fewer number of actual fraud records in the dataset. In such scenarios where we care more about the positive class, using PR AUC is a better choice, which is more sensitive to the improvements for the positive class.
PR AUC is the area under a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher your curve sits on the y-axis, the better your model’s performance.
Please review these excellent resources for a deep-dive into PR AUC:
https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc
https://machinelearningmastery.com/imbalanced-classification-with-the-fraudulent-credit-card-transactions-dataset/
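For illustration, a brief sketch of computing PR AUC with scikit-learn; y_true and y_scores stand in for the labels and predicted probabilities:

```python
# Sketch: PR AUC (average precision) for an imbalanced fraud-detection model.
from sklearn.metrics import average_precision_score, precision_recall_curve

pr_auc = average_precision_score(y_true, y_scores)
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(f"PR AUC: {pr_auc:.3f}")
```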
Incorrect options:
SageMaker DeepAR
Based on historical data for a behavior, you can predict future behavior using the DeepAR algorithm. For example, you can predict sales of a new product based on previous sales data. The SageMaker DeepAR algorithm specializes in forecasting new product performance.
Factorization Machines
- The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation. Factorization Machines cannot be used to forecast new product sales.
Amazon Lex
Amazon Lex is a service for building conversational interfaces using voice and text. Powered by the same conversational engine as Alexa, Amazon Lex provides high quality speech recognition and language understanding capabilities, enabling addition of sophisticated, natural language ‘chatbots’ to new and existing applications.