ACG Practice Questions Flashcards

1
Q

Your organization has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as, and where is the best place to load it?

A

Key-value pair JSON data is considered semi-structured data because it doesn’t have a formally defined schema, but it does have some structural properties (keys and values).
If our data is going to be used for a machine learning project in AWS, we need to find a way to get that data into S3.
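
For example, a single record in one of these JSON files might look like the following (the field names are hypothetical). There is no enforced schema, but the key-value structure is still self-describing:

    {"user_id": "u-1138", "event": "purchase", "amount": 42.50}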

2
Q

You are trying to set up a crawler within AWS Glue that crawls your input data in S3. After the crawler finishes executing, however, it cannot determine the schema of your data and no tables are created within your AWS Glue Data Catalog. What is the reason for these results?

A

AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems.
If none of the built-in classifiers can determine the format of your input data, you will need to set up a custom classifier that helps the AWS Glue crawler determine the schema of your input data.
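
As a sketch, a custom classifier could be registered with a grok pattern via boto3; the classifier name, classification label, and pattern below are hypothetical:

    import boto3

    glue = boto3.client("glue")

    # The crawler tries custom classifiers before falling back to the
    # built-in ones.
    glue.create_classifier(
        GrokClassifier={
            "Name": "my-app-logs",
            "Classification": "app_logs",
            "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
        }
    )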

3
Q

You are an ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache log files stored in S3. Your organization already uses Redshift as its data warehousing solution. Which tool can help you achieve this with the LEAST amount of effort?

A

Since the organization already uses Redshift as its data warehouse solution, Redshift Spectrum would require less effort than setting up AWS Glue and Athena.
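
A sketch of the Spectrum setup, assuming the log data has already been cataloged; the cluster, database, and role names are hypothetical. The external schema only needs to be created once, after which the logs can be queried with ordinary SQL from the existing cluster:

    import boto3

    rsd = boto3.client("redshift-data")

    # Expose the cataloged S3 data as an external schema in Redshift.
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="admin",
        Sql="""
            CREATE EXTERNAL SCHEMA spectrum_logs
            FROM DATA CATALOG
            DATABASE 'apache_logs_db'
            IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
            CREATE EXTERNAL DATABASE IF NOT EXISTS;
        """,
    )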

4
Q

You are an ML specialist working with data that is stored in a distributed EMR cluster on AWS. Currently, your machine learning applications are compatible with the Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive to use the AWS Glue Data Catalog as its metastore. Before you can do this, you need to transfer the Apache Hive metastore tables into an AWS Glue Data Catalog. What are the steps you’ll need to take to achieve this with the LEAST amount of effort?

A

  • The benefit of using the Data Catalog (over the Hive metastore) is that it provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore.
  • We can simply run a Hive script that queries the metastore tables and outputs that data in CSV (or other formats) into S3. Once that data is on S3, we can crawl it to create a Data Catalog of the Hive metastore (as sketched below), or import the data directly from S3.
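
A minimal sketch of the crawl step with boto3 (the crawler, role, database, and bucket names are hypothetical):

    import boto3

    glue = boto3.client("glue")

    # Crawl the CSV export that the Hive script wrote to S3.
    glue.create_crawler(
        Name="hive-metastore-import",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="hive_catalog",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/hive-export/"}]},
    )
    glue.start_crawler(Name="hive-metastore-import")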

5
Q

Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics?

A

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes these features easy to integrate into your applications.

6
Q

You are collecting clickstream data from an e-commerce website to make near-real-time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meet all of the requirements?

A
  • Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose.
  • You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met, you can trigger Lambda functions to make real-time product suggestions to users.
  • It is not important that we store or persist the clickstream data.
7
Q

You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to be processed immediately before operations can continue. The second event type includes data of less importance; operations can continue without it being processed right away. What is the most appropriate solution for recording these different types of events?

A

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type.
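
A minimal sketch of the synchronous path with boto3 (the stream name and payload are hypothetical). The call blocks until Kinesis acknowledges the batch, and FailedRecordCount tells you what needs to be retried before operations continue:

    import json

    import boto3

    kinesis = boto3.client("kinesis")

    response = kinesis.put_records(
        StreamName="critical-events",
        Records=[
            {
                "Data": json.dumps({"event": "order_placed"}).encode(),
                "PartitionKey": "order-123",
            }
        ],
    )
    if response["FailedRecordCount"] > 0:
        ...  # retry the failed records before continuing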

8
Q

True or False. If you have mission-critical data that must be processed with as little delay as possible, you should use the Kinesis API (AWS SDK) over the Kinesis Producer Library.

A

True
The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.

9
Q

Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A

Kinesis Data Streams allows you to stream data into AWS and build custom applications around that streaming data.

10
Q

Your organization has a standalone JavaScript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?

A
  • The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams.
  • There are ways to process KPL-serialized data within AWS Lambda, in Java, Node.js, and Python, but none of these answers mention Lambda.
11
Q

What are your options for storing data in S3? (Choose 4)

A

You can use the AWS Management Console, the AWS command line interface (CLI), the AWS SDKs, or the S3 REST API.
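
A sketch of the SDK option with boto3 (bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Upload a local file...
    s3.upload_file("training_data.csv", "my-ml-bucket", "raw/training_data.csv")

    # ...or write an object directly from memory.
    s3.put_object(Bucket="my-ml-bucket", Key="raw/sample.json", Body=b'{"a": 1}')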

12
Q

Your organization needs to find a way to capture streaming data from certain events customers are performing. These events are a crucial part of the organization’s business development and cannot afford to be lost. You’ve already set up a Kinesis Data Stream and a consumer EC2 instance to process and deliver the data into S3. You’ve noticed that the last few days of events are not showing up in S3 and your EC2 instance has been shut down. What combination of steps can you take to ensure this does not happen again?

A

In this setup, the data is being ingested by Kinesis Data Streams and processed and delivered by an EC2 instance. It’s best practice to set up CloudWatch monitoring for your consumer EC2 instance, along with Auto Scaling so a replacement instance is launched if the consumer is shut down. Since this is critical data that we cannot afford to lose, we should also set the retention period to the maximum number of hours (168 hours, or 7 days). Finally, we need to reprocess the records that are still in the data stream but failed to be written to S3.
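
Extending the retention period is a single call with boto3 (the stream name is hypothetical):

    import boto3

    kinesis = boto3.client("kinesis")

    # Raise retention from the 24-hour default to the 168-hour maximum
    # discussed here, so unprocessed records survive a consumer outage.
    kinesis.increase_stream_retention_period(
        StreamName="customer-events",
        RetentionPeriodHours=168,
    )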

13
Q

You work for a farming company that has dozens of tractors with built-in IoT devices. These devices stream data into AWS using Kinesis Data Streams. The features associated with the data are tractor ID, latitude, longitude, inside temperature, outside temperature, and fuel level. As an ML specialist, you need to transform the data and store it in a data store. Which combination of services can you use to achieve this?

A
  • Kinesis Data Streams and Kinesis Data Analytics cannot write data directly to S3.
  • Kinesis Data Firehose is used as the main delivery mechanism for outputting data into S3.
14
Q

You are collecting clickstream data from an e-commerce website using Kinesis Data Firehose. You are using the PutRecord API from the AWS SDK to send the data to the stream. What are the required parameters when sending data to Kinesis Data Firehose using the PutRecord API call?

A

Kinesis Data Firehose is used as a delivery stream. We do not have to worry about shards, partition keys, etc. All we need is the Firehose DeliveryStreamName and the Record object (which contains the data).
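
A minimal sketch with boto3 (the delivery stream name and payload are hypothetical); note that only the two parameters named above are passed:

    import json

    import boto3

    firehose = boto3.client("firehose")

    firehose.put_record(
        DeliveryStreamName="clickstream-delivery",
        Record={"Data": json.dumps({"page": "/checkout"}).encode() + b"\n"},
    )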

15
Q

You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data you are ingesting is each player’s controller inputs, sent every second (up to 10 players in a game) in JSON format. The data needs to be ingested through Kinesis Data Streams, and each JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefore 1 shard is enough to handle the streaming data.
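
The same capacity check written out in code, using the standard per-shard ingest limits of 1 MB/sec and 1,000 records/sec:

    import math

    records_per_sec = 10        # up to 10 players, one record each per second
    record_size_kb = 100
    throughput_kb = records_per_sec * record_size_kb  # 1000 KB/sec

    shards = max(
        math.ceil(throughput_kb / 1024),    # throughput limit: 1 MB/sec
        math.ceil(records_per_sec / 1000),  # record-count limit: 1,000/sec
    )
    print(shards)  # 1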

16
Q

A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: { Hello world } and { Hello how are you }. What are the dimensions of the tf–idf vector/matrix?

A

There are 2 sentences (or corpus data we are vectorizing) with 5 unique unigrams (‘are’, ‘hello’, ‘how’, ‘world’, ‘you’) and there are 4 unique bigrams (‘are you’, ‘hello how’, ‘hello world’, ‘how are’). So the vectorized matrix would be (2, 9).
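
This can be verified with scikit-learn’s TfidfVectorizer (its default tokenizer keeps every word in this corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["Hello world", "Hello how are you"]

    # ngram_range=(1, 2) builds both unigrams and bigrams.
    matrix = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus)

    print(matrix.shape)  # (2, 9): 2 documents x (5 unigrams + 4 bigrams)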

17
Q

We are designing a binary classification model that tries to predict whether a customer is likely to respond to a direct mailing of our catalog. Because it is expensive to print and mail our catalog, we want to only send to customers where we have a high degree of certainty they will buy something. When considering if the customer will buy something, what outcome would we want to minimize in a confusion matrix?

A

We would want to minimize the occurrence of False Positives. This would mean that our model predicted that the customer would buy something but the actual outcome was that the customer did not buy anything.
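
In scikit-learn’s binary confusion matrix layout, the false positives are the cell we want to drive toward zero (the labels below are hypothetical, with 1 meaning “will buy”):

    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 1, 1, 0, 1]  # what customers actually did
    y_pred = [0, 1, 1, 1, 0, 0]  # what the model predicted

    # ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(fp)  # 1 catalog mailed to a customer who did not buy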

18
Q

Your company currently has a large on-prem Hadoop cluster that contains data you would like to use for a training job. Your cluster is equipped with Mahout, Flume, Hive, Spark, and Ganglia. How might you most efficiently use this data?

A

If the Hadoop cluster has Spark, you can use the SageMaker Spark library to convert data in Spark DataFrame format into protobuf and load it onto S3. From there, you can use SageMaker as normal.

19
Q

In a binary classification problem, you observe that precision is poor. Which of the following most contributes to poor precision?

A

Precision is defined as the ratio of true positives to the sum of all predicted positives, which includes the correctly labeled positives and those we predicted as positive but were really negative (false positives). Another term for a false positive is a Type I error.
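
In symbols, with TP for true positives and FP for false positives:

    Precision = TP / (TP + FP)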

20
Q

You are preparing for a first training run using a custom algorithm that you have packaged in a Docker container. What should you do to ensure that the training metrics are visible to CloudWatch?

A

When using a custom algorithm, you need to ensure that the desired metrics are written to stdout. You also need to include a metric definition, with a regex expression that extracts the metric value from the stdout output, when defining the training job.
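
A sketch of how the pieces fit together; the metric name, regex, and log line below are hypothetical. The same list would be supplied as MetricDefinitions when defining the training job:

    import re

    # Metric definitions as they would appear in the training job config.
    metric_definitions = [
        {"Name": "train:loss", "Regex": r"loss=([0-9\.]+)"},
    ]

    # A line the container printed to stdout during training.
    log_line = "epoch=3 loss=0.4217"

    match = re.search(metric_definitions[0]["Regex"], log_line)
    print(match.group(1))  # 0.4217, the value SageMaker emits to CloudWatch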

21
Q

After training and validation sessions, we notice that the accuracy rate for training is acceptable but the accuracy rate for validation is very poor. What might we do? (Choose 3)

A

A high error rate observed in validation but not in training usually indicates overfitting to the training data. We can introduce more data, add early stopping to the training job, and reduce the number of features, among other things, to help return the model to a generalizer.

22
Q

After multiple training runs, you notice that the loss function settles on different, but similar, values. You believe that there is potential to improve the model by adjusting hyperparameters. What might you try next?

A

Learning rate can be thought of as the “step length” of the training process. A learning rate can be so large that the training process cannot find the true global minimum. Decreasing the learning rate allows the training process to find lower loss function floors, but it can also increase the time needed for convergence.
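
A toy illustration, minimizing f(x) = x^2 with plain gradient descent; the rates and step count are arbitrary:

    # Gradient descent on f(x) = x^2, whose gradient is 2x.
    def descend(lr, steps=20, x=1.0):
        for _ in range(steps):
            x -= lr * 2 * x
        return x

    print(descend(lr=1.1))  # diverges: the steps overshoot and |x| grows
    print(descend(lr=0.1))  # converges toward the minimum at x = 0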

23
Q

Your company has just discovered a security breach occurred in a division separate from yours but has ordered a full review of all access logs. You have been asked to provide the last 180 days of access to the three SageMaker Hosted Service models that you manage. When you set up these deployments, you left everything default. How will you be able to respond?

A

CloudTrail is the proper service if you want to see who has sent API calls to your SageMaker hosted models, but by default it will only store the last 90 days of events. You can configure CloudTrail to store an unlimited amount of logs in S3, but this is not turned on by default. While CloudTrail is not strictly an access log, it performs the same auditing functions you might expect, and an auditor may not necessarily be familiar with the nuances of AWS.
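
Going forward, retaining events beyond 90 days means creating a trail that delivers to S3. A sketch with boto3 (trail and bucket names are hypothetical):

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Unlike the default 90-day event history, a trail delivers events
    # to S3, where they can be retained indefinitely.
    cloudtrail.create_trail(
        Name="sagemaker-audit-trail",
        S3BucketName="my-audit-log-bucket",
        IsMultiRegionTrail=True,
    )
    cloudtrail.start_logging(Name="sagemaker-audit-trail")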