ACG Practice Questions Flashcards
Your organization has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as, and where is the best place to load this data?
Key-value pair JSON data is considered semi-structured data because it doesn't have a rigidly defined structure, but it does have some structural properties (keys and values).
If our data is going to be used for a machine learning project in AWS, we need to find a way to get that data into S3.
You are trying to set up a crawler within AWS Glue that crawls your input data in S3. For some reason, after the crawler finishes executing, it cannot determine the schema from your data, and no tables are created within your AWS Glue Data Catalog. What is the reason for these results?
AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems.
If AWS Glue cannot determine the format of your input data, you will need to set up a custom classifier that helps the AWS Glue crawler determine the schema of your input data.
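As a minimal sketch (the classifier name, Grok pattern, and crawler name are all assumptions), a custom classifier can be registered with boto3 and attached to the crawler:

```python
import boto3

glue = boto3.client("glue")

# Register a custom Grok classifier for a hypothetical log format.
glue.create_classifier(
    GrokClassifier={
        "Classification": "custom_app_logs",
        "Name": "my-custom-log-classifier",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)

# Attach it to the existing crawler so it is tried before the built-ins.
glue.update_crawler(
    Name="my-input-data-crawler",
    Classifiers=["my-custom-log-classifier"],
)
```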
You are an ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache log files stored in S3. Your organization already uses Redshift as its data warehousing solution. Which tool can help you achieve this with the LEAST amount of effort?
Since the organization already uses Redshift as its data warehousing solution, Redshift Spectrum would require less effort than standing up AWS Glue and Athena.
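A rough sketch of the Spectrum route, issued through the Redshift Data API with boto3; the cluster, database, IAM role, and table names are assumptions, and logs.access_logs is presumed to already exist in the external catalog (e.g., created by a Glue crawler):

```python
import boto3

client = boto3.client("redshift-data")

# Hypothetical cluster and database details -- replace with your own.
common = {
    "ClusterIdentifier": "my-redshift-cluster",
    "Database": "dev",
    "DbUser": "awsuser",
}

# External schema backed by the Glue Data Catalog.
client.execute_statement(
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS logs
        FROM DATA CATALOG DATABASE 'apache_logs_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    **common,
)

# Query the S3-resident logs alongside local Redshift tables.
client.execute_statement(
    Sql="SELECT status, COUNT(*) FROM logs.access_logs GROUP BY status;",
    **common,
)
```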
You are an ML specialist working with data that is stored in a distributed EMR cluster on AWS. Currently, your machine learning applications are compatible with the Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive to use the AWS Glue Data Catalog as its metastore. Before you can do this, you need to transfer the Apache Hive metastore tables into an AWS Glue Data Catalog. What steps do you need to take to achieve this with the LEAST amount of effort?
- The benefit of using the Data Catalog (over the Hive metastore) is that it provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore.
- We can simply run a Hive script that queries the metastore tables and outputs that data as CSV (or another format) into S3. Once that data is in S3, we can crawl it to create a Data Catalog of the Hive metastore (see the sketch after this list), or import the data directly from S3.
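A rough sketch of the crawling step with boto3, assuming the Hive script's CSV output landed under an invented S3 prefix; the role and names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Crawl the CSV files the Hive script exported to S3 (path is assumed).
glue.create_crawler(
    Name="hive-metastore-import",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed role
    DatabaseName="hive_metastore_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/hive-export/"}]},
)
glue.start_crawler(Name="hive-metastore-import")
```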
Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics?
Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate these features into your applications.
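For contrast, a minimal sketch (stream name and backoff policy are assumptions) of the retry logic you would have to hand-roll with the raw SDK; the KPL handles this, plus record aggregation and CloudWatch metrics, for you:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_retries(data: bytes, partition_key: str, retries: int = 3):
    """Hand-rolled retry around the low-level API -- the KPL does this,
    plus aggregation and metrics submission, automatically."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream",  # assumed stream name
                Data=data,
                PartitionKey=partition_key,
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("record could not be sent after retries")
```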
You are collecting clickstream data from an e-commerce website to make near-real-time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meet all of the requirements?
- Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose.
- You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met, you can trigger Lambda functions to make real-time product suggestions to users (see the sketch after this list).
- It is not important that we store or persist the clickstream data.
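As a minimal sketch, assuming a Kinesis Data Analytics SQL application configured with a Lambda function as its output destination (the field names here are invented), the function might look like:

```python
import base64
import json

def handler(event, context):
    """Hypothetical Lambda output function for a Kinesis Data Analytics
    SQL application: each record is a row that matched the SQL condition."""
    results = []
    for record in event["records"]:
        row = json.loads(base64.b64decode(record["data"]))
        # Placeholder for the real recommendation call.
        print(f"suggest products for user {row.get('user_id')}")
        results.append({"recordId": record["recordId"], "result": "Ok"})
    # Kinesis Data Analytics expects an Ok/DeliveryFailed status per record.
    return {"records": results}
```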
You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to be processed immediately before operations can continue. The second event type includes data of less importance, and operations can continue without it being processed immediately. What is the most appropriate solution to record these different types of events?
The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type.
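A minimal sketch of the synchronous path for the critical events, with an assumed stream name and payload shape; put_records blocks and reports per-record failures in the response:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Synchronous batch send: the call returns only once Kinesis has
# accepted (or rejected) every record in the batch.
response = kinesis.put_records(
    StreamName="critical-events",  # assumed stream name
    Records=[
        {"Data": json.dumps({"event": "payment", "id": i}).encode(),
         "PartitionKey": str(i)}
        for i in range(10)
    ],
)

# Mission-critical data: surface failures before operations continue.
if response["FailedRecordCount"] > 0:
    raise RuntimeError(f"{response['FailedRecordCount']} records failed")
```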
True or False. If you have mission critical data that must be processed with as minimal delay as possible, you should use the Kinesis API (AWS SDK) over the Kinesis Producer Library.
True
The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly.
Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?
Kinesis Data Streams allows you to stream data into AWS and build custom applications around that streaming data.
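As a sketch of what a custom consumer looks like with the low-level API (stream name assumed; production applications would typically use the Kinesis Client Library instead):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Read one shard of an assumed stream using the low-level consumer API.
shard_id = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["Data"])  # custom processing/analysis goes here
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # respect the per-shard read limits
```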
Your organization has a standalone JavaScript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
- The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams.
- There are ways to process KPL-serialized data within AWS Lambda (in Java, Node.js, and Python), but none of the answer choices mention Lambda.
What are your options for storing data into S3? (Choose 4)
You can use the AWS Management Console, the AWS Command Line Interface (CLI), the AWS SDKs, or the Amazon S3 REST API.
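As a small sketch of the SDK option (bucket and key names are assumptions), boto3 supports both file uploads and direct object writes:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file (bucket and key names are assumed).
s3.upload_file("train.csv", "my-ml-bucket", "datasets/train.csv")

# Or write an in-memory object directly.
s3.put_object(
    Bucket="my-ml-bucket",
    Key="datasets/labels.json",
    Body=b'{"label": "positive"}',
)
```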
Your organization needs to find a way to capture streaming data from certain events customers are performing. These events are a crucial part of the organization's business development and cannot be lost. You've already set up a Kinesis Data Stream and a consumer EC2 instance to process and deliver the data into S3. You've noticed that the last few days of events are not showing up in S3 and your EC2 instance has been shut down. What combination of steps can you take to ensure this does not happen again?
In this setup, the data is ingested by Kinesis Data Streams and processed and delivered to S3 by an EC2 instance. It's best practice to always set up CloudWatch monitoring for your consumer EC2 instance, as well as Auto Scaling so a replacement is launched if the consumer instance is shut down. Since this is critical data that we cannot afford to lose, we should set the stream's retention period to the maximum number of hours (168 hours, or 7 days). Finally, we need to reprocess the failed records that are still in the data stream but were never written to S3.
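A small sketch of the retention step with boto3 (stream name assumed); the call raises retention above the 24-hour default:

```python
import boto3

kinesis = boto3.client("kinesis")

# Raise retention from the 24-hour default to the 7-day maximum the
# course describes, so unprocessed records survive a consumer outage.
kinesis.increase_stream_retention_period(
    StreamName="critical-events",  # assumed stream name
    RetentionPeriodHours=168,
)
```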
You work for a farming company that has dozens of tractors with built-in IoT devices. These devices stream data into AWS using Kinesis Data Streams. The features associated with the data are tractor ID, latitude, longitude, inside temperature, outside temperature, and fuel level. As an ML specialist, you need to transform the data and store it in a data store. Which combination of services can you use to achieve this?
- Kinesis Data Streams and Kinesis Data Analytics cannot write data directly to S3.
- Kinesis Data Firehose is used as the main delivery mechanism for outputting data into S3 (a sketch follows below).
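A rough sketch with invented ARNs: Firehose can use the tractors' existing data stream as its source and handle delivery into S3 (a transformation Lambda could be attached as well):

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical Firehose delivery stream that reads from the tractors'
# Kinesis Data Stream and delivers the records to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="tractor-telemetry-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/tractors",
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseReadRole",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseS3Role",
        "BucketARN": "arn:aws:s3:::tractor-data-lake",
    },
)
```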
You are collecting clickstream data from an e-commerce website using Kinesis Data Firehose. You are using the PutRecord API from the AWS SDK to send the data to the stream. What are the required parameters when sending data to Kinesis Data Firehose using the PutRecord API call?
Kinesis Data Firehose is used as a delivery stream. We do not have to worry about shards, partition keys, etc. All we need is the Firehose DeliveryStreamName and the Record object (which contains the data).
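A minimal sketch of that call with boto3, using an assumed delivery stream name and payload:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Unlike Kinesis Data Streams, no partition key is needed -- just the
# delivery stream name (assumed here) and a Record with the data blob.
firehose.put_record(
    DeliveryStreamName="clickstream-delivery",
    Record={"Data": json.dumps({"page": "/cart", "user": "u123"}).encode()},
)
```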
You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is each player's controller inputs every second (up to 10 players in a game), in JSON format. The data needs to be ingested through Kinesis Data Streams, and each JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
In this scenario, there will be a maximum of 10 records per second, for a maximum payload of 1,000 KB per second (10 records x 100 KB = 1,000 KB) written to the stream. A single shard can ingest up to 1 MB of data per second (and up to 1,000 records per second), which is enough to absorb the 1,000 KB of streaming gameplay data. Therefore, 1 shard is enough to handle the streaming data.
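The reasoning generalizes to a small calculation over both per-shard ingest limits; a quick sketch:

```python
import math

record_size_kb = 100      # one player's JSON blob
records_per_sec = 10      # up to 10 players, one record each per second

# Per-shard ingest limits for Kinesis Data Streams.
shard_kb_per_sec = 1024   # 1 MB/sec
shard_records_per_sec = 1000

shards = max(
    math.ceil(record_size_kb * records_per_sec / shard_kb_per_sec),
    math.ceil(records_per_sec / shard_records_per_sec),
)
print(shards)  # 1 -- a single shard covers both limits
```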