Domain 3: Processing Flashcards
Of the following tools with Amazon EMR, which one is used for querying multiple data stores at once?
1) Presto
2) Hue
3) Ganglia
4) Ambari
1) Presto
Which one of the following statements is NOT TRUE regarding EMR Notebooks?
1) An EMR notebook is stopped if it is idle for an extended time
2) EMR notebook integrates with repositories for version control, including GitHub, CodeCommit, and BitBucket
3) EMR Notebooks can be opened without logging into the AWS Management Console
4) You cannot attach your notebook to a Kerberos-enabled EMR cluster
3) To create or open a notebook and run queries on your EMR cluster, you need to log into the AWS Management Console.
When you delete your EMR cluster, what happens to the EBS volumes?
1) EMR will delete the volumes once the EMR cluster is terminated
2) EBS volumes are preserved
1) EBS volumes will be deleted. If you don't want the data on your cluster to be ephemeral, be sure to store or copy it to S3.
Which one of the following statements is NOT TRUE regarding Apache Pig?
1) Pig supports interactive and batch cluster types
2) Pig is operated by a SQL-like language called Pig Latin
3) When used with Amazon EMR, Pig allows accessing multiple file systems
4) Pig supports access through JDBC
4) Pig doesn’t support access through JDBC
What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?
1) Schedule crawlers to run periodically
2) Using Glue API
3) Using the AWS Console
1) Crawlers can easily be scheduled to run periodically when you define them.
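For illustration, a minimal boto3 sketch of defining a crawler with a schedule; the crawler name, role ARN, database, and S3 path are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler: name, role ARN, database, and S3 path are placeholders.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    # Hourly schedule keeps the Data Catalog in sync without manual runs.
    Schedule="cron(0 * * * ? *)",
)
```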
Which programming languages can be used to write ETL code for AWS Glue?
1) Python and Java
2) Python and Scala
3) Python and Node.JS
4) Scala and C#
2) Python and Scala are the primary languages for Spark, and Glue ETL runs on Apache Spark under the hood.
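As a rough sketch of what a Python (PySpark) Glue ETL script looks like; the database, table, and S3 path are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Hypothetical source table registered in the Glue Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Keep and rename only the columns we care about.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")])

# Write the cleaned data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/clean/orders/"},
    format="parquet")
job.commit()
```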
How can you be notified of the execution of AWS Glue jobs?
1) CloudWatch + SNS
2) CloudWatch + SES
3) Glue events + SES
1) AWS Glue outputs its progress into CloudWatch, which in turn can be integrated with SNS.
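Glue job state changes are emitted as CloudWatch events, so one hedged way to wire up notifications is an event rule that publishes to an SNS topic; the topic ARN and rule name below are hypothetical:

```python
import json
import boto3

events = boto3.client("events")
topic_arn = "arn:aws:sns:us-east-1:123456789012:glue-job-alerts"  # hypothetical

# Match Glue job state-change events and forward them to SNS.
events.put_rule(
    Name="glue-job-state-change",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT"]},
    }),
)
events.put_targets(
    Rule="glue-job-state-change",
    Targets=[{"Id": "sns-alert", "Arn": topic_arn}],
)
```

Note that the SNS topic's access policy must also allow events.amazonaws.com to publish to it.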
You are going to be working with objects arriving in S3. Once they arrive, you want to use AWS Lambda as part of an AWS Data Pipeline to process and transform the data. How can you easily configure Lambda to know the data has arrived in a bucket?
1) Run a cron job to check for new objects arriving in S3
2) Configure S3 bucket notifications to Lambda
3) Configure S3 versioning to include writing logs to Lambda
2) S3 has the ability to trigger a Lambda function whenever a new object appears in a bucket.
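A minimal sketch of the Lambda side: S3 invokes the function with an event describing each new object.

```python
# Minimal Lambda handler sketch for S3 "object created" notifications.
def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object arrived: s3://{bucket}/{key}")
        # ...process / transform the object as part of the pipeline...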
You are going to analyze data arriving in an Amazon Kinesis stream, and you are going to use Lambda to process the records. What is a prerequisite when configuring Lambda to access Kinesis stream records?
1) The Kinesis stream should be in the same account
2) The stream should not be older than 3 hours
3) The Kinesis stream should have a policy document allowing the Lambda function
1) Lambda must be in the same account as the service triggering it, in addition to having an IAM policy granting it access.
Option 3 is not correct because it's actually Lambda that needs an IAM policy allowing Kinesis access, not the other way around.
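As a sketch, the Lambda execution role's policy needs statements roughly like the managed AWSLambdaKinesisExecutionRole; the stream ARN below is hypothetical:

```python
# Kinesis-read permissions to attach to the Lambda execution role.
kinesis_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "kinesis:DescribeStream",
            "kinesis:DescribeStreamSummary",
            "kinesis:GetShardIterator",
            "kinesis:GetRecords",
            "kinesis:ListShards",
            "kinesis:ListStreams",
        ],
        # Hypothetical stream ARN; scope this to your own stream.
        "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    }],
}
```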
You are creating a Lambda - Kinesis stream environment in which Lambda checks for records in the stream and processes them in a Lambda function. How does Lambda know there have been changes or updates to the Kinesis stream?
1) Kinesis streams have an option to configure cron jobs that invoke Lambda functions
2) CloudWatch Logs notify Lambda
3) Kinesis streams notify Lambda
4) Lambda polls Kinesis streams
4) Lambda polls the Kinesis stream for new activity
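The polling is set up through an event source mapping; a hedged boto3 sketch with a hypothetical function name and stream ARN:

```python
import boto3

lambda_client = boto3.client("lambda")

# The event source mapping tells Lambda to poll the stream and invoke
# the function with batches of records.
lambda_client.create_event_source_mapping(
    FunctionName="process-stream-records",  # hypothetical
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",  # hypothetical
    StartingPosition="LATEST",
    BatchSize=100,
)
```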
When using the Amazon Redshift database loader, how does Lambda keep track of files arriving in S3 to be processed and sent to Redshift?
1) In a DynamoDB table
2) In Lambda memory
3) In Lambda parameters
4) In Redshift cluster memory
1) In a DynamoDB table
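To illustrate the idea (not the loader's actual schema), a conditional DynamoDB write is one way to record each S3 file exactly once; the table and attribute names are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("redshift-loader-processed-files")  # hypothetical

def mark_file_once(s3_key):
    """Return True if this file is new; the conditional write fails if it was already tracked."""
    try:
        table.put_item(
            Item={"s3_key": s3_key, "status": "pending"},
            ConditionExpression="attribute_not_exists(s3_key)",
        )
        return True   # new file, safe to load into Redshift
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already processed, skip
        raise
```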
What are Glue Job Bookmarks? What do they do?
Job Bookmarks
- Persist state from the job run
- Prevent reprocessing of old data
- Allow you to process only new data when re-running on a schedule
- Work with S3 sources in a variety of formats
- Work with relational databases via JDBC if the primary keys are in sequential order
- Only handle new rows, not updates to existing rows
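A sketch of how bookmarks show up in a job script: enable them with the --job-bookmark-enable job argument, give each source a transformation_ctx, and call job.commit() so the bookmark advances; the database and table names are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Run the job with --job-bookmark-enable so state persists between runs.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the key the bookmark uses to remember what was read.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="orders_source",
)
# ...transform and write...
job.commit()   # committing the job is what advances the bookmark
```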
Can I do stream processing with Glue ETL?
Yes, Glue ETL has supported serverless stream processing since April 2020.
Glue ETL can now
- Consume from Kinesis or Kafka
- Clean and transform data in-flight
- Store results into S3 or other data stores
Streaming ETL jobs run on Apache Spark Structured Streaming.
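A hedged sketch of a Glue streaming job, assuming a Kinesis-backed table already registered in the Data Catalog; the database, table, and S3 paths are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Streaming DataFrame backed by a Kinesis source in the Data Catalog.
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

def process_batch(batch_df, batch_id):
    # Clean and transform each micro-batch in flight, then land it in S3.
    (batch_df.dropna()
             .write.mode("append")
             .parquet("s3://example-bucket/clean/clickstream/"))

glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={"windowSize": "60 seconds",
             "checkpointLocation": "s3://example-bucket/checkpoints/clickstream/"},
)
job.commit()
```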
What are AWS Glue Triggers?
Triggers are Glue Data Catalog objects. You can use triggers to either manually or automatically start one or more crawlers or ETL jobs.
You can accomplish the same thing by defining workflows. Workflows are preferred for creating complex multi-job ETL operations.
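A boto3 sketch of the two common trigger types, with hypothetical trigger and job names:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: start an ETL job every night at 2 AM UTC.
glue.create_trigger(
    Name="nightly-orders-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-orders-job"}],
    StartOnCreation=True,
)

# Conditional trigger: run a second job only after the first succeeds.
glue.create_trigger(
    Name="after-clean-orders",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "clean-orders-job",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "aggregate-orders-job"}],
    StartOnCreation=True,
)
```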
What is AWS Glue Workflow?
A workflow is a container for a set of related jobs, crawlers, and triggers in AWS Glue.
- You can stop / resume a workflow
- There is no rollback mechanism
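A boto3 sketch of a small workflow that runs a crawler and then an ETL job; all names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Container for the related crawler, job, and triggers.
glue.create_workflow(
    Name="orders-pipeline",
    Description="Crawl raw data, then run the ETL job",
)

# The workflow starts on demand by running the crawler...
glue.create_trigger(
    Name="start-orders-pipeline",
    WorkflowName="orders-pipeline",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "sales-data-crawler"}],
)
# ...and the ETL job runs once the crawler succeeds.
glue.create_trigger(
    Name="crawl-then-etl",
    WorkflowName="orders-pipeline",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "sales-data-crawler",
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "clean-orders-job"}],
    StartOnCreation=True,
)

# Kick off the whole workflow.
glue.start_workflow_run(Name="orders-pipeline")
```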