Domain 3: Processing Flashcards

1
Q

Which of the following tools available with Amazon EMR is used for querying multiple data stores at once?

1) Presto
2) Hue
3) Ganglia
4) Ambari

A

1) Presto

2
Q

Which one of the following statements is NOT TRUE regarding EMR Notebooks?

1) An EMR notebook is stopped if it is idle for an extended time
2) EMR notebook integrates with repositories for version control, including GitHub, CodeCommit, and BitBucket
3) EMR Notebooks can be opened without logging into the AWS Management Console
4) You cannot attach your notebook to a Kerberos-enabled EMR cluster

A

3) To create or open a notebook and run queries on your EMR cluster, you need to log in to the AWS Management Console.

3
Q

When you delete your EMR cluster, what happens to the EBS volumes?

1) EMR will delete the volumes once the EMR cluster is terminated
2) EBS volumes are preserved

A

1) EMR deletes the volumes once the cluster is terminated. If you don’t want the data on your cluster to be ephemeral, be sure to store or copy it to S3.
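For illustration, one way to copy cluster data to S3 before termination is an s3-dist-cp step. A minimal sketch, assuming boto3; the bucket name and HDFS path below are placeholders:

```python
# An EMR step that copies HDFS data to S3 with s3-dist-cp before the
# cluster is terminated. In practice you would submit this dict via
# boto3's EMR client: emr.add_job_flow_steps(JobFlowId=..., Steps=[step]).
step = {
    "Name": "Copy HDFS output to S3",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "s3-dist-cp",
            "--src", "hdfs:///output",          # hypothetical HDFS path
            "--dest", "s3://my-bucket/output",  # hypothetical bucket
        ],
    },
}
```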

4
Q

Which one of the following statements is NOT TRUE regarding Apache Pig?

1) Pig supports interactive and batch cluster types
2) Pig is operated by a SQL-like language called Pig Latin
3) When used with Amazon EMR, Pig allows accessing multiple file systems
4) Pig supports access through JDBC

A

4) Pig doesn’t support access through JDBC

5
Q

What is the simplest way to make sure the metadata in the Glue Data Catalog is always up to date and in sync with the underlying data, without your intervention each time?

1) Schedule crawlers to run periodically
2) Using Glue API
3) Using the AWS Console

A

1) Crawlers can easily be scheduled to run periodically when you define them.
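A minimal sketch of a crawler definition with a schedule, assuming boto3; the crawler name, IAM role, database, and S3 path are placeholder values:

```python
# The kwargs below would be passed to boto3's Glue client:
# glue.create_crawler(**crawler). The Schedule field takes a 6-field
# cron expression; this one fires at the top of every hour.
crawler = {
    "Name": "my-crawler",               # hypothetical crawler name
    "Role": "AWSGlueServiceRole-demo",  # hypothetical IAM role
    "DatabaseName": "my_catalog_db",    # hypothetical catalog database
    "Targets": {"S3Targets": [{"Path": "s3://my-bucket/data/"}]},
    "Schedule": "cron(0 * * * ? *)",
}
```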

6
Q

Which programming languages can be used to write ETL code for AWS Glue?

1) Python and Java
2) Python and Scala
3) Python and Node.JS
4) Scala and C#

A

2) Python and Scala are the primary languages for Spark, and Glue ETL runs on Apache Spark under the hood

7
Q

How can you be notified of the execution of AWS Glue jobs?

1) Cloudwatch + SNS
2) Cloudwatch + SES
3) Glue events + SES

A

1) AWS Glue publishes its progress to CloudWatch, which in turn can be integrated with SNS
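One common wiring, sketched below with placeholder rule and topic names: an EventBridge (CloudWatch Events) rule matching Glue job state-change events, targeting an SNS topic.

```python
import json

# An event pattern that matches Glue job completion events. A rule
# with this pattern can target an SNS topic so subscribers are
# notified; the put_rule/put_targets calls are shown in comments.
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT"]},
}
# events.put_rule(Name="glue-job-state", EventPattern=json.dumps(pattern))
# events.put_targets(Rule="glue-job-state",
#                    Targets=[{"Id": "1", "Arn": sns_topic_arn}])
print(json.dumps(pattern))
```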

8
Q

You are going to be working with objects arriving in S3. Once they arrive you want to use AWS Lambda as a part of an AWS Data Pipeline to process and transform the data. How can you easily configure Lambda to know the data has arrived in a bucket?

1) Run a cron job to check for new objects arriving in S3
2) Configure S3 bucket notifications to Lambda
3) Configure S3 versioning to include writing logs to Lambda

A

2) S3 has the ability to trigger a Lambda function whenever a new object appears in a bucket.
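A minimal sketch of that bucket notification configuration (the function ARN and prefix are placeholders), as you would pass it to boto3's s3.put_bucket_notification_configuration:

```python
# Tells S3 to invoke a Lambda function whenever a new object is
# created in the bucket. Apply with:
# s3.put_bucket_notification_configuration(
#     Bucket="my-bucket", NotificationConfiguration=config)
config = {
    "LambdaFunctionConfigurations": [
        {
            # Hypothetical function ARN:
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-new-object",
            "Events": ["s3:ObjectCreated:*"],
            # Optional: fire only for objects under a given prefix.
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "incoming/"}
            ]}},
        }
    ]
}
```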

9
Q

You are going to analyze the data coming into an Amazon Kinesis stream. You are going to use Lambda to process these records. What is a prerequisite when defining Lambda to access Kinesis stream records?

1) The Kinesis stream should be in the same account
2) The stream should not be older than 3 hours
3) The Kinesis stream should have a policy document allowing the Lambda function

A

1) Lambda must be in the same account as the service triggering it, in addition to having an IAM policy granting it access.
3) is not correct, since it’s actually Lambda that needs an IAM policy allowing Kinesis access, not the other way around.
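For illustration, the kind of statement the Lambda execution role's policy needs; the stream ARN is a placeholder, and the actions broadly match what the managed AWSLambdaKinesisExecutionRole policy grants:

```python
# An IAM policy (expressed as a Python dict for illustration) that
# lets a Lambda execution role read from one Kinesis stream.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:ListStreams",
            ],
            # Hypothetical stream ARN:
            "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        }
    ],
}
```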

10
Q

You are creating a Lambda - Kinesis stream environment in which Lambda checks for records in the stream and processes them in its Lambda function. How does Lambda know there have been changes or updates to the Kinesis stream?

1) Kinesis streams have options to configure cron jobs to invoke Lambda functions
2) CloudWatch logs notify Lambda
3) Kinesis streams notify Lambda
4) Lambda polls Kinesis streams

A

4) Lambda polls the Kinesis stream for new activity
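That polling is configured with an event source mapping; a sketch of the parameters (the stream ARN and function name are placeholders) for boto3's lambda client create_event_source_mapping:

```python
# The mapping makes the Lambda service poll the stream and invoke the
# function with batches of records. Submit with:
# lam.create_event_source_mapping(**mapping)
mapping = {
    # Hypothetical stream ARN and function name:
    "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    "FunctionName": "process-stream-records",
    "StartingPosition": "LATEST",  # or TRIM_HORIZON to start at the oldest record
    "BatchSize": 100,              # max records per invocation
}
```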

11
Q

When using an Amazon Redshift database loader, how does Lambda keep track of files arriving in S3 to be processed and sent to Redshift?

1) In a DynamoDB table
2) In Lambda memory
3) In Lambda parameters
4) In Redshift cluster memory

A

1) In a DynamoDB table

12
Q

What is Glue Job Bookmark? What does it do?

A

Job Bookmarks

  • Persist state from the job run
  • Prevent reprocessing of old data
  • Allow you to process only new data when re-running on a schedule
  • Work with S3 sources in a variety of formats
  • Work with relational databases via JDBC if primary keys are in sequential order
    • Only handles new rows, not updated rows
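Bookmarks are toggled per run through a special job argument; a short sketch with a placeholder job name, assuming boto3:

```python
# Parameters for starting a Glue job run with bookmarks enabled:
# glue.start_job_run(**run_args)
run_args = {
    "JobName": "my-etl-job",  # hypothetical job name
    "Arguments": {
        # Valid values: job-bookmark-enable, job-bookmark-disable,
        # job-bookmark-pause
        "--job-bookmark-option": "job-bookmark-enable",
    },
}
```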
13
Q

Can I do streaming processing with Glue ETL?

A

Yes, Glue ETL has supported serverless streaming processing since April 2020.

Glue ETL can now

  • Consume from Kinesis or Kafka
  • Clean and transform in-flight
  • Store results into S3 or other data sources
  • Run on Apache Spark Structured Streaming
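A hedged sketch of defining such a job with boto3 (names, role, and script location are placeholders); the command name gluestreaming, rather than glueetl, is what marks the job as streaming:

```python
# Parameters for creating a streaming ETL job:
# glue.create_job(**job)
job = {
    "Name": "kinesis-to-s3-streaming",  # hypothetical job name
    "Role": "AWSGlueServiceRole-demo",  # hypothetical IAM role
    "Command": {
        "Name": "gluestreaming",        # "glueetl" would be a batch job
        "ScriptLocation": "s3://my-bucket/scripts/stream_etl.py",
        "PythonVersion": "3",
    },
    "GlueVersion": "2.0",
}
```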
14
Q

What are AWS Glue triggers?

A

Triggers are Glue Data Catalog objects. You can use them to start one or more crawlers or ETL jobs, either manually or automatically.

You can accomplish the same thing by defining workflows. Workflows are preferred for creating complex multi-job ETL operations.
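For illustration, sketches of two common trigger shapes, with placeholder job, crawler, and trigger names, as they would be passed to boto3's glue.create_trigger:

```python
# A scheduled trigger that starts a job on a cron schedule.
scheduled = {
    "Name": "nightly-etl",               # hypothetical trigger name
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",     # 02:00 UTC daily
    "Actions": [{"JobName": "my-etl-job"}],
    "StartOnCreation": True,
}

# A conditional trigger that starts the job only after a crawler
# finishes successfully.
conditional = {
    "Name": "after-crawl",
    "Type": "CONDITIONAL",
    "Predicate": {"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "my-crawler",     # hypothetical crawler name
        "CrawlState": "SUCCEEDED",
    }]},
    "Actions": [{"JobName": "my-etl-job"}],
}
```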

15
Q

What is AWS Glue Workflow?

A

A workflow is a container for a set of related jobs, crawlers, and triggers in AWS Glue.

  • You can stop / resume a workflow
  • There is no rollback mechanism
16
Q

What is a Glue development endpoint?

A

A development endpoint is an environment that you can use to develop and test your AWS Glue scripts.

  • It can be deployed into your VPC with a private IP, but it can also be made accessible from the Internet
  • It is assigned a role with permissions to access data sources and create objects in the Glue Data Catalog
  • You can SSH into your development endpoints
17
Q

What are the three major components of AWS Glue?

A
  1. AWS Glue Data Catalog - a central metadata repository
  2. An ETL engine that automatically generates Python or Scala code
  3. A flexible scheduler that handles dependency resolution, job monitoring, and retries
18
Q

What is AWS Glue DynamicFrame?

A

DynamicFrame - a distributed table that supports nested data such as structures and arrays.
- Self-describing, i.e. each record carries both its schema and its data

  • You can use both Apache Spark DataFrames and DynamicFrames in an ETL script, and you can convert between the two
19
Q

What is an AWS Glue connection?

A

An AWS Glue connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, VPC information, etc.