Domain 3: Processing Flashcards
Of the following tools with Amazon EMR, which one is used for querying multiple data stores at once?
1) Presto
2) Hue
3) Ganglia
4) Ambari
1) Presto
Which one of the following statements is NOT TRUE regarding EMR Notebooks?
1) An EMR notebook is stopped if it is idle for an extended time
2) EMR notebook integrates with repositories for version control, including GitHub, CodeCommit, and BitBucket
3) EMR Notebooks can be opened without logging into the AWS Management Console
4) You cannot attach your notebook to a Kerberos-enabled EMR cluster
3) To create or open a notebook and run queries on your EMR cluster, you need to log into the AWS Management Console.
When you delete your EMR cluster, what happens to the EBS volumes?
1) EMR will delete the volumes once the EMR cluster is terminated
2) EBS volumes are preserved
1) EBS volumes will be deleted. If you don't want the data on your cluster to be ephemeral, be sure to store or copy it to S3.
Which one of the following statements is NOT TRUE regarding Apache Pig?
1) Pig supports interactive and batch cluster types
2) Pig is operated by a SQL-like language called Pig Latin
3) When used with Amazon EMR, Pig allows accessing multiple file systems
4) Pig supports access through JDBC
4) Pig doesn’t support access through JDBC
What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?
1) Schedule crawlers to run periodically
2) Using Glue API
3) Using the AWS Console
1) Crawlers can easily be scheduled to run periodically when you define them.
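For illustration, a minimal boto3 sketch of defining a crawler with a schedule; the crawler name, role ARN, database, and S3 path are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler: name, role ARN, database, and S3 path are placeholders.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    # Hourly schedule keeps the Data Catalog in sync without manual runs.
    Schedule="cron(0 * * * ? *)",
)
```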
Which programming languages can be used to write ETL code for AWS Glue?
1) Python and Java
2) Python and Scala
3) Python and Node.JS
4) Scala and C#
2) Python and Scala are the primary languages for Spark, and Glue ETL runs on Apache Spark under the hood.
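As a rough sketch of what a Python (PySpark) Glue ETL script looks like; the database, table, and S3 path are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Hypothetical source table registered in the Glue Data Catalog.
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Keep and rename only the columns we care about.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")])

# Write the cleaned data back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/clean/orders/"},
    format="parquet")
job.commit()
```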
How can you be notified of the execution of AWS Glue jobs?
1) CloudWatch + SNS
2) CloudWatch + SES
3) Glue events + SES
1) AWS Glue outputs its progress into CloudWatch, which in turn can be integrated with SNS.
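Glue job state changes are emitted as CloudWatch events, so one hedged way to wire up notifications is an event rule that publishes to an SNS topic; the topic ARN and rule name below are hypothetical:

```python
import json
import boto3

events = boto3.client("events")
topic_arn = "arn:aws:sns:us-east-1:123456789012:glue-job-alerts"  # hypothetical

# Match Glue job state-change events and forward them to SNS.
events.put_rule(
    Name="glue-job-state-change",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT"]},
    }),
)
events.put_targets(
    Rule="glue-job-state-change",
    Targets=[{"Id": "sns-alert", "Arn": topic_arn}],
)
```

Note that the SNS topic's access policy must also allow events.amazonaws.com to publish to it.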
You are going to be working with objects arriving in S3. Once they arrive, you want to use AWS Lambda as part of an AWS Data Pipeline to process and transform the data. How can you easily configure Lambda to know the data has arrived in a bucket?
1) Run a cron job to check for new objects arriving in S3
2) Configure S3 bucket notifications to Lambda
3) Configure S3 versioning to include writing logs to Lambda
2) S3 has the ability to trigger a Lambda function whenever a new object appears in a bucket.
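A minimal sketch of the Lambda side: S3 invokes the function with an event describing each new object.

```python
# Minimal Lambda handler sketch for S3 "object created" notifications.
def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object arrived: s3://{bucket}/{key}")
        # ...process / transform the object as part of the pipeline...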
You are going to analyze data arriving in an Amazon Kinesis stream, and you are going to use Lambda to process the records. What is a prerequisite when configuring Lambda to access Kinesis stream records?
1) The Kinesis stream should be in the same account
2) The stream should not be older than 3 hours
3) The Kinesis stream should have a policy document allowing the Lambda function
1) Lambda must be in the same account as the service triggering it, in addition to having an IAM policy granting it access.
Option 3 is not correct because it's actually Lambda that needs an IAM policy allowing Kinesis access, not the other way around.
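As a sketch, the Lambda execution role's policy needs statements roughly like the managed AWSLambdaKinesisExecutionRole; the stream ARN below is hypothetical:

```python
# Kinesis-read permissions to attach to the Lambda execution role.
kinesis_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "kinesis:DescribeStream",
            "kinesis:DescribeStreamSummary",
            "kinesis:GetShardIterator",
            "kinesis:GetRecords",
            "kinesis:ListShards",
            "kinesis:ListStreams",
        ],
        # Hypothetical stream ARN; scope this to your own stream.
        "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    }],
}
```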
You are creating a Lambda - Kinesis stream environment in which Lambda checks for records in the stream and processes them in a Lambda function. How does Lambda know there have been changes or updates to the Kinesis stream?
1) Kinesis streams have an option to configure cron jobs that invoke Lambda functions
2) CloudWatch Logs notify Lambda
3) Kinesis streams notify Lambda
4) Lambda polls Kinesis streams
4) Lambda polls the Kinesis stream for new activity
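The polling is set up through an event source mapping; a hedged boto3 sketch with a hypothetical function name and stream ARN:

```python
import boto3

lambda_client = boto3.client("lambda")

# The event source mapping tells Lambda to poll the stream and invoke
# the function with batches of records.
lambda_client.create_event_source_mapping(
    FunctionName="process-stream-records",  # hypothetical
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",  # hypothetical
    StartingPosition="LATEST",
    BatchSize=100,
)
```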
When using the Amazon Redshift database loader, how does Lambda keep track of files arriving in S3 to be processed and sent to Redshift?
1) In a DynamoDB table
2) In Lambda memory
3) In Lambda parameters
4) In Redshift cluster memory
1) In a DynamoDB table
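To illustrate the idea (not the loader's actual schema), a conditional DynamoDB write is one way to record each S3 file exactly once; the table and attribute names are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("redshift-loader-processed-files")  # hypothetical

def mark_file_once(s3_key):
    """Return True if this file is new; the conditional write fails if it was already tracked."""
    try:
        table.put_item(
            Item={"s3_key": s3_key, "status": "pending"},
            ConditionExpression="attribute_not_exists(s3_key)",
        )
        return True   # new file, safe to load into Redshift
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already processed, skip
        raise
```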
What are Glue Job Bookmarks? What do they do?
Job Bookmarks
- Persist state from the job run
- Prevent reprocessing of old data
- Allow you to process only new data when re-running on a schedule
- Work with S3 sources in a variety of formats
- Work with relational databases via JDBC if the primary keys are in sequential order
- Only handle new rows, not updates to existing rows
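A sketch of how bookmarks show up in a job script: enable them with the --job-bookmark-enable job argument, give each source a transformation_ctx, and call job.commit() so the bookmark advances; the database and table names are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Run the job with --job-bookmark-enable so state persists between runs.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx is the key the bookmark uses to remember what was read.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
    transformation_ctx="orders_source",
)
# ...transform and write...
job.commit()   # committing the job is what advances the bookmark
```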
Can I do stream processing with Glue ETL?
Yes, Glue ETL has supported serverless stream processing since April 2020.
Glue ETL can now
- Consume from Kinesis or Kafka
- Clean and transform data in-flight
- Store results into S3 or other data stores
Streaming ETL jobs run on Apache Spark Structured Streaming.
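A hedged sketch of a Glue streaming job, assuming a Kinesis-backed table already registered in the Data Catalog; the database, table, and S3 paths are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Streaming DataFrame backed by a Kinesis source in the Data Catalog.
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON"},
)

def process_batch(batch_df, batch_id):
    # Clean and transform each micro-batch in flight, then land it in S3.
    (batch_df.dropna()
             .write.mode("append")
             .parquet("s3://example-bucket/clean/clickstream/"))

glueContext.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={"windowSize": "60 seconds",
             "checkpointLocation": "s3://example-bucket/checkpoints/clickstream/"},
)
job.commit()
```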
What are AWS Glue Triggers?
Triggers are Glue Data Catalog objects. You can use triggers to either manually or automatically start one or more crawlers or ETL jobs.
You can accomplish the same thing by defining workflows. Workflows are preferred for creating complex multi-job ETL operations.
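A boto3 sketch of the two common trigger types, with hypothetical trigger and job names:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: start an ETL job every night at 2 AM UTC.
glue.create_trigger(
    Name="nightly-orders-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "clean-orders-job"}],
    StartOnCreation=True,
)

# Conditional trigger: run a second job only after the first succeeds.
glue.create_trigger(
    Name="after-clean-orders",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "clean-orders-job",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "aggregate-orders-job"}],
    StartOnCreation=True,
)
```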
What is AWS Glue Workflow?
A workflow is a container for a set of related jobs, crawlers, and triggers in AWS Glue.
- You can stop / resume a workflow
- There is no rollback mechanism
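A boto3 sketch of a small workflow that runs a crawler and then an ETL job; all names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Container for the related crawler, job, and triggers.
glue.create_workflow(
    Name="orders-pipeline",
    Description="Crawl raw data, then run the ETL job",
)

# The workflow starts on demand by running the crawler...
glue.create_trigger(
    Name="start-orders-pipeline",
    WorkflowName="orders-pipeline",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "sales-data-crawler"}],
)
# ...and the ETL job runs once the crawler succeeds.
glue.create_trigger(
    Name="crawl-then-etl",
    WorkflowName="orders-pipeline",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "CrawlerName": "sales-data-crawler",
        "CrawlState": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "clean-orders-job"}],
    StartOnCreation=True,
)

# Kick off the whole workflow.
glue.start_workflow_run(Name="orders-pipeline")
```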