Processing Flashcards

1
Q

You are going to be working with objects arriving in S3. Once they arrive, you want to use AWS Lambda as part of an AWS Data Pipeline to process and transform the data. How can you easily configure Lambda to know that data has arrived in a bucket?

A

Configure S3 bucket notifications to Lambda (Lambda functions are generally invoked by some sort of trigger. S3 has the ability to trigger a Lambda function whenever a new object appears in a bucket.)
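The notification described above can be sketched as the configuration you attach to the bucket. A minimal sketch, assuming a hypothetical bucket, function ARN, and key prefix (none of these come from the question):

```python
# Sketch of an S3 event notification that triggers a Lambda function
# whenever a new object lands in a bucket. The bucket name, function ARN,
# and prefix are illustrative placeholders.
notification_config = {
    "LambdaFunctionConfigurations": [
        {
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-data",
            "Events": ["s3:ObjectCreated:*"],  # fire on any new object
            "Filter": {
                "Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}
            },
        }
    ]
}

# With boto3 this would be applied roughly as:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_notification_configuration(
#       Bucket="my-data-bucket",
#       NotificationConfiguration=notification_config,
#   )
```

The Lambda function must also grant S3 permission to invoke it (a resource-based policy), which the console sets up automatically when you add the trigger there.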

2
Q

You are going to analyze the data arriving in an Amazon Kinesis stream, and you are going to use Lambda to process these records. What is a prerequisite when configuring Lambda to access Kinesis stream records?

A

The Kinesis stream should be in the same account (Lambda must be in the same account as the service triggering it, in addition to having an IAM policy granting it access.)

3
Q

How can you make sure your Lambda functions have access to the other resources you are using in your big data architecture like S3, Redshift, etc.?

A

Using proper IAM roles (IAM roles define the access a Lambda function has to the services it communicates with.)
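An execution role boils down to two policy documents. A minimal sketch, assuming a hypothetical role name, bucket, and S3-only permissions (Redshift access would follow the same pattern with different actions):

```python
# Sketch of the two policy documents behind a Lambda execution role:
# a trust policy letting the Lambda service assume the role, and a
# permissions policy granting read access to an (illustrative) S3 bucket.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-data-bucket",
            "arn:aws:s3:::my-data-bucket/*",
        ],
    }],
}

# Roughly, with boto3:
#   iam = boto3.client("iam")
#   iam.create_role(RoleName="lambda-bigdata-role",
#                   AssumeRolePolicyDocument=json.dumps(trust_policy))
#   iam.put_role_policy(RoleName="lambda-bigdata-role",
#                       PolicyName="s3-read",
#                       PolicyDocument=json.dumps(permissions_policy))
```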

4
Q

You are creating a Lambda-Kinesis environment in which Lambda checks for records in the stream and does some processing in its function. How does Lambda know there have been changes or updates to the Kinesis stream?

A

Lambda polls Kinesis streams (Although you think of a trigger as “pushing” events, Lambda actually polls your Kinesis streams for new activity.)
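The polling is configured through an event source mapping. A minimal sketch, with a hypothetical stream ARN and function name:

```python
# Sketch of the event source mapping that tells Lambda to poll a Kinesis
# stream on your behalf. Stream ARN and function name are placeholders.
event_source_mapping = {
    "EventSourceArn": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    "FunctionName": "process-records",
    "StartingPosition": "LATEST",  # or TRIM_HORIZON to start at the oldest record
    "BatchSize": 100,              # max records handed to each invocation
}

# Roughly, with boto3:
#   lambda_client = boto3.client("lambda")
#   lambda_client.create_event_source_mapping(**event_source_mapping)
```

Once the mapping exists, the Lambda service polls each shard and invokes your function with batches of records; your code never calls Kinesis directly.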

5
Q

When using an Amazon Redshift database loader, how does Lambda keep track of files arriving in S3 to be processed and sent to Redshift?

A

In a DynamoDB table

6
Q

You want to load data from a MySQL server installed on an EC2 t2.micro instance to be processed by AWS Glue. Which of the following applies best here?

A

The instance should be in your VPC (Although we didn’t really discuss access controls, you could arrive at this answer through process of elimination; you’ll find yourself doing that a lot on the exam. This isn’t really a Glue-specific question; it’s more about how to connect an AWS service such as Glue to EC2.)

7
Q

What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?

A

Schedule crawlers to run periodically (Crawlers can easily be scheduled to run periodically when you define them.)
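A scheduled crawler can be sketched as a single definition. A minimal sketch, assuming a hypothetical crawler name, role, database, and S3 path:

```python
# Sketch of a Glue crawler definition with a cron schedule, so the Data
# Catalog stays in sync without manual runs. Names and paths are placeholders.
crawler_definition = {
    "Name": "s3-incoming-crawler",
    "Role": "AWSGlueServiceRole-demo",
    "DatabaseName": "analytics",
    "Targets": {"S3Targets": [{"Path": "s3://my-data-bucket/incoming/"}]},
    # Glue schedules use a cron-style expression; this one runs daily at 02:00 UTC.
    "Schedule": "cron(0 2 * * ? *)",
}

# Roughly, with boto3:
#   glue = boto3.client("glue")
#   glue.create_crawler(**crawler_definition)
```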

8
Q

Which programming languages can be used to write ETL code for AWS Glue?

A

Python and Scala (Glue ETL runs on Apache Spark under the hood, and these happen to be the primary languages used for Spark development.)

9
Q

Can you run existing ETL jobs with AWS Glue?

A

Yes (You can run your existing Scala or Python code on AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.)
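Reusing existing code comes down to a job definition that points at a script in S3. A minimal sketch, with a hypothetical job name, role, and script location:

```python
# Sketch of a Glue job definition that reuses existing ETL code already
# uploaded to S3. Several jobs can point at the same ScriptLocation.
job_definition = {
    "Name": "nightly-etl",
    "Role": "AWSGlueServiceRole-demo",
    "Command": {
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-etl-code/transform.py",
        "PythonVersion": "3",
    },
}

# Roughly, with boto3:
#   glue = boto3.client("glue")
#   glue.create_job(**job_definition)
```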

10
Q

How can you be notified of the execution of AWS Glue jobs?

A

Using Cloudwatch + SNS
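The CloudWatch + SNS wiring can be sketched as an event pattern matched by a rule that forwards to a topic. A minimal sketch, with a hypothetical rule name and topic ARN:

```python
# Sketch of a CloudWatch Events (EventBridge) pattern that matches Glue job
# state changes, so a rule can forward them to an SNS topic for notification.
event_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["SUCCEEDED", "FAILED", "TIMEOUT"]},
}

# Roughly, with boto3:
#   events = boto3.client("events")
#   events.put_rule(Name="glue-job-alerts",
#                   EventPattern=json.dumps(event_pattern))
#   events.put_targets(Rule="glue-job-alerts",
#                      Targets=[{"Id": "1",
#                                "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}])
```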

11
Q

Of the following tools with Amazon EMR, which one is used for querying multiple data stores at once?

  • Presto
  • Hue
  • Ganglia
  • Ambari
A

Presto

12
Q

Which one of the following statements is NOT TRUE regarding EMR Notebooks?

  • An EMR Notebook is stopped if it is idle for an extended time
  • EMR Notebooks currently do not integrate with repositories for version control
  • EMR Notebooks can be opened without logging into the AWS Management Console
  • You cannot attach your notebook to a Kerberos enabled EMR cluster
A

EMR Notebooks can be opened without logging into the AWS Management Console

13
Q

How can you get a history of all EMR API calls made on your account for security or compliance auditing?

A

Using AWS CloudTrail
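The audit query can be sketched as a CloudTrail event lookup filtered to the EMR event source. A minimal sketch:

```python
# Sketch of a CloudTrail lookup for recent EMR API calls. The event source
# value is the EMR service endpoint name.
lookup_params = {
    "LookupAttributes": [
        {"AttributeKey": "EventSource",
         "AttributeValue": "elasticmapreduce.amazonaws.com"}
    ],
    "MaxResults": 50,
}

# Roughly, with boto3:
#   cloudtrail = boto3.client("cloudtrail")
#   for event in cloudtrail.lookup_events(**lookup_params)["Events"]:
#       print(event["EventName"], event["EventTime"])
```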

14
Q

When you delete your EMR cluster, what happens to the EBS volumes?

A

EMR deletes the volumes once the cluster is terminated.

15
Q

Which one of the following statements is NOT TRUE regarding Apache Pig?

  • Pig supports interactive and batch cluster types
  • Pig is operated by a SQL-like language called Pig Latin
  • When used with Amazon EMR, Pig allows accessing multiple filesystems
  • Pig supports access through JDBC
A

Pig supports access through JDBC

16
Q

What limit, if any, is there to the size of your training dataset in Amazon Machine Learning by default?

A

100 GB (By default, Amazon ML is limited to 100 GB of training data. You can file a support ticket to have this increased, but Amazon ML cannot handle terabyte-scale data.)

17
Q

The audit team of an organization needs a history of Amazon SageMaker API calls made on their account for security analysis and operational troubleshooting purposes. Which of the following services helps in this regard?

A

CloudTrail (SageMaker outputs its results to both CloudTrail and CloudWatch, but CloudTrail is specifically designed for auditing purposes.)

18
Q

Is there a limit to the size of the dataset that you can use for training models with Amazon SageMaker? If so, what is the limit?

A

No fixed limit (There are no fixed limits to the size of the dataset you can use for training models with Amazon SageMaker.)

19
Q

Which of the following is a new Amazon SageMaker capability that enables machine learning models to train once and run anywhere in the cloud and at the edge?

  • SageMaker Neo
  • SageMaker Search
  • Batch Transform
  • Jupyter Notebooks
A

SageMaker Neo

20
Q

A Python developer is planning to develop a machine learning model to predict real estate prices using a Jupyter notebook, and to train and deploy this model in a highly available and scalable manner. The developer wishes to avoid worrying about provisioning sufficient capacity for this model. Which of the following services is best suited for this?

A

Amazon SageMaker (SageMaker is the only scalable solution that is both fully managed and uses Jupyter notebooks.)

21
Q

Which open-source web interface provides you with an easy way to run scripts, manage the Hive metastore, and view HDFS?

  • Apache Zeppelin
  • Ganglia
  • YARN Resource Manager
  • Hue
A

Hue
(Hue (Hadoop User Experience) is an open-source, web-based graphical user interface for use with Amazon EMR and Apache Hadoop. Hue groups several different Hadoop ecosystem projects together into a configurable interface for your Amazon EMR cluster. Further information: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hue.html)

22
Q

Which of the following are the 4 modules (libraries) of Spark? (Choose 4)

  • Apache Mesos
  • SparkSQL
  • Spark Streaming
  • GraphX
  • MLlib
  • YARN
A
  • SparkSQL
  • Spark Streaming
  • GraphX
  • MLlib
23
Q

Which of the following does Spark Streaming use to consume data from a Kinesis Stream?

  • Kinesis Client Library
  • Kinesis Consumer Library
  • Kinesis Connector Library
  • Kinesis Producer Library
A

Kinesis Client Library
(Spark Streaming uses the Kinesis Client Library (KCL) to consume data from a Kinesis stream. KCL handles complex tasks like load balancing, failure recovery, and check-pointing. Further information: https://spark.apache.org/docs/latest/streaming-kinesis-integration.html)

24
Q

True or False: EBS volumes used with EMR persist after the cluster is terminated.

A

False
(When EBS volumes are used with EMR, the volumes do not persist after cluster termination. Compare this to how EBS behaves when used with an ordinary EC2 instance: it is possible for the volume to persist after its instance is terminated.)

25
Q

You have just joined a company that has a petabyte of data stored in multiple data sources. The data sources include Hive, Cassandra, Redis, and MongoDB. The company has hundreds of employees all querying the data at a high concurrency rate. These queries take between a sub-second and several minutes to run. The queries are processed in-memory, and avoid high I/O and latency. A lot of your new colleagues are also happy they did not have to learn a new language when querying the multiple data sources. Which open-source tool do you think your new colleagues are using?

  • Presto
  • Big Data Query Engine
  • Hive
  • SparkSQL
A

Presto

(Presto is a fast, open-source, in-memory, distributed SQL query engine. Since it uses ANSI SQL, you don’t have to learn another language to use it. It is used to run interactive analytic queries against a variety of data sources with sizes ranging from GBs to PBs. These data sources include Cassandra, Hive, Kafka, MongoDB, MySQL, PostgreSQL, Redis, and a number of others. Presto is also significantly faster than Hive. Further information: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html)

26
Q

When should you not use Spark? (Choose 2)

  • In multi-user environments with high concurrency
  • For ETL workloads
  • For interactive analytics
  • For batch processing
A
  • In multi-user environments with high concurrency
  • For batch processing

(Do not use Spark for batch processing. With Spark, there is minimal disk I/O, and the data being queried from the multiple data stores needs to fit into memory. Queries that require a lot of memory can fail. For large batch jobs, consider Hive. Also avoid using Spark for multi-user reporting with many concurrent requests.)

27
Q

How are EMR task nodes different from core nodes? (Choose 3)

  • Task nodes run the NodeManager daemon.
  • Task nodes are optional.
  • Task nodes run the Resource Manager.
  • They are used for extra capacity when additional CPU and RAM are needed.
  • Task nodes do not include HDFS.
A
  • Task nodes are optional.
  • They are used for extra capacity when additional CPU and RAM are needed.
  • Task nodes do not include HDFS.

(Task nodes are used for extra capacity when additional CPU and RAM are needed. They are optional and do not include HDFS. Further information: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html#w2ab1c18c25c17)

28
Q

You plan to use EMR to process a large amount of data that will eventually be stored in S3. The data is currently on-premises, and will be migrated to AWS using the Snowball service. The file sizes range from 300 MB to 500 MB. Over the next 6 months, your company will migrate over 2 PB of data to S3, and costs are a concern. Which compression algorithm provides you with the highest compression ratio, allowing you to both maximize performance and minimize costs?

  • bzip2
  • LZO
  • Snappy
  • GZIP
A

bzip2

Of the possible selections, the bzip2 compression algorithm has the highest compression ratio.
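The ratio difference is easy to see locally with Python’s standard library, compressing the same repetitive sample with both codecs. This is a quick local illustration, not an EMR-specific benchmark, and the sample data is made up:

```python
import bz2
import gzip

# Compress the same repetitive sample with bzip2 and gzip (DEFLATE) to
# compare compression ratios. Highly repetitive data exaggerates the effect.
sample = b"user_id,event,timestamp\n" * 10_000

bz2_size = len(bz2.compress(sample))
gzip_size = len(gzip.compress(sample))

print(f"original: {len(sample)} bytes")
print(f"bzip2:    {bz2_size} bytes")
print(f"gzip:     {gzip_size} bytes")
```

Real-world ratios depend heavily on the data, but bzip2 generally compresses tighter than gzip at the cost of slower compression, which matters when you are paying for 2 PB of S3 storage.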

29
Q

True or False: Presto is a database.

A

False
(Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.)

30
Q

You have just joined a new company as an AWS Big Data Architect, replacing an architect who left to join a different company. As a data driven company, your company has started using several of AWS’ Big Data services in the last 6 months. Your new manager is concerned that the AWS charges are too high, and she has asked you to review the monthly bills. After review, you determine that the EMR costs are unnecessarily high considering the company uses EMR to process new data within a 6 hour period that starts at midnight and ends between 5 AM and 7 AM, depending on the amount of data that needs to be processed. The data that needs to be processed is already in S3. However, it appears that the EMR cluster that processes the data is running 24 hours a day, 7 days a week. What type of cluster should your predecessor have configured in order to keep costs low and not unnecessarily waste resources?

  • Nothing. AWS announces frequent price reductions, and costs will balance out over time.
  • He should have used auto-scaling to reduce the number of core nodes and task nodes running when no processing is taking place.
  • Your predecessor should have configured the cluster as a transient cluster.
  • To reduce costs, your predecessor should have purchased reserved instances for the EMR cluster.
A

Your predecessor should have configured the cluster as a transient cluster.

(With a transient cluster, the input data is loaded, processed, the output data is stored in S3 and the cluster is automatically terminated. Shutting-down the cluster automatically ensures that you are only billed for the time required to process your data.)

31
Q

Your EMR cluster requires high I/O performance at a low cost. In terms of storage, which of the following is your best option?

  • EBS volumes with PIOPS
  • Instance store volumes
  • EMRFS with consistent view
  • EMRFS
A

Instance store volumes

(D2 and I3 instance types provide you with various options in terms of the amount of instance storage. This instance storage can be used for HDFS if the I/O requirements of the EMR cluster are high.)