Domain 3: Processing Flashcards

1
Q

Which S3 to EMR copy command provides compression?

A

S3DistCp
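As an illustration, here is a minimal sketch of how an S3DistCp copy with compression might be described as an EMR step. The helper name `s3distcp_step`, the bucket name, and the paths are hypothetical; the step runs `s3-dist-cp` through `command-runner.jar` and uses the `--src`, `--dest`, and `--outputCodec` options. The code only builds the step definition and does not contact AWS.

```python
def s3distcp_step(src, dest, codec="gzip"):
    """Build an EMR step definition that runs S3DistCp with output compression.

    src, dest: S3 or HDFS locations (hypothetical values in the example below).
    codec: compression codec applied to the copied files, e.g. "gzip".
    """
    return {
        "Name": "S3DistCp copy with compression",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                f"--src={src}",
                f"--dest={dest}",
                f"--outputCodec={codec}",
            ],
        },
    }


# Hypothetical bucket and HDFS path, for illustration only.
step = s3distcp_step("s3://example-bucket/logs/", "hdfs:///data/logs/")
print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape could then be submitted as a step to a running cluster, for example through an AWS SDK.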

2
Q

What is a DAG?

A

Directed Acyclic Graph: the execution plan Spark builds from a job's transformations. Scheduling work from the DAG, together with caching data in memory, gives Spark faster performance
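To make the idea concrete, here is a small sketch of a DAG of processing steps and one valid execution order for it. This is not Spark's scheduler; it uses Python's standard `graphlib` module (Python 3.9+), and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each step maps to the set of steps it depends on.
job = {
    "read_logs": set(),
    "read_users": set(),
    "filter_errors": {"read_logs"},
    "join": {"filter_errors", "read_users"},
    "write_report": {"join"},
}

# A topological order runs every step only after all of its dependencies.
# Because the graph is acyclic, such an order always exists; a cycle would
# raise graphlib.CycleError instead.
order = list(TopologicalSorter(job).static_order())
print(order)
```

Steps with no dependency on each other, such as `read_logs` and `read_users`, can appear in either order, which is also what lets a scheduler run them in parallel.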

3
Q

What is Hive?

A

Open-source data warehouse and analytics package that runs on a Hadoop cluster. It uses a SQL-like language called HiveQL.

4
Q

What is HBase?

A

Open-source, non-relational, distributed database that runs on Hadoop

5
Q

How can Hive and HBase interact?

A

HBase tables can be queried using HiveQL and joined with Hive tables

6
Q

What is HCatalog?

A

Tool that lets you access Hive metastore tables from Pig, Spark SQL, or custom MapReduce applications

7
Q

How are Glue crawlers charged?

A

By the minute

8
Q

Where is the Hive Metastore stored by default?

A

MySQL on the master node

9
Q

Where else can the Hive Metastore be stored?

A

AWS Glue Data Catalog or Amazon RDS

10
Q

What is Pig?

A

A scripting tool whose SQL-like language, Pig Latin, is compiled into MapReduce jobs

11
Q

What AWS service is very similar to HBase?

A

DynamoDB

12
Q

What is Splunk?

A

Collects and indexes data about the performance of an EMR cluster

13
Q

What tool can be used to browse and move data between HDFS and S3?

A

Hue

14
Q

What is Flume?

A

Streams log data from web servers (think log flumes)

15
Q

What is MXNet?

A

Deep learning tool

16
Q

What is S3DistCp?

A

Tool for copying large amounts of data between S3 and HDFS, in either direction

17
Q

Can S3DistCp be used across buckets and accounts?

A

Yes

18
Q

What is Ganglia?

A

Monitoring tool

19
Q

What is Mahout?

A

Machine learning library

20
Q

What is Accumulo?

A

NoSQL database

21
Q

What is Sqoop?

A

Relational database connector

22
Q

What is HCatalog?

A

Table and storage management for Hive metastore

23
Q

What is Tachyon?

A

Accelerator for Spark

24
Q

What is Derby?

A

Open source relational DB in Java

25
Q

What is Ranger?

A

Data security manager for Hadoop

26
Q

What is Kerberos?

A

Network authentication protocol, supported by EMR

27
Q

What is the maximum run time for a Lambda function?

A

900 seconds (15 minutes)

28
Q

What is the minimum batching time for AWS Glue ETL jobs?

A

5 minutes

29
Q

Is EMR ACID compliant?

A

No

30
Q

Are EMR Clusters permanent or ephemeral by default?

A

Ephemeral: a cluster terminates as soon as its steps complete unless told otherwise (with an -alive flag)

31
Q

What does EMR use to communicate with EMR Notebooks?

A

Livy