Domain 3: Processing Flashcards
Which S3 to EMR copy command provides compression?
S3DistCp
What is a DAG?
Directed Acyclic Graph - the execution plan Spark builds from a job's transformations; it lets Spark optimize stages and keep intermediate data in memory, which gives faster performance than MapReduce
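To make the DAG idea concrete, here is a minimal sketch in plain Python (not Spark itself). The stage names are hypothetical; the point is only that stages with dependencies and no cycles can always be put into a valid run order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
stages = {
    "read_logs": set(),
    "parse": {"read_logs"},
    "filter_errors": {"parse"},
    "count_by_host": {"filter_errors"},
    "write_report": {"count_by_host"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(stages)
print(order)
```

Because the graph is acyclic, every stage appears after everything it depends on. A scheduler such as Spark's relies on the same property to plan stages before running them.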
What is Hive?
Open source data warehouse and analytics package that runs on a Hadoop cluster. Uses a SQL-like language called HiveQL.
What is HBase?
Open source, non-relational, distributed database on Hadoop
How can Hive and HBase interact?
HBase tables can be queried using HiveQL and can be joined with Hive-based tables
What is HCatalog?
Allows you to access Hive metastore tables within Pig, Spark SQL, or custom MR applications
How are Glue crawlers charged?
By the minute
Where is the Hive Metastore stored by default?
MySQL on the master node
Where else can the Hive Metastore be stored?
Glue Catalog, RDS
What is Pig?
A high-level scripting tool with SQL-like syntax (Pig Latin) for writing MapReduce jobs
What AWS service is very similar to HBase?
DynamoDB
What is Splunk?
Collects and indexes data about the performance of an EMR cluster
What tool can be used to browse and move data between HDFS and S3?
Hue
What is Flume?
Streams log data from web servers (think log flume)
What is MXNet?
Deep learning tool
What is S3DistCp?
Tool for moving large amounts of data between S3 and HDFS, in either direction
Can S3DistCp be used across buckets and accounts?
Yes
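As a rough illustration of the S3DistCp cards, the sketch below builds an EMR step definition that runs `s3-dist-cp` through `command-runner.jar` with output compression. The bucket name, paths, and step name are hypothetical, and the function only builds the dictionary; it does not submit anything.

```python
def s3distcp_step(src, dest, codec="gzip"):
    """Build an EMR step definition that runs S3DistCp with output compression.

    src and dest are hypothetical locations supplied by the caller,
    for example an s3:// URI and an hdfs:/// path.
    """
    return {
        "Name": "S3DistCp copy",          # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                f"--src={src}",
                f"--dest={dest}",
                f"--outputCodec={codec}",  # compress the copied files
            ],
        },
    }


step = s3distcp_step("s3://example-bucket/logs/", "hdfs:///input/logs/")
print(step["HadoopJarStep"]["Args"])
```

A dictionary of this shape could be passed in the `Steps` list of a boto3 `add_job_flow_steps` call, assuming a cluster already exists.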
What is Ganglia?
Monitoring tool
What is Mahout?
Machine learning
What is Accumulo?
NoSQL database
What is Sqoop?
Relational database connector
What is HCatalog?
Table and storage management for Hive metastore
What is Tachyon?
Accelerator for Spark
What is Derby?
Open source relational DB in Java
What is Ranger?
Data security manager for Hadoop
What is Kerberos?
Network authentication protocol for EMR
What is the maximum run time for a Lambda function?
900 seconds (15 minutes)
What is the minimum batching time for AWS Glue ETL jobs?
5 minutes
Is EMR ACID compliant?
No
Are EMR Clusters permanent or ephemeral by default?
Ephemeral - they terminate as soon as their steps complete unless told otherwise (with the --alive flag)
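For the API route, the choice between an ephemeral and a long-running cluster comes down to one setting, `KeepJobFlowAliveWhenNoSteps`, in the `Instances` block of a `run_job_flow` request. The sketch below only builds the request arguments; the cluster name, release label, and instance types are assumptions chosen for illustration.

```python
def cluster_request(name, keep_alive=False):
    """Build keyword arguments for a boto3 EMR run_job_flow call.

    keep_alive=False describes an ephemeral cluster that terminates once
    its steps finish; keep_alive=True describes a long-running cluster.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",         # assumed release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


ephemeral = cluster_request("nightly-etl")
long_running = cluster_request("adhoc-analysis", keep_alive=True)
print(ephemeral["Instances"]["KeepJobFlowAliveWhenNoSteps"])
print(long_running["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("emr").run_job_flow(**ephemeral)`.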
What does EMR use to communicate with EMR Notebooks?
Livy
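Livy exposes Spark over a REST API, which is how a notebook can send code to the cluster. The sketch below builds (but does not send) the two basic requests: create a session, then run a statement in it. The host is an assumption; 8998 is Livy's default port.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumed host; 8998 is Livy's default port


def create_session_request(kind="pyspark"):
    """Build a POST /sessions request that asks Livy for a new interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(
        f"{LIVY_URL}/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_statement_request(session_id, code):
    """Build a POST /sessions/{id}/statements request that submits code to run."""
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


session_req = create_session_request()
statement_req = run_statement_request(0, "spark.range(10).count()")
print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it, assuming a Livy server is reachable at that address.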