Domain 3: Processing Flashcards
Which S3-to-EMR copy command supports compression?
S3DistCp
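As a rough sketch of how such a copy might be submitted, the snippet below builds an EMR step definition that runs S3DistCp with gzip output compression. The bucket name, paths, and the helper name `s3distcp_step` are hypothetical. The `--src`, `--dest`, and `--outputCodec` options and `command-runner.jar` are assumed to behave as described in the EMR documentation for your release.

```python
def s3distcp_step(src, dest, codec="gzip"):
    """Build an EMR step dict that runs S3DistCp with output compression.

    Hypothetical helper: the returned dict follows the shape that boto3's
    EMR client accepts in add_job_flow_steps(Steps=[...]).
    """
    return {
        "Name": "S3DistCp copy with compression",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                f"--src={src}",
                f"--dest={dest}",
                f"--outputCodec={codec}",
            ],
        },
    }


# Placeholder bucket and HDFS path, for illustration only.
step = s3distcp_step("s3://example-bucket/logs/", "hdfs:///data/logs/")
print(" ".join(step["HadoopJarStep"]["Args"]))
```

The helper only assembles the step definition; submitting it to a cluster is left out so the sketch runs without AWS credentials.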
What is a DAG?
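Directed Acyclic Graph - the graph of stages and tasks that Spark builds from a job's transformations before running it. Planning the whole job as a DAG, together with caching data in memory, is why Spark is typically faster than MapReduce.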
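To make the "directed acyclic" part concrete, here is a small sketch that models a job as a dependency graph and derives a valid execution order. It uses Python's standard `graphlib` module, not Spark's scheduler, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a step; its value is the set of steps it depends on.
# Step names are hypothetical and only loosely mirror Spark operations.
dag = {
    "read_logs": set(),
    "read_users": set(),
    "filter_errors": {"read_logs"},
    "join": {"filter_errors", "read_users"},
    "count": {"join"},
}

# static_order() yields every step after all of its dependencies.
# It raises graphlib.CycleError if the graph contains a cycle,
# which is exactly what "acyclic" rules out.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the two read steps have no dependencies on each other, a scheduler is free to run them in parallel; the graph only fixes the order between dependent steps.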
What is Hive?
Open source data warehouse and analytics package that runs on a Hadoop cluster. Uses a SQL-like language called HiveQL.
What is HBase?
Open source, non-relational, distributed database on Hadoop
How can Hive and HBase interact?
HBase tables can be queried using HiveQL and joined with Hive-based tables
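One way this is commonly set up is by declaring an external Hive table backed by Hive's HBase storage handler. The sketch below only generates the HiveQL text. The helper name and the table, column family, and column names are hypothetical, and the exact DDL should be checked against the Hive version on your cluster.

```python
def hive_over_hbase_ddl(hive_table, hbase_table, column_family, column):
    """Return HiveQL that exposes an existing HBase table to Hive.

    Hypothetical helper: maps the HBase row key to `rowkey` and a single
    HBase column (column_family:column) to a Hive STRING column.
    """
    return (
        f"CREATE EXTERNAL TABLE {hive_table} (rowkey STRING, {column} STRING)\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        "WITH SERDEPROPERTIES "
        f"('hbase.columns.mapping' = ':key,{column_family}:{column}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Illustrative names only.
ddl = hive_over_hbase_ddl("orders_hbase", "orders", "cf", "amount")
print(ddl)

# Once declared, the table can be joined like any other Hive table.
# `customers` is a hypothetical Hive-managed table.
join_query = (
    "SELECT c.name, o.amount\n"
    "FROM customers c JOIN orders_hbase o ON c.id = o.rowkey;"
)
print(join_query)
```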
What is HCatalog?
Allows you to access Hive metastore tables within Pig, Spark SQL, or custom MapReduce applications
How are Glue crawlers charged?
By DPU-hour of run time, metered per second with a 10-minute minimum per crawl
Where is the Hive Metastore stored by default?
MySQL on the master node
Where else can the Hive Metastore be stored?
AWS Glue Data Catalog, or an external MySQL database on Amazon RDS or Aurora
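Both alternatives are typically selected through the `hive-site` configuration classification when the cluster is created. The sketch below shows what those configuration objects might look like. The helper name, hostname, and credentials are placeholders, and the property names should be verified against the EMR documentation for your release.

```python
# Option 1: use the AWS Glue Data Catalog as the Hive metastore.
glue_metastore = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class": (
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
            ),
        },
    }
]


def rds_metastore(host, user, password, db="hive"):
    """Option 2: point Hive at an external MySQL database (RDS or Aurora).

    Hypothetical helper; in practice the password should come from a
    secrets store rather than being passed around in plain text.
    """
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": (
                    f"jdbc:mysql://{host}:3306/{db}?createDatabaseIfNotExist=true"
                ),
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


# Placeholder values, for illustration only.
config = rds_metastore("metastore.example.internal", "hive_user", "change-me")
print(config[0]["Properties"]["javax.jdo.option.ConnectionURL"])
```

Keeping the metastore outside the cluster means table definitions survive when a transient cluster is terminated.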
What is Pig?
A high-level scripting tool whose language, Pig Latin, has a SQL-like syntax; scripts are compiled into MapReduce jobs
What AWS service is very similar to HBase?
DynamoDB
What is Splunk?
Collects and indexes machine data, such as performance data from an EMR cluster
What tool can be used to browse and move data between HDFS and S3?
Hue
What is Flume?
Collects and streams log data from sources such as web servers into Hadoop (think log flume)
What is MXNet?
Open source deep learning framework (Apache MXNet)