SQL on Hadoop and Spark Flashcards
SQL-on-Hadoop
Tools that combine the familiar SQL interface with the scalability and flexibility of big data processing frameworks
Why is SQL special?
Users can leverage existing SQL knowledge to interact with large datasets without learning new paradigms like MapReduce or Spark
Batch SQL
SQL-like queries translated into MapReduce/Spark jobs
(load datasets into memory, then query)
Batch SQL query tool examples (2)
Apache Hive, Spark SQL
Interactive SQL
Tools that enable low-latency, interactive querying (enable traditional BI and analytics)
Interactive SQL query tool examples (2)
Apache Impala, Apache Drill
Operational SQL
Tools supporting small, more frequent writes and queries with fast response times (OLTP)
Examples of OLTP workloads
insert, update, delete operations (small, more frequent queries)
Operational SQL query tool examples (2)
NoSQL databases, Apache HBase
Apache Hive
Provides a data warehouse-like abstraction over Hadoop, enabling SQL-like queries with HiveQL
What specifically does HiveQL do?
Translates SQL-like queries into MapReduce or Spark jobs
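A minimal HiveQL sketch (the table and columns are hypothetical) showing the kind of query Hive compiles into MapReduce or Spark stages:

```sql
-- Hypothetical table: page_views(user_id, url, ts)
-- Hive translates this aggregation into one or more MapReduce/Spark jobs
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```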
Key features of Apache Hive (3) (SMO)
- Schema on read (unlike a traditional data warehouse)
- Metastore backed by an RDBMS (on a single node)
- Organizes data into units
What are the 4 data units in Hive?
1. Databases (folder in HDFS)
2. Tables (set of files in an HDFS folder) - set of records with the same schema
Optional:
3. Partitions (table records split by column value) - faster queries; best for low-cardinality columns
4. Buckets (table records grouped into a fixed number of files) - best for high-cardinality columns
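The four units can be seen in one DDL sketch (the `sales` table and its columns are hypothetical): the table lives in a database, is partitioned on a low-cardinality column, and bucketed on a high-cardinality one.

```sql
-- Hypothetical table; partition on low-cardinality `country`,
-- bucket on high-cardinality `customer_id`
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id STRING,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;
```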
AWS version of Apache Hive
Amazon Athena, which allows querying data stored in S3
Spark SQL
Allows users to run SQL queries on data stored in Spark structures
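A short Spark SQL sketch: it assumes a DataFrame has already been registered as a temporary view named `events` (e.g., via `df.createOrReplaceTempView("events")` in PySpark); the view name and column are hypothetical.

```sql
-- Runs against a registered temp view, not an HDFS table
SELECT event_type, COUNT(*) AS n
FROM events
GROUP BY event_type;
```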