SQL on Hadoop and Spark Flashcards
SQL-on-Hadoop
Tools that combine the familiar SQL interface with the scalability and flexibility of big data processing frameworks
Why is SQL special?
Users can leverage SQL knowledge to interact with large datasets, without learning new paradigms like MapReduce or Spark
Batch SQL
SQL-like queries translated into MapReduce/Spark jobs
(load datasets in memory then query)
Batch SQL query tool examples (2)
Apache Hive, Spark SQL
Interactive SQL
Tools that enable low-latency, interactive querying (enable traditional BI and analytics)
Interactive SQL query tool examples (2)
Apache Impala, Apache Drill
Operational SQL
Tools supporting small, more frequent writes and queries with fast response times (OLTP)
Examples of OLTP workloads
insert, update, delete operations (small, more frequent queries)
Operational SQL query tool examples (2)
NoSQL stores, e.g. Apache HBase
Apache Hive
Provides a data warehouse-like abstraction over Hadoop, enabling SQL-like queries with HiveQL
What specifically does HiveQL do?
Translates SQL-like queries into MapReduce or Spark jobs
Key features of Apache Hive (3) (SMO)
- Schema on read (unlike a traditional DW, which enforces schema on write)
- Metastore that uses an RDBMS (on a single node)
- Organizes data into units
What are the 4 data units in Hive?
1. Databases (folder in HDFS)
2. Tables (set of files in an HDFS folder) - set of records with the same schema
Optional:
3. Partitions (split table records based on a column value) - faster queries, better for low-cardinality columns
4. Buckets (group table records into a fixed number of files) - better for high-cardinality columns
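A minimal sketch of how bucketing distributes records into a fixed number of files by hashing a high-cardinality column. The names (`bucket_for`, `NUM_BUCKETS`) and sample keys are illustrative, and Python's built-in `hash` stands in for Hive's own hash function:

```python
# Hive-style bucketing sketch: each record is assigned to one of a
# fixed number of buckets (files) by hashing the bucketing column.
NUM_BUCKETS = 4

def bucket_for(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Illustrative only: Hive uses its own hash function, not Python's.
    return hash(user_id) % num_buckets

records = ["u1001", "u1002", "u1003", "u1004", "u1005"]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for r in records:
    buckets[bucket_for(r)].append(r)
```

Each record lands in exactly one bucket, so a join or sample on the bucketing column only needs to touch the matching files.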
AWS version of Apache Hive
Amazon Athena, which allows querying data stored in S3
Spark SQL
Allows users to run SQL queries on data stored in Spark structures
Spark SQL Dataframe and Dataset
Essentially RDDs with a schema attached, to support relational (and procedural) processing
Query optimization
When you use SQL with spark, your SQL queries go through several optimization steps before being executed
Catalyst Optimizer
Generates optimized execution plans for SQL like queries, restructuring the query plan to make it more efficient
3 types of Catalyst optimizations:
- Predicate Pushdown (filters/where clauses)
- Column Pruning
- JVM Code Generation (generates bytecode to reduce the overhead of running queries)
Pro and Con of using SQL over Spark
Pro: Language simplicity (it is easier to optimize a SQL query than a user-defined function)
Con: Structure imposes limits; RDDs typically enable any computation through user-defined functions
Logical vs Physical plans for optimizing queries
Logical describes the computations; physical outlines which algorithms are used to conduct them
Constant Folding (Logical Optimization Rules)
Resolves constant expressions (non-variable, e.g. 2 + 3) at compile time instead of runtime
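A toy sketch of the idea, assuming a tiny expression tree rather than Catalyst's actual classes: constant subtrees are evaluated once before execution.

```python
# Constant folding sketch: an expression is an int (constant), a str
# (column name), or a ('+', left, right) tuple. '+' nodes whose
# children are both constants are collapsed at "compile time".
def fold(expr):
    if isinstance(expr, tuple):
        op, l, r = expr
        l, r = fold(l), fold(r)
        if isinstance(l, int) and isinstance(r, int):
            return l + r          # evaluated once, not per row
        return (op, l, r)
    return expr

# salary + (2 + 3)  becomes  salary + 5
plan = ("+", "salary", ("+", 2, 3))
print(fold(plan))  # → ('+', 'salary', 5)
```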
Predicate Pushdown (Logical Optimization Rules)
Push filter conditions as close to the data source as possible (e.g. WHERE department = ...), so that Spark reads only the rows that satisfy the filter instead of the full dataset
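A minimal sketch of the effect, with invented rows and function names: the predicate is applied at the scan, so non-matching rows never reach later stages.

```python
# Predicate pushdown sketch: the filter runs inside the scan, so only
# matching rows flow downstream. Data is invented for illustration.
rows = [
    {"name": "ana",  "department": "sales"},
    {"name": "bob",  "department": "eng"},
    {"name": "carl", "department": "sales"},
]

def scan_with_filter(source, predicate):
    # Non-matching rows are dropped at the source, not after a full read.
    return [row for row in source if predicate(row)]

sales = scan_with_filter(rows, lambda r: r["department"] == "sales")
print(len(sales))  # → 2
```

With a columnar format like Parquet, the same idea lets Spark skip entire file blocks whose statistics rule out a match.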
Column Pruning (Logical Optimization Stage Rules)
Select only necessary columns
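A small sketch of pruning, with rows as dicts and invented data; a real engine prunes at the file-format level (e.g. Parquet reads only the requested columns from disk).

```python
# Column pruning sketch: only the columns the query needs are kept.
rows = [
    {"name": "ana", "department": "sales", "salary": 90},
    {"name": "bob", "department": "eng",   "salary": 100},
]

def project(source, columns):
    # Keep only the requested columns from each row.
    return [{c: row[c] for c in columns} for row in source]

pruned = project(rows, ["name"])
print(pruned)  # → [{'name': 'ana'}, {'name': 'bob'}]
```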
Join Reordering (Logical Optimization Stage Rules)
Reorder joins to process the smallest tables first
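A toy sketch of the heuristic, assuming per-table row-count estimates are available; the table names and sizes are invented:

```python
# Join reordering sketch: given size estimates, join smallest tables
# first so intermediate results stay small.
table_sizes = {"orders": 10_000_000, "users": 50_000, "countries": 200}

def join_order(tables, sizes):
    # Simple heuristic: ascending estimated size. Real optimizers also
    # account for join selectivity, not just table size.
    return sorted(tables, key=lambda t: sizes[t])

print(join_order(["orders", "users", "countries"], table_sizes))
# → ['countries', 'users', 'orders']
```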
Spark can apply logical optimization rules both ____________ and ____________ until plan reaches a fixed point
recursively and iteratively
Physical Optimization (2)
Cost model, join methods
How does Spark SQL decide which physical plan to execute?
A cost model is used to select the best one
Ex:
Cost = a × cost(CPU) + (1 − a) × cost(I/O)
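The formula above can be sketched directly; the plan names and cost numbers here are invented, and `a` weights CPU cost against I/O cost:

```python
# Weighted cost model sketch: a in [0, 1] trades CPU against I/O.
def plan_cost(cpu_cost: float, io_cost: float, a: float = 0.5) -> float:
    return a * cpu_cost + (1 - a) * io_cost

# Compare two hypothetical physical plans and pick the cheaper one.
plans = {
    "broadcast_hash": plan_cost(cpu_cost=10, io_cost=2),   # 6.0
    "sort_merge":     plan_cost(cpu_cost=4,  io_cost=20),  # 12.0
}
best = min(plans, key=plans.get)
print(best)  # → 'broadcast_hash'
```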
Physical Optimization Join Methods (3)
- Broadcast hash (smaller table loaded into memory on each node that holds partitions of the larger table)
- Shuffle hash (both tables partitioned, and matching partition groups are sent to the same node; a hash table is built locally on each node from the smaller table's partition, held in memory, then the bigger table's partition is scanned for matches)
- Shuffle sort merge (partitions shuffled just as in shuffle hash, but instead of building a hash table, each partition is sorted by the join key; both sides are scanned and merged where join keys match)
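The broadcast hash join above can be sketched in a few lines: build an in-memory hash table from the small (broadcast) table, then stream the large table and probe it. The tables and keys are invented for illustration.

```python
# Broadcast hash join sketch.
small = [("eng", "Engineering"), ("sales", "Sales")]           # (dept_id, dept_name)
large = [("ana", "sales"), ("bob", "eng"), ("carl", "sales")]  # (name, dept_id)

# Build side: hash the broadcast (small) table by join key.
build = {dept_id: dept_name for dept_id, dept_name in small}

# Probe side: stream the large table and look up each join key.
joined = [(name, build[dept_id]) for name, dept_id in large if dept_id in build]
print(joined)
# → [('ana', 'Sales'), ('bob', 'Engineering'), ('carl', 'Sales')]
```

This avoids shuffling the large table at all, which is why it wins whenever one side fits in memory.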
4 Factors that determine the cost of physical optimization (NDCD)
Network throughput
Disk throughput
CPU cost
Data locality
Adaptive Query Execution (AQE)
Introduced in Spark 3.0; dynamically optimizes the execution plan based on runtime statistics (think of it as another layer on top of the Catalyst optimizer)
Things AQE can impact
- Adaptive number of shuffle partitions instead of a fixed number
- Switching the join type at runtime (e.g. sort merge to broadcast hash)
- Optimizing skewed joins (splitting oversized partitions)
- Dynamic partition pruning
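The first item can be sketched as coalescing: after a shuffle, adjacent small partitions are merged until each group reaches a target size. The partition sizes (in MB) and the target are invented for illustration.

```python
# AQE-style partition coalescing sketch: merge adjacent shuffle
# partitions until each merged group reaches the target size.
def coalesce(partition_sizes, target):
    groups, current, current_size = [], [], 0
    for size in partition_sizes:
        current.append(size)
        current_size += size
        if current_size >= target:
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)  # leftover partitions form the last group
    return groups

# Eight small shuffle partitions collapse into three larger groups.
print(coalesce([8, 5, 30, 2, 4, 6, 40, 3], target=20))
# → [[8, 5, 30], [2, 4, 6, 40], [3]]
```

Fewer, better-sized partitions mean fewer tasks and less per-task scheduling overhead.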