SQL on Hadoop and Spark Flashcards
SQL-on-Hadoop
Tools that combine the familiar SQL interface with the scalability and flexibility of big data processing frameworks
Why is SQL special?
Users can leverage SQL knowledge to interact with large datasets, without learning new paradigms like MapReduce or Spark
Batch SQL
SQL-like queries translated into MapReduce/Spark jobs
(load datasets in memory then query)
Batch SQL query tool examples (2)
Apache Hive, Spark SQL
Interactive SQL
Tools that enable low-latency, interactive querying (enable traditional BI and analytics)
Interactive SQL query tool examples (2)
Apache Impala, Apache Drill
Operational SQL
Tools supporting small, more frequent writes and queries with fast response times (OLTP)
Examples of OLTP workloads
insert, update, delete operations (small, more frequent queries)
Operational SQL query tool examples (2)
NoSQL stores, e.g. Apache HBase
Apache Hive
Provides a data warehouse-like abstraction over Hadoop, enabling SQL-like queries with HiveQL
What specifically does HiveQL do?
Translates SQL-like queries into MapReduce or Spark jobs
Key features of Apache Hive (3) (SMO)
- Schema on read (unlike a traditional DW, which enforces schema on write)
- Metastore that uses an RDBMS (on a single node)
- Organizes data into units
What are the 4 data units in Hive?
1. Databases (folder in HDFS)
2. Tables (set of files in an HDFS folder) - set of records with the same schema
Optional:
3. Partitions (split table records based on a column value) - faster queries, better for low-cardinality columns
4. Buckets (group table records into a fixed number of files) - better for high-cardinality columns
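A minimal sketch of how bucketing distributes records into a fixed number of files by hashing a high-cardinality column. The names (`bucket_for`, `NUM_BUCKETS`) and sample keys are illustrative, and Python's built-in `hash` stands in for Hive's own hash function:

```python
# Hive-style bucketing sketch: each record is assigned to one of a
# fixed number of buckets (files) by hashing the bucketing column.
NUM_BUCKETS = 4

def bucket_for(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Illustrative only: Hive uses its own hash function, not Python's.
    return hash(user_id) % num_buckets

records = ["u1001", "u1002", "u1003", "u1004", "u1005"]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for r in records:
    buckets[bucket_for(r)].append(r)
```

Each record lands in exactly one bucket, so a join or sample on the bucketing column only needs to touch the matching files.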
AWS version of Apache Hive
Amazon Athena, which allows querying data stored in S3
Spark SQL
Allows users to run SQL queries on data stored in Spark structures
Spark SQL Dataframe and Dataset
Essentially RDDs with a schema attached, to support relational (and procedural) processing
Query optimization
When you use SQL with spark, your SQL queries go through several optimization steps before being executed
Catalyst Optimizer
Generates optimized execution plans for SQL like queries, restructuring the query plan to make it more efficient
3 types of Catalyst optimizations:
- Predicate Pushdown (filters/where clauses)
- Column Pruning
- JVM Code Generation (generates bytecode to reduce the overhead of running queries)
Pro and Con of using SQL over Spark
Pro: Language simplicity (it is easier to optimize a SQL query than a user-defined function)
Con: Structure imposes limits; RDDs typically enable any computation through user-defined functions
Logical vs Physical plans for optimizing queries
Logical describes the computations; physical outlines which algorithms are used to conduct them
Constant Folding (Logical Optimization Rules)
Resolves constant expressions (non-variable, e.g. 2 + 3) at compile time instead of runtime
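A toy sketch of the idea, assuming a tiny expression tree rather than Catalyst's actual classes: constant subtrees are evaluated once before execution.

```python
# Constant folding sketch: an expression is an int (constant), a str
# (column name), or a ('+', left, right) tuple. '+' nodes whose
# children are both constants are collapsed at "compile time".
def fold(expr):
    if isinstance(expr, tuple):
        op, l, r = expr
        l, r = fold(l), fold(r)
        if isinstance(l, int) and isinstance(r, int):
            return l + r          # evaluated once, not per row
        return (op, l, r)
    return expr

# salary + (2 + 3)  becomes  salary + 5
plan = ("+", "salary", ("+", 2, 3))
print(fold(plan))  # → ('+', 'salary', 5)
```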
Predicate Pushdown (Logical Optimization Rules)
Push filter conditions as close to the data source as possible (e.g. WHERE department = ...), so that Spark reads only the rows that satisfy the filter instead of the full dataset
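A minimal sketch of the effect, with invented rows and function names: the predicate is applied at the scan, so non-matching rows never reach later stages.

```python
# Predicate pushdown sketch: the filter runs inside the scan, so only
# matching rows flow downstream. Data is invented for illustration.
rows = [
    {"name": "ana",  "department": "sales"},
    {"name": "bob",  "department": "eng"},
    {"name": "carl", "department": "sales"},
]

def scan_with_filter(source, predicate):
    # Non-matching rows are dropped at the source, not after a full read.
    return [row for row in source if predicate(row)]

sales = scan_with_filter(rows, lambda r: r["department"] == "sales")
print(len(sales))  # → 2
```

With a columnar format like Parquet, the same idea lets Spark skip entire file blocks whose statistics rule out a match.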
Column Pruning (Logical Optimization Stage Rules)
Select only necessary columns
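A small sketch of pruning, with rows as dicts and invented data; a real engine prunes at the file-format level (e.g. Parquet reads only the requested columns from disk).

```python
# Column pruning sketch: only the columns the query needs are kept.
rows = [
    {"name": "ana", "department": "sales", "salary": 90},
    {"name": "bob", "department": "eng",   "salary": 100},
]

def project(source, columns):
    # Keep only the requested columns from each row.
    return [{c: row[c] for c in columns} for row in source]

pruned = project(rows, ["name"])
print(pruned)  # → [{'name': 'ana'}, {'name': 'bob'}]
```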
Join Reordering (Logical Optimization Stage Rules)
Reorder joins to process the smallest tables first
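A toy sketch of the heuristic, assuming per-table row-count estimates are available; the table names and sizes are invented:

```python
# Join reordering sketch: given size estimates, join smallest tables
# first so intermediate results stay small.
table_sizes = {"orders": 10_000_000, "users": 50_000, "countries": 200}

def join_order(tables, sizes):
    # Simple heuristic: ascending estimated size. Real optimizers also
    # account for join selectivity, not just table size.
    return sorted(tables, key=lambda t: sizes[t])

print(join_order(["orders", "users", "countries"], table_sizes))
# → ['countries', 'users', 'orders']
```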
Spark can apply logical optimization rules both ____________ and ____________ until plan reaches a fixed point
recursively and iteratively
Physical Optimization (2)
Cost model, join methods
How does Spark SQL decide which physical plan to execute?
A cost model is used to select the best one
Ex:
Cost = a × cost(CPU) + (1 − a) × cost(I/O)
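The formula above can be sketched directly; the plan names and cost numbers here are invented, and `a` weights CPU cost against I/O cost:

```python
# Weighted cost model sketch: a in [0, 1] trades CPU against I/O.
def plan_cost(cpu_cost: float, io_cost: float, a: float = 0.5) -> float:
    return a * cpu_cost + (1 - a) * io_cost

# Compare two hypothetical physical plans and pick the cheaper one.
plans = {
    "broadcast_hash": plan_cost(cpu_cost=10, io_cost=2),   # 6.0
    "sort_merge":     plan_cost(cpu_cost=4,  io_cost=20),  # 12.0
}
best = min(plans, key=plans.get)
print(best)  # → 'broadcast_hash'
```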
Physical Optimization Join Methods (3)
- Broadcast hash (smaller table loaded into memory on each node that holds partitions of the larger table)
- Shuffle hash (both tables partitioned, and matching partition groups are sent to the same node; a hash table is built locally on each node from the smaller table's partition, held in memory, then the bigger table's partition is scanned for matches)
- Shuffle sort merge (partitions shuffled just as in shuffle hash, but instead of building a hash table, each partition is sorted by the join key; both sides are scanned and merged where join keys match)
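The broadcast hash join above can be sketched in a few lines: build an in-memory hash table from the small (broadcast) table, then stream the large table and probe it. The tables and keys are invented for illustration.

```python
# Broadcast hash join sketch.
small = [("eng", "Engineering"), ("sales", "Sales")]           # (dept_id, dept_name)
large = [("ana", "sales"), ("bob", "eng"), ("carl", "sales")]  # (name, dept_id)

# Build side: hash the broadcast (small) table by join key.
build = {dept_id: dept_name for dept_id, dept_name in small}

# Probe side: stream the large table and look up each join key.
joined = [(name, build[dept_id]) for name, dept_id in large if dept_id in build]
print(joined)
# → [('ana', 'Sales'), ('bob', 'Engineering'), ('carl', 'Sales')]
```

This avoids shuffling the large table at all, which is why it wins whenever one side fits in memory.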
4 Factors that determine the cost of physical optimization (NDCD)
Network throughput
Disk throughput
CPU cost
Data locality
Adaptive Query Execution (AQE)
Introduced in Spark 3.0; dynamically optimizes the execution plan based on runtime statistics (think of it as another layer on top of the Catalyst optimizer)
Things AQE can impact
- Adaptive number of shuffle partitions instead of a fixed number
- Switching the join type at runtime (e.g. sort merge to broadcast hash)
- Optimizing skewed joins (splitting oversized partitions)
- Dynamic partition pruning
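The first item can be sketched as coalescing: after a shuffle, adjacent small partitions are merged until each group reaches a target size. The partition sizes (in MB) and the target are invented for illustration.

```python
# AQE-style partition coalescing sketch: merge adjacent shuffle
# partitions until each merged group reaches the target size.
def coalesce(partition_sizes, target):
    groups, current, current_size = [], [], 0
    for size in partition_sizes:
        current.append(size)
        current_size += size
        if current_size >= target:
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)  # leftover partitions form the last group
    return groups

# Eight small shuffle partitions collapse into three larger groups.
print(coalesce([8, 5, 30, 2, 4, 6, 40, 3], target=20))
# → [[8, 5, 30], [2, 4, 6, 40], [3]]
```

Fewer, better-sized partitions mean fewer tasks and less per-task scheduling overhead.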