SQL on Hadoop and Spark Flashcards

1
Q

SQL-on-Hadoop

A

tools that combine familiar SQL interface with scalability and flexibility of big data processing frameworks

2
Q

Why is SQL special?

A

Users can leverage SQL knowledge to interact with large datasets, without learning new paradigms like MapReduce or Spark

3
Q

Batch SQL

A

SQL-like queries translated into MapReduce/Spark jobs

(load datasets in memory then query)

4
Q

Batch SQL query tool examples (2)

A

Apache Hive, Spark SQL

5
Q

Interactive SQL

A

Tools that enable low-latency, interactive querying (enable traditional BI and analytics)

6
Q

Interactive SQL query tool examples (2)

A

Apache Impala, Apache Drill

7
Q

Operational SQL

A

Tools supporting small, more frequent writes and queries with fast response times (OLTP)

8
Q

Examples of OLTP workloads

A

insert, update, delete operations (small, more frequent queries)

9
Q

Operational SQL query tool examples (2)

A

NoSQL, HBase

10
Q

Apache Hive

A

Provides a data warehouse-like abstraction over Hadoop, enabling SQL-like queries with HiveQL

11
Q

What specifically does HiveQL do?

A

Translate SQL-like queries into MapReduce or Spark jobs

12
Q

Key features of Apache Hive (3) (SMO)

A
  1. Schema on read (unlike a traditional data warehouse)
  2. Metastore held in an RDBMS (on a single node)
  3. Organizes data into units
13
Q

What are the 4 data units in Hive?

A
  1. Databases (folder in HDFS)
  2. Tables (set of files in an HDFS folder): a set of records with the same schema

Optional:
  3. Partitions (table records split by a column's value): faster queries, better for low-cardinality columns
  4. Buckets (table records grouped into a fixed number of files): better for high-cardinality columns

(See the DDL sketch below.)
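
A minimal HiveQL sketch (table, column, and bucket-count choices are hypothetical) of a table that is partitioned on a low-cardinality column and bucketed on a high-cardinality one. The DDL is held in a Python string here; it could be submitted through the Hive CLI or, subject to version support for Hive bucketed tables, through spark.sql() in a Hive-enabled SparkSession:

  # Hypothetical sales table: one HDFS folder per country (partition),
  # and order_id hashed into a fixed number of files (buckets).
  hive_ddl = """
  CREATE TABLE IF NOT EXISTS sales (
      order_id BIGINT,
      amount   DOUBLE
  )
  PARTITIONED BY (country STRING)
  CLUSTERED BY (order_id) INTO 32 BUCKETS
  STORED AS PARQUET
  """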

14
Q

AWS version of Apache Hive

A

Amazon Athena, which allows querying data stored in S3

15
Q

Spark SQL

A

Allows users to run SQL queries on data stored in Spark structures (DataFrames/Datasets)

16
Q

Spark SQL Dataframe and Dataset

A

Essentially RDDs with a schema attached to support relational (and procedural) processing
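
A minimal PySpark sketch (names and values are made up) of a DataFrame, i.e. rows plus a schema, queried through SQL via a temporary view:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # DataFrame with schema (name STRING, salary BIGINT).
  df = spark.createDataFrame([("alice", 100), ("bob", 80)], ["name", "salary"])

  # Register it as a view so plain SQL can be mixed with the procedural API.
  df.createOrReplaceTempView("employees")
  spark.sql("SELECT name FROM employees WHERE salary > 90").show()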

17
Q

Query optimization

A

When you use SQL with Spark, your SQL queries go through several optimization steps before being executed

18
Q

Catalyst Optimizer

A

Generates optimized execution plans for SQL-like queries, restructuring the query plan to make it more efficient

19
Q

3 key optimizations performed by the Catalyst Optimizer

A
  1. Predicate Pushdown (filters/where clauses)
  2. Column Pruning
  3. JVM Code Generation (bytecode to reduce overhead of running queries)
20
Q

Pro and Con of using SQL over Spark

A

Pro: Language simplicity (it is easier to optimize a SQL query than a user-defined function)

Con: Structure imposes limits; RDDs typically enable arbitrary computation through user-defined functions

21
Q

Logical vs Physical plans for optimizing queries

A

Logical describes the computations; physical outlines which algorithms are used to carry them out
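
A small sketch (toy DataFrame) of how to inspect both plan types; with extended=True, explain() prints the parsed, analyzed, and optimized logical plans followed by the physical plan:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.range(1000)  # toy DataFrame with a single `id` column

  # Logical plans describe *what* is computed; the physical plan shows *how*
  # (e.g. which scan and join algorithms are used).
  df.filter("id > 10").explain(extended=True)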

22
Q

Constant Folding (Logical Optimization Rules)

A

Resolves constant expressions (non-variable, e.g. 2 + 3) at compile time instead of runtime
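
A tiny sketch: in the optimized logical plan printed by explain(), the arithmetic below should already appear folded into the literal 86400:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Constant folding: 60 * 60 * 24 is evaluated once at planning time, so the
  # optimized logical plan shows the literal 86400.
  spark.sql("SELECT 60 * 60 * 24 AS seconds_per_day").explain(extended=True)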

23
Q

Predicate Pushdown (Logical Optimization Rules)

A

Push filter conditions as close to the data source as possible (e.g. WHERE department = ...) so that Spark reads only the rows that match the filter, not the full dataset
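
A minimal sketch (file path and column name are hypothetical) where a filter on a Parquet source shows up as PushedFilters in the scan node of the physical plan, so non-matching data is skipped at the source:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  emp = spark.read.parquet("/data/employees.parquet")  # hypothetical path

  # The FileScan node lists the condition under PushedFilters, meaning the
  # filter is applied while reading rather than afterwards.
  emp.filter(emp.department == "sales").explain()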

24
Q

Column Pruning (Logical Optimization Stage Rules)

A

Select only necessary columns
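
A companion sketch (same hypothetical Parquet file): selecting only two columns narrows the scan's ReadSchema, so the remaining columns are never read:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  emp = spark.read.parquet("/data/employees.parquet")  # hypothetical path

  # ReadSchema in the FileScan node contains only name and salary.
  emp.select("name", "salary").explain()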

25
Q

Join Reordering (Logical Optimization Stage Rules)

A

Reordering joins to process smallest tables first

26
Q

Spark can apply logical optimization rules both ____________ and ____________ until the plan reaches a fixed point

A

recursively and iteratively

27
Q

Physical Optimization (2)

A

Cost model, join methods

28
Q

How does Spark SQL decide which physical plan to execute?

A

A cost model is used to select the best plan

Ex:
Cost = a × cost(CPU) + (1 − a) × cost(IO)
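
For illustration only (numbers invented): with a = 0.7, cost(CPU) = 10 and cost(IO) = 50, a candidate plan would score 0.7 × 10 + 0.3 × 50 = 22; the plan with the lowest such cost is the one executed.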

29
Q

Physical Optimization Join Methods (3)

A
  1. Broadcast hash join (the smaller table is loaded into memory on every node that holds partitions of the larger table; sketch below)
  2. Shuffle hash join (both tables are partitioned and matching partitions are sent to the same node; a hash table is built in memory from the smaller table's partition, then the larger table's partition is scanned for matches)
  3. Shuffle sort-merge join (partitions are shuffled as in shuffle hash, but instead of building a hash table each partition is sorted by the join key; both sides are then scanned and merged where the join keys match)
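
A minimal sketch (toy tables) of forcing a broadcast hash join with the broadcast() hint; explain() should show a BroadcastHashJoin node instead of a sort-merge join:

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.getOrCreate()

  # Large fact-like table and a small dimension-like table (toy data).
  orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
  customers = spark.createDataFrame([(0, "alice"), (1, "bob")],
                                    ["customer_id", "name"])

  # The hint ships the small table to every executor, where a hash table is
  # built in memory; the large table is scanned locally without a shuffle.
  orders.join(broadcast(customers), "customer_id").explain()
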
30
Q

4 Factors that determine the cost of physical optimization (NDCD)

A

Network throughput
Disk throughput
CPU cost
Data locality

31
Q

Adaptive Query Execution (AQE)

A

Introduced in Spark 3.0; dynamically optimizes the execution plan based on runtime statistics (think of it as another layer on top of the Catalyst optimizer)

32
Q

Things AQE can impact

A

  1. Adaptive number of shuffle partitions instead of a fixed number
  2. Switching the join type at runtime
  3. Optimizing skewed joins
  4. Dynamic partition pruning

(See the configuration sketch below.)
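
A short configuration sketch; these keys exist in Spark 3.x, though the exact defaults depend on the release:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  spark.conf.set("spark.sql.adaptive.enabled", "true")                    # AQE itself
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") # adaptive shuffle partitions
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")           # skewed-join optimization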