Week 7: HIVE and Apache SparkSQL Flashcards

1
Q

HIVE

A

It’s an open-source data warehousing solution built on top of Hadoop. It supports queries written in HiveQL, a language similar to SQL, which are compiled into MapReduce jobs and executed on Hadoop. HIVE acts as a wrapper that lets people familiar with SQL do big data processing in the context of data warehousing.

Pros:
1. An easy way to process large-scale data.
2. Supports SQL-based queries.
3. Provides user-defined interfaces to extend programmability.
4. Efficient execution plans for performance, and interoperability with other databases.

Cons:
1. No easy way to append data.
2. Files in HDFS are immutable.
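
Example code (a minimal sketch; the table and columns are hypothetical). An aggregation query like the one below is the kind of HiveQL that HIVE compiles into a MapReduce job and runs on Hadoop:
CREATE TABLE hits (ip STRING, url STRING, ts STRING);

SELECT url, COUNT(*) AS n_hits
FROM hits
GROUP BY url;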

2
Q

HIVE: Applications

A

Common applications include log processing, text mining, document indexing, customer-facing business intelligence (e.g. Google Analytics), predictive modelling, and hypothesis testing.

3
Q

HIVE Components: Client Components

A

Command Line Interface (CLI), the web UI, JDBC/ODBC driver.

4
Q

HIVE Components: Driver

A

The driver manages the lifecycle of a HiveQL statement as it moves through HIVE, maintains a session handle, and maintains session statistics.

5
Q

HIVE Components: Compiler

A

It compiles HiveQL into MapReduce tasks.

6
Q

HIVE Components: Optimiser

A

Optimises the tasks (improves the execution plan generated from the HiveQL statement).

7
Q

HIVE Components: Executor

A

It executes the tasks in the proper order and interacts with Hadoop.

8
Q

HIVE Components: Metastore

A

It serves as the system catalogue and stores metadata about tables, partitions, their locations in HDFS, etc. It runs on an RDBMS rather than on HDFS, as it needs very low latency.

9
Q

HIVE Components: Thrift Server

A

It provides an interface between the clients and the Metastore, allowing the clients to query or modify the information in the Metastore.

10
Q

Data Units of the Data Model: Table

A

Each table consists of a number of rows, and each row has a specified number of columns.

Example code:
CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>);

11
Q

Data Units of the Data Model: Partition

A

Partitions come from decomposing a table into partitions based on column values. Partitioning speeds up querying, as only the relevant data in Hadoop is scanned.

Example code:
CREATE TABLE test_part(c1 string, c2 int)
PARTITIONED BY (ds string, hr int);

ALTER TABLE test_part
ADD PARTITION(ds='2009-02-02', hr=11);

12
Q

Data Units of the Data Model: Bucket

A

Buckets come from decomposing a table into buckets based on a hash of a column. They are useful for sampling, especially when partitioning would produce partitions that are too numerous and too small.

Example code:
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;

Example code:
SELECT product_id, sum(price)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
GROUP BY product_id

13
Q

HIVE: Mapping Data Units

A

The following workflow is used to map data units into the HDFS name space:

  1. A table is stored in a directory in HDFS.
  2. A partition of the table is stored in a subdirectory within a table’s HDFS directory.
  3. A bucket is stored in a file within the partition’s or table’s directory depending on whether the table is a partitioned table or not.

Hive prunes the data by scanning only the required sub-directories tied to relevant partitions.

Hive uses the file corresponding to a bucket, making bucketing useful for sampling.
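
Example layout (a sketch, assuming the default warehouse directory /user/hive/warehouse and the test_part table defined earlier):
/user/hive/warehouse/test_part/ (table directory)
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11/ (partition subdirectory)
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11/000000_0 (a bucket file inside the partition directory)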

14
Q

HIVE Query Limitations

A
  1. HIVE doesn’t support inserting into an existing table or data partition; all inserts overwrite the existing data.
  2. Only equality predicates are supported in JOIN.

Otherwise, there are plenty of SQL commands supported.
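
Example code (a sketch illustrating both limitations; table and column names are hypothetical):
-- inserts always overwrite the target table or partition
INSERT OVERWRITE TABLE page_views
SELECT * FROM staging_page_views;

-- only equality predicates are allowed in the JOIN condition
SELECT * FROM t1 JOIN t2 ON (t1.key = t2.key);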

15
Q

HiveQL: Selecting a Database

A

USE database;

16
Q

HiveQL: Listing Databases

A

SHOW DATABASES;

17
Q

HiveQL: Listing Tables in a Database

A

SHOW TABLES;

18
Q

HiveQL: Describing the Format of a Table

A

DESCRIBE (FORMATTED|EXTENDED) table;

19
Q

HiveQL: Creating a Database

A

CREATE DATABASE db_name;

20
Q

HiveQL: Dropping a Database

A

DROP DATABASE db_name [CASCADE];

21
Q

HiveQL: Retrieving Information

A

SELECT from_columns FROM table WHERE conditions;

22
Q

HiveQL: All Values

A

SELECT * FROM table;

23
Q

HiveQL: Some Values

A

SELECT * FROM table WHERE rec_name = "Value";

24
Q

HiveQL: Multiple Criteria

A

SELECT * FROM table WHERE rec1 = "value" AND rec2 = "value2";

25
Q

HiveQL: Selecting Specific Columns

A

SELECT column_name FROM table;

26
Q

HiveQL: Retrieving Unique Output Records

A

SELECT DISTINCT column_name FROM table;

27
Q

HiveQL: Sorting

A

SELECT col1, col2 FROM table ORDER BY col2;

28
Q

HiveQL: Sorting Backwards

A

SELECT col1, col2 FROM table ORDER BY col2 DESC;

29
Q

HiveQL: Counting Rows

A

SELECT COUNT(*) FROM table;

30
Q

HiveQL: Grouping with Counting

A

SELECT owner, COUNT(*) FROM table GROUP BY owner;

31
Q

HiveQL: Maximum Value

A

SELECT MAX(col_name) AS label FROM table;

32
Q

HiveQL: Selecting From Multiple Tables

A

SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

33
Q

HIVE: JOIN Efficiency

A

In a common (reduce-side) join, the mappers send all rows with the same join key to a single reducer, and the reducer performs the join.

If many rows share the same key, efficiency drops.

To make JOIN operations more efficient, keep the smaller table in memory and join it with a chunk of the larger table at a time (a map-side join). Everything is then done in memory, without the reduce step required in the more common join scenario.

Example code:
set hive.auto.convert.join=true;
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

34
Q

Serializer/Deserializer (SerDe)

A

It describes how to load the data from a file into a representation that makes it look like a table. It’s implemented in Java, and there are several built-in serializers/deserializers.
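
Example code (a sketch; OpenCSVSerde is one of Hive's built-in SerDes, while the table itself is hypothetical):
CREATE TABLE csv_data (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;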

35
Q

Lazy SerDe

A

It doesn’t fully materialise an object until individual attributes are necessary. This reduces the overhead of creating unnecessary objects in Hive, which increases performance.

36
Q

SparkSQL

A

It’s a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. It allows Spark programmers to leverage the benefits of relational processing, and it lets SQL users call complex analytics libraries in Spark.

SparkSQL runs as a library on top of Spark, exposes interfaces accessible through JDBC and command-line, and exposes the DataFrame API which is accessible through different programming languages.

It tries to bridge the relational processing model with the native RDDs in Spark by using a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs, and a highly extensible optimiser, Catalyst.

Pros:
1. Has access to MapReduce-style procedural processing, a low-level programming model.
2. Integrated with SQL, a declarative language.
3. Great for ETL (Extract, Transform, and Load) to and from various semi or unstructured data sources, and advanced analytics that are hard to express in relational systems.

Cons:
1. MapReduce alone isn’t the best fit for data warehousing operations on Big Data.

37
Q

DataFrame

A

It’s a distributed collection of rows with the same schema, similar to a table in an RDBMS. It can be built from external data sources or RDDs, and supports relational operators (e.g. where, groupBy) as well as Spark operations. It’s evaluated lazily: each DataFrame object represents a logical plan to compute a dataset, but no execution occurs until the user calls an output operation. This enables optimisation.

DataFrames use special column operators (e.g. === for equality comparison), and can also be registered as temporary SQL tables and queried using SQL.

Example code:
ctx = new HiveContext()
users = ctx.table("users")
young = users.where(users("age") < 21)
println(young.count())
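
Continuing the sketch above (registerTempTable and sql are Spark 1.x-era APIs; the table name "young" is arbitrary):
young.registerTempTable("young")
ctx.sql("SELECT COUNT(*) FROM young")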

38
Q

SparkSQL: Data Model

A

SparkSQL uses a nested data model based on HIVE. It supports different data types, which allows for modelling data from HIVE, RDBMS, JSON, and native objects in Java/Python/Scala.

39
Q

Advantages of DataFrames over Relational Query Languages

A
  1. Code can be written in different languages and still benefit from optimisations across the whole logical plan.
  2. Programmers can use control structures (if, loops), as in the sketch below.
  3. The logical plan is analysed eagerly, although query results are computed lazily, so SparkSQL reports an error as soon as the user types an invalid line of code.
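
Example code (a sketch of point 2; the DataFrame users and the column names are hypothetical). A filter built up in an ordinary Scala loop still yields one logical plan that is optimised as a whole:
var filtered = users
for (col <- Seq("age", "income")) {
  filtered = filtered.where(filtered(col) > 0) // each iteration only extends the logical plan
}
println(filtered.count()) // optimisation and execution happen here
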
40
Q

SparkSQL: Querying Native Datasets

A

SparkSQL automatically infers the schema of the native objects of a programming language. This allows running relational operations on existing Spark programs, and combining RDDs with external structured data.
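
Example code (a sketch in Scala, assuming Spark 1.3-style APIs with an existing SparkContext sc and SQLContext ctx; the case class is hypothetical):
case class User(name: String, age: Int)
import ctx.implicits._
// the schema (name: string, age: int) is inferred from the case class by reflection
val users = sc.parallelize(Seq(User("ann", 30), User("bob", 17))).toDF()
users.where(users("age") < 21).show()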

41
Q

SparkSQL: Catalyst

A

It’s a query optimiser written in Scala. It uses Scala’s pattern-matching features to represent Abstract Syntax Trees (ASTs) and applies rules to manipulate them.

Example code:
tree.transform {
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}