Week 7: HIVE and Apache SparkSQL Flashcards
HIVE
It’s an open-source data warehousing solution built on top of Hadoop. It supports queries written in HiveQL, a language similar to SQL, which are compiled into MapReduce jobs and executed using Hadoop. HIVE acts as a wrapper that lets people familiar with SQL do big data processing in a data warehousing context.
Pros:
1. An easy way to process large scale data.
2. Supports SQL-based queries.
3. Provides user-defined interfaces to extend programmability.
4. Efficient execution plans for performance.
5. Interoperability with other databases.
Cons:
1. No easy way to append data.
2. Files in HDFS are immutable.
HIVE: Applications
Common applications include: log processing, text mining, document indexing, customer-facing business intelligence (Google Analytics), predictive modelling, hypothesis testing
HIVE Components: Client Components
Command Line Interface (CLI), the web UI, JDBC/ODBC driver.
HIVE Components: Driver
The driver manages the lifecycle of a HiveQL statement as it moves through HIVE, maintains a session handle, and maintains session statistics.
HIVE Components: Compiler
It compiles HiveQL into MapReduce tasks.
HIVE Components: Optimiser
Optimises the execution plan produced by the compiler so that the HiveQL statement runs more efficiently.
HIVE Components: Executor
It executes the tasks in the proper order and interacts with Hadoop.
HIVE Components: Metastore
It serves as the system catalogue and stores information about tables, partitions, their locations in HDFS, etc. It runs on an RDBMS rather than in HDFS, as it needs very low latency.
HIVE Components: Thrift Server
It provides an interface between the clients and the Metastore, allowing the clients to query or modify the information in the Metastore.
Data Units of the Data Model: Table
Each table consists of a number of rows, and each row has a specified number of columns.
Example code:
CREATE TABLE t1(st string, fl float, li list<map<string, struct<p1:int, p2:int>>>);
Data Units of the Data Model: Partition
Partitions come from decomposing a table into partitions based on the values of particular columns. Partitioning speeds up querying, as only the relevant data in HDFS are scanned.
Example code:
CREATE TABLE test_part(c1 string, c2 int)
PARTITIONED BY (ds string, hr int);
ALTER TABLE test_part
ADD PARTITION(ds='2009-02-02', hr=11);
Data Units of the Data Model: Bucket
Buckets come from decomposing a table into buckets. They are useful for sampling, especially when partitioning would produce partitions that are too numerous and too small.
Example code:
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;
Example code:
SELECT product_id, sum(price)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
GROUP BY product_id;
HIVE: Mapping Data Units
The following workflow is used to map data units into the HDFS name space:
- A table is stored in a directory in HDFS.
- A partition of the table is stored in a subdirectory within a table’s HDFS directory.
- A bucket is stored in a file within the partition’s or table’s directory depending on whether the table is a partitioned table or not.
Hive prunes the data by scanning only the required sub-directories tied to relevant partitions.
Hive uses the file corresponding to a bucket, making bucketing useful for sampling.
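Example layout (a sketch assuming the default warehouse directory /user/hive/warehouse; actual paths depend on configuration), using the test_part and weblog tables defined above:
/user/hive/warehouse/test_part (directory for the table)
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11 (subdirectory for the added partition)
For the bucketed weblog table, each partition directory holds 96 files, one per bucket.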
HIVE Query Limitations
- HIVE doesn’t support inserting into an existing table or data partition; all inserts overwrite the existing data (see the example after this list).
- Only equality predicates are supported in JOIN.
Otherwise, there are plenty of SQL commands supported.
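For example, loading data into the test_part table defined earlier replaces whatever the target partition already contains (staging_table is a hypothetical source table):
Example code:
INSERT OVERWRITE TABLE test_part
PARTITION (ds='2009-02-02', hr=11)
SELECT c1, c2 FROM staging_table;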
HiveQL: Selecting a Database
USE database;
HiveQL: Listing Databases
SHOW DATABASES;
HiveQL: Listing Tables in a Database
SHOW TABLES;
HiveQL: Describing the Format of a Table
DESCRIBE (FORMATTED|EXTENDED) table;
HiveQL: Creating a Database
CREATE DATABASE db_name;
HiveQL: Dropping a Database
DROP DATABASE db_name (CASCADE);
HiveQL: Retrieving Information
SELECT from_columns FROM table WHERE conditions;
HiveQL: All Values
SELECT * FROM table;
HiveQL: Some Values
SELECT * FROM table WHERE rec_name = "Value";
HiveQL: Multiple Criteria
SELECT * FROM table WHERE rec1 = "value" AND rec2 = "value2";
HiveQL: Selecting Specific Columns
SELECT column_name FROM table;
HiveQL: Retrieving Unique Output Records
SELECT DISTINCT column_name FROM table;
HiveQL: Sorting
SELECT col1, col2 FROM table ORDER BY col2;
HiveQL: Sorting Backwards
SELECT col1, col2 FROM table ORDER BY col2 DESC;
HiveQL: Counting Rows
SELECT COUNT(*) FROM table;
HiveQL: Grouping with Counting
SELECT owner, COUNT(*) FROM table GROUP BY owner;
HiveQL: Maximum Value
SELECT MAX(col_name) AS label FROM table;
HiveQL: Selecting From Multiple Tables
SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);
HIVE: JOIN Efficiency
The mapper sends all rows with the same key to a single reducer, and the reducer does the join.
If many rows have the same key, then the efficiency drops.
To make JOIN operations more efficient, keep the smaller table's data in memory and join it with one chunk of the larger table's data at a time (a map-side join). Everything is then done in memory, without the reduce step required in the more common join scenario.
Example code:
set hive.auto.convert.join=true;
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
Serializer/Deserializer (SerDe)
It describes how to load the data from a file into a representation that makes it look like a table. It is implemented in Java, and several built-in serializers/deserializers are provided.
Lazy SerDe
Doesn’t fully materialise an object until its individual attributes are needed. This reduces the overhead of creating unnecessary objects in Hive and improves performance.
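As an illustrative sketch (the table and column names are hypothetical), a table can declare which SerDe interprets its files; LazySimpleSerDe is the built-in default used for delimited text:
Example code:
CREATE TABLE apache_log (host STRING, request STRING, status INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '\t')
STORED AS TEXTFILE;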
SparkSQL
It’s a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. It allows Spark programmers to leverage the benefits of relational processing, and it lets SQL users call complex analytics libraries in Spark.
SparkSQL runs as a library on top of Spark, exposes interfaces accessible through JDBC and the command line, and exposes the DataFrame API, which is accessible through different programming languages.
It tries to bridge the relational processing model with Spark’s native RDDs by means of a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs, and a highly extensible optimiser, Catalyst.
Pros:
1. Has access to low-level procedural processing (MapReduce-style programming).
2. Integrated with SQL, a declarative language.
3. Great for ETL (Extract, Transform, and Load) to and from various semi-structured or unstructured data sources, and for advanced analytics that are hard to express in relational systems.
Cons:
1. MapReduce-style processing isn’t the best fit for data warehousing operations on Big Data.
DataFrame
It’s a distributed collection of rows with the same schema, similar to a table in an RDBMS. It can be built from external data sources or RDDs, and supports relational operators (e.g. where, groupBy) as well as Spark operations. It’s evaluated lazily: each DataFrame object represents a logical plan to compute a dataset, but no execution occurs until the user calls an output operation. This enables optimisation.
DataFrames use column operators (e.g. === for equality), and can also be registered as temporary SQL tables and queried using SQL (see the second example below).
Example code:
// sc is an existing SparkContext
val ctx = new HiveContext(sc)
val users = ctx.table("users")               // DataFrame backed by the table "users"
val young = users.where(users("age") < 21)   // relational filter; evaluated lazily
println(young.count())                       // output operation triggers execution
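A DataFrame can also be registered as a temporary SQL table and queried with SQL; a minimal sketch continuing the example above (the temporary table name is an assumption):
Example code:
young.registerTempTable("young")                    // expose the DataFrame as a temporary SQL table
val counts = ctx.sql("SELECT COUNT(*) FROM young")  // run SQL against it
counts.show()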
SparkSQL: Data Model
SparkSQL uses a nested data model based on HIVE. It supports different data types, which allows modelling data from HIVE, RDBMSs, JSON, and native objects in Java/Python/Scala.
Advantages of DataFrames over Relational Query Languages
- Code can be broken into functions written in a full programming language and still benefit from optimisations across the whole logical plan.
- Programmers can use control structures (e.g. if statements, loops).
- The logical plan is analysed eagerly even though query results are computed lazily, so SparkSQL reports an error as soon as the user types an invalid line of code (see the sketch below).
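A small sketch of the eager-analysis behaviour, assuming the users table from the earlier example:
Example code:
val users = ctx.table("users")
// The logical plan is analysed immediately, so a misspelled column fails here
// with an AnalysisException, before any output operation has been called.
val bad = users.select("agee")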
SparkSQL: Querying Native Datasets
SparkSQL infers the schema of the native objects of a programming language automatically. This allows running relational operations on existing Spark programs and combining RDDs with external structured data.
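A minimal sketch of inferring a schema from native Scala objects (the case class and sample data are assumptions):
Example code:
case class User(name: String, age: Int)
val rdd = sc.parallelize(Seq(User("Alice", 25), User("Bob", 19)))
val df = ctx.createDataFrame(rdd)   // schema (name: string, age: int) inferred via reflection
df.where(df("age") < 21).show()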
SparkSQL: Catalyst
It’s a query optimiser written in Scala. It uses Scala’s pattern-matching features, represents queries as abstract syntax trees (ASTs), and applies rules to manipulate them.
Example code:
tree.transform {
  // constant folding: replace the addition of two literals with their sum
  case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
}