Week 7: HIVE and Apache SparkSQL Flashcards
HIVE
It’s an open-source data warehousing solution built on top of Hadoop. It supports queries written in HiveQL, an SQL-like language, which are compiled into MapReduce jobs and executed on Hadoop. HIVE acts as a wrapper that lets people familiar with SQL do big data processing in a data warehousing context.
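As a minimal sketch (the weblog table and its columns are taken from the bucketing example later in these notes), a HiveQL aggregation that Hive would compile into a MapReduce job:

Example code:
SELECT url, count(*) AS hits
FROM weblog
GROUP BY url;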
Pros:
1. An easy way to process large scale data.
2. Supports SQL-based queries.
3. Provides user-defined interfaces (such as UDFs) to extend programmability.
4. Efficient execution plans for performance, and interoperability with other databases.
Cons:
1. No easy way to append data.
2. Files in HDFS are immutable.
HIVE: Applications
Common applications include: log processing, text mining, document indexing, customer-facing business intelligence (e.g. Google Analytics), predictive modelling, hypothesis testing.
HIVE Components: Client Components
The Command Line Interface (CLI), the web UI, and the JDBC/ODBC drivers.
HIVE Components: Driver
The driver manages the lifecycle of a HiveQL statement as it moves through HIVE, maintains a session handle, and maintains session statistics.
HIVE Components: Compiler
It compiles HiveQL into MapReduce tasks.
HIVE Components: Optimiser
Optimises the execution plan produced by the compiler before it is run.
HIVE Components: Executor
It executes the tasks in the proper order and interacts with Hadoop.
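To inspect the task plan that the compiler and optimiser produce for a query, HiveQL’s EXPLAIN statement can be prefixed to it (using the sales table from the bucket sampling example below):

Example code:
EXPLAIN
SELECT product_id, sum(price)
FROM sales
GROUP BY product_id;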
HIVE Components: Metastore
It serves as the system catalogue and stores information about tables, partitions, their locations in HDFS, etc. It runs on an RDBMS rather than in HDFS, as it needs very low latency.
HIVE Components: Thrift Server
It provides an interface between the clients and the Metastore, allowing the clients to query or modify the information in the Metastore.
Data Units of the Data Model: Table
Each table consists of a number of rows, and each row has a specified number of columns.
Example code:
CREATE TABLE t1(st STRING, fl FLOAT, li ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>);
Data Units of the Data Model: Partition
Partitions come from decomposing a table into partitions based on the values of partition columns. Partitioning speeds up querying, as only the relevant data in Hadoop are scanned.
Example code:
CREATE TABLE test_part(c1 string, c2 int)
PARTITIONED BY (ds string, hr int);
ALTER TABLE test_part
ADD PARTITION(ds='2009-02-02', hr=11);
Data Units of the Data Model: Bucket
Buckets come from decomposing a table (or each partition) into buckets based on a column’s values. They are useful for sampling, especially when partitioning alone would produce partitions that are too numerous and too small.
Example code:
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;
Example code:
SELECT product_id, sum(price)
FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
GROUP BY product_id;
HIVE: Mapping Data Units
The following workflow is used to map data units into the HDFS name space:
- A table is stored in a directory in HDFS.
- A partition of the table is stored in a subdirectory within a table’s HDFS directory.
- A bucket is stored in a file within the partition’s directory (or the table’s directory, if the table is not partitioned).
Hive prunes the data by scanning only the required sub-directories tied to relevant partitions.
Hive uses the file corresponding to a bucket, making bucketing useful for sampling.
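For example, with the test_part table defined earlier, the partition ds='2009-02-02', hr=11 would typically map to a subdirectory such as <warehouse root>/test_part/ds=2009-02-02/hr=11 (the exact root depends on configuration). A query that restricts the partition columns then scans only that subdirectory:

Example code:
SELECT c1, c2
FROM test_part
WHERE ds='2009-02-02' AND hr=11;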
HIVE Query Limitations
- HIVE doesn’t support inserting into an existing table or data partition; all inserts overwrite the existing data.
- Only equality predicates are supported in JOIN.
Otherwise, there are plenty of SQL commands supported.
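A sketch of both points, reusing tables from the earlier examples (the staging table src, and the join columns on weblog and sales, are assumed for illustration): new data is written with INSERT OVERWRITE, which replaces the contents of the target table or partition, and joins must use equality predicates:

Example code:
-- overwrites the partition's existing data; src is a hypothetical staging table
INSERT OVERWRITE TABLE test_part PARTITION(ds='2009-02-02', hr=11)
SELECT c1, c2 FROM src;

-- only equality predicates are allowed in the join condition
SELECT w.url, s.product_id
FROM weblog w JOIN sales s
ON (w.user_id = s.user_id);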
HiveQL: Selecting a Database
USE database;
HiveQL: Listing Databases
SHOW DATABASES;