Hive Flashcards
What is Hive?
Data warehouse system built on top of Hadoop for querying and analyzing large datasets.
It is not a relational database nor it is designed for online transaction processing.
You can access data via SQL-like queries (HiveQL)
What are the data units in Hive?
Databases
Tables
Partitions which speeds up the update and retrieval. Nested sub-directories in HDFS
Buckets
Does Hive support ACID transactions?
Yes but it has to be enabled globally.
What are Hive Components?
User Interface: UI to submit queries.
Server: Accepts requests from different clients and provides it to the Hive Driver.
Driver: Receives the queries.
Compiler: Parses the queries and generates an execution plan
Metastore: Attributes of tables like location in HDFS
Execution Engine: Executes the execution plan created by the compiler
Why is HDFS is not used by the Hive metastore?
The storage system for the metastore should be optimized for online transactions with random accesses and updates. A file system like HDFS is not suited since it is optimized for sequential scans and not for random access.
What is the difference between external table and managed table?
External Table: Hive is responsible for managing only table metadata; not the table data. While dropping the table, Hive just deletes the metadata information leaving the table data untouched.
Managed Table: Hive is responsible for managing the table data. While dropping the table, metadata and table data is deleted from the Hive warehouse.
When should we use SORT BY instead of ORDER BY?
Both do the same thing but SORT BY uses multiple reducers whereas ORDER BY uses a single reducer. Therefore SORT BY should be used to sort huge datasets.
What is the difference between a partition and a bucket in Hive?
Partition is like creating directories. It changes how Hive structures the data storage. Works best when the cardinality of the field is not high, otherwise you will end up having lots of directories.
Cardinality refers to the number of possible values a field can have. (For example country would have like 300 values but timestamp would have billions)
Bucket is like a sub-partition where you split the data into more manageable parts but it is helpful when a field has high cardinality.
What is dynamic partitioning?
Values of partitions are not known before hand
Where does the data of a Hive table gets stored?
Gets stored in an HDFS directory (/user/hive/warehouse) by default
Can Hive be used in OLTP systems?
Hive does not support row-level data insertion so it is not suitable for us in OLTP systems.
What is UDF in Hive?
It is a User Defined Function created with Java that allows to create custom functions