Hive Flashcards
What is Hive?
Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for processing and querying data.
Where is the default location of Hive’s data in HDFS?
/user/hive/warehouse
What is an External table?
An external table is a table whose data is stored outside of the Hive warehouse directory. Hive does not manage its storage, so dropping the table removes only the metadata and leaves the data files in place.
What is a managed table?
Managed tables are Hive-owned tables where the entire lifecycle of the table's data is managed and controlled by Hive.
What is a Hive partition? Give an example.
A Hive partition is a way to organize a large table into smaller logical parts based on the values of one or more columns. Say we have a US census table that contains zipcode, city, state, and other columns. Partitioning on state splits the table into around 50 partitions. A search for a zipcode within a state (state='CA' and zipCode='92704') then returns faster because Hive needs to scan only the state=CA partition directory.
What’s the benefit of partitioning?
It helps organize the data logically, and when we query the partitioned table with a filter on the partition column, Hive can skip all but the relevant subdirectories and files.
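The skipping behavior can be imitated in a few lines of Python. This is only a sketch of the idea, not how Hive is implemented: the rows are made-up sample data for the census example, and `partitions` and `find_zip` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical census rows: (zipcode, city, state).
rows = [
    ("92704", "Santa Ana", "CA"),
    ("90001", "Los Angeles", "CA"),
    ("10001", "New York", "NY"),
    ("73301", "Austin", "TX"),
]

# "Partition" the table: one group of rows per state value.
partitions = defaultdict(list)
for zipcode, city, state in rows:
    partitions[state].append((zipcode, city))


def find_zip(state, zipcode):
    """Scan only the partition for `state`; return (matches, rows_scanned)."""
    scanned = partitions.get(state, [])
    matches = [row for row in scanned if row[0] == zipcode]
    return matches, len(scanned)


matches, rows_scanned = find_zip("CA", "92704")
print(matches, "scanned", rows_scanned, "of", len(rows), "rows")
```

Only the rows under the CA key are examined; the NY and TX groups are never touched, which is the saving that partitioning is meant to provide.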
What does a partitioned table look like in HDFS?
Each partition is stored in its own directory under the table directory, named after the partition column and value (for example, state=CA).
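As a rough illustration, the directory names can be built like this. The sketch assumes a table named census in the default database and the default warehouse location; the table name and the `partition_dir` helper are hypothetical.

```python
from pathlib import PurePosixPath

# Default warehouse location; a cluster may be configured differently.
WAREHOUSE = PurePosixPath("/user/hive/warehouse")


def partition_dir(table, column, value):
    """Build the directory path for one partition of a table."""
    return WAREHOUSE / table / f"{column}={value}"


paths = [str(partition_dir("census", "state", s)) for s in ("CA", "NY", "TX")]
for path in paths:
    print(path)
```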
What is a Hive bucket?
Bucketing divides the data of a table or partition into a fixed number of more manageable files, based on the hash of a bucketing column. It is used for efficient querying.
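A minimal sketch of how rows might be assigned to buckets, assuming a hash-then-modulo scheme. Here `zlib.crc32` is only a stand-in (Hive uses its own hash function), and the bucket count and zipcodes are made-up values.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count for this sketch


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket number in the range [0, num_buckets)."""
    # crc32 is a stand-in hash; Hive uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


# One list per bucket, standing in for one file per bucket.
buckets = {b: [] for b in range(NUM_BUCKETS)}
for zipcode in ["92704", "90001", "10001", "73301", "60601"]:
    buckets[bucket_for(zipcode)].append(zipcode)

for number, members in buckets.items():
    print(number, members)
```

Because the same key always hashes to the same bucket, a lookup on the bucketing column only needs to read one bucket.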
What does it mean to have data skew and why does this matter when bucketing?
Data skew means that the data is unevenly distributed. This matters when bucketing because the buckets should be roughly equal in size; otherwise the tasks that process the oversized buckets take longer and slow down the whole query.
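The effect of skew on bucket sizes can be seen with a small experiment. This is a sketch under assumptions: `zlib.crc32` stands in for Hive's hash function, and both key lists are invented data.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # assumed bucket count for this sketch


def bucket_for(key):
    # crc32 is a stand-in hash; Hive uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % NUM_BUCKETS


def bucket_sizes(keys):
    """Return the number of rows that land in each bucket."""
    counts = Counter(bucket_for(k) for k in keys)
    return [counts.get(b, 0) for b in range(NUM_BUCKETS)]


# Hypothetical skewed column: 90 of 100 rows share one value.
skewed_keys = ["CA"] * 90 + ["NY"] * 4 + ["TX"] * 3 + ["WA"] * 3
# Hypothetical well-spread column: every row has a distinct value.
uniform_keys = [f"id-{i}" for i in range(100)]

skewed_sizes = bucket_sizes(skewed_keys)
uniform_sizes = bucket_sizes(uniform_keys)
print("skewed: ", skewed_sizes)
print("uniform:", uniform_sizes)
```

All 90 CA rows hash to the same bucket, so one bucket holds at least 90% of the data no matter how many buckets are configured. Bucketing on a column with many distinct values spreads the rows more evenly.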
What does a bucketed table look like in HDFS?
Each bucket is stored as a file inside the table (or partition) directory.
What is the Hive metastore?
The Hive metastore (HMS) is a service that stores the metadata, such as schemas and data locations, for both types of tables (managed and external).
What is beeline?
Beeline is a thin command-line client that uses the Hive JDBC driver to execute queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.
What does Cluster Computing refer to?
Cluster computing refers to a system that consists of two or more computers, often known as nodes. These nodes work together to execute applications and perform other tasks.
What is a Working Set?
A working set is the set of data you are actively working with, for example a dataset that is reused across several operations.
What does RDD stand for?
Resilient Distributed Dataset. An RDD is an immutable distributed collection of objects.
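To make "immutable" and "distributed" concrete, here is a toy model in plain Python. It is not Spark's API: `ToyRDD` is a hypothetical class, its partitions are ordinary tuples held in one process, and it leaves out the resilience (recomputing lost partitions) that a real RDD provides.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable and split into partitions."""

    partitions: Tuple[Tuple[int, ...], ...]

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never modifies this object; it builds a new one.
        return ToyRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def collect(self) -> list:
        # Gather the elements of every partition into one list.
        return [x for part in self.partitions for x in part]


numbers = ToyRDD(((1, 2), (3, 4)))      # two partitions
doubled = numbers.map(lambda x: x * 2)  # new object; `numbers` is unchanged
print(numbers.collect(), doubled.collect())
```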