Revature Hive Flashcards
Where is the default location of Hive’s data in HDFS?
By default, Hive stores its data in HDFS under the ‘/user/hive/warehouse’ directory. The location is controlled by the hive.metastore.warehouse.dir property and can be changed.
What is an External table?
An External table in Hive is a table whose data lives at a location you specify, often outside Hive’s default warehouse directory. Hive manages only the table’s metadata, so dropping the table removes the metadata and leaves the underlying files in place.
What is a managed table?
A Managed table (also known as an internal table) in Hive is one where both the metadata and the data are managed by Hive. When the table is dropped, both the metadata and the data are deleted.
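The difference between the two table types shows up most clearly on drop: a managed table loses its data, while an external table loses only its metadata. The sketch below is a toy model in Python, not Hive code. The `catalog` dictionary, the `files` set, the table names, and the `/data/raw/logs` path are all hypothetical stand-ins for the metastore and HDFS.

```python
def drop_table(catalog, files, name):
    """Remove a table's metadata; delete its data only if the table is managed."""
    table = catalog.pop(name)              # metadata always goes away
    if table["managed"]:
        files.discard(table["location"])   # managed: data is deleted too
    return catalog, files


# Hypothetical catalog: one managed table and one external table.
catalog = {
    "logs_managed": {"managed": True, "location": "/user/hive/warehouse/logs_managed"},
    "logs_external": {"managed": False, "location": "/data/raw/logs"},
}
# Hypothetical set of directories that currently exist in HDFS.
files = {"/user/hive/warehouse/logs_managed", "/data/raw/logs"}

drop_table(catalog, files, "logs_managed")
drop_table(catalog, files, "logs_external")

print(catalog)  # {}
print(files)    # {'/data/raw/logs'}
```

After both drops the catalog is empty, but the external table's directory is still present.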
What is a Hive partition?
A Hive partition is a way to divide a table into segments based on the value of a column, enabling more efficient querying by restricting the scan to relevant partitions.
Provide an example of a good column or set of columns to partition on.
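Good partition columns are ones that queries filter on often and that have a modest number of distinct values. Examples include ‘date’, ‘year’, or ‘month’ for time-based data, and ‘region’ or ‘department’ for geographical or organizational data.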
What’s the benefit of partitioning?
Partitioning improves query performance by reducing the amount of data read when a query filters on the partition column. It also makes data management easier, since data can be added, archived, or dropped one partition at a time.
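To make the "less data read" point concrete, here is a small Python sketch of the idea behind partition pruning. It is an illustration only, not how Hive is implemented. The partition names and sizes are made-up numbers, and `scanned_mb` is a hypothetical helper.

```python
# Hypothetical partitions of one table, mapped to their size in MB.
partitions = {
    "year=2023/month=12": 120,
    "year=2024/month=1": 95,
    "year=2024/month=2": 110,
}


def scanned_mb(partitions, wanted):
    """Sum the sizes of partitions matching every column filter in `wanted`."""
    total = 0
    for path, size in partitions.items():
        # Turn "year=2024/month=1" into {"year": "2024", "month": "1"}.
        spec = dict(level.split("=", 1) for level in path.split("/"))
        if all(spec.get(col) == val for col, val in wanted.items()):
            total += size
    return total


print(scanned_mb(partitions, {}))                # no filter, full scan: 325
print(scanned_mb(partitions, {"year": "2024"}))  # only 2024 partitions: 205
```

A filter on the partition column lets whole directories be skipped, which is where the savings come from.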
What does a partitioned table look like in HDFS?
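In HDFS, a partitioned table is a directory containing one subdirectory per partition value, typically named ‘column=value’ (for example, ‘year=2024’). With several partition columns the subdirectories are nested, and the data files for each partition sit in the innermost subdirectory.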
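The directory layout can be sketched in a few lines of Python. This is a minimal illustration that assumes the default warehouse root and the usual ‘column=value’ directory naming. The table name `sales`, its partition columns, and the `partition_path` helper are hypothetical.

```python
from posixpath import join

# Default warehouse root; a real cluster may configure a different one.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def partition_path(table, partition_spec, root=WAREHOUSE_ROOT):
    """Build a partition directory: one `column=value` level per partition column."""
    levels = [f"{col}={val}" for col, val in partition_spec]
    return join(root, table, *levels)


# Hypothetical table partitioned by year, then month.
print(partition_path("sales", [("year", 2024), ("month", 1)]))
# /user/hive/warehouse/sales/year=2024/month=1
```

The order of the partition columns determines the nesting order of the directories.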
What is a Hive bucket?
A Hive bucket is one of a fixed number of files (or groups) that a table’s data is divided into by applying a hash function to a column. Bucketing helps optimize specific operations such as joins on the bucketed column.
What does it mean to have data skew and why does this matter when bucketing?
Data skew occurs when certain values in a bucketed column are heavily concentrated, causing uneven data distribution across buckets. This can lead to performance issues as some buckets may contain more data than others.
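A quick way to see skew is to count how many rows land in each bucket. The Python sketch below is an illustration, not Hive's implementation. It assumes integer keys that hash to themselves and a bucket chosen by taking the hash modulo the bucket count. The real hash depends on the column type and Hive version, and the key lists are made-up data.

```python
from collections import Counter


def bucket_for(key, num_buckets):
    # Assumption: an integer key hashes to itself; the mask keeps it non-negative.
    return (key & 0x7FFFFFFF) % num_buckets


def rows_per_bucket(keys, num_buckets):
    """Count how many rows fall into each bucket, in bucket order."""
    counts = Counter(bucket_for(k, num_buckets) for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]


even_keys = list(range(1000))                # every key appears once
skewed_keys = [7] * 900 + list(range(100))   # key 7 dominates the column

print(rows_per_bucket(even_keys, 4))    # [250, 250, 250, 250]
print(rows_per_bucket(skewed_keys, 4))  # [25, 25, 25, 925]
```

With the skewed keys, one bucket holds most of the rows, so whatever task processes it does most of the work while the others finish early.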
What does a bucketed table look like in HDFS?
A bucketed table in HDFS has multiple files within the table’s directory (or within each partition directory, if the table is also partitioned), one per bucket. Each file holds the rows whose bucketing-column hash maps to that bucket.
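The mapping from rows to bucket files can be sketched in Python as follows. Treat this as a rough model under stated assumptions: integer keys hash to themselves, the bucket is the hash modulo the bucket count, and files are named with a zero-padded bucket number such as `000000_0`. Actual file names and hashing can differ by Hive version and execution engine. The `users` table and the helper names are hypothetical.

```python
def bucket_file(key, num_buckets):
    """Name of the file a row goes to, given its bucketing-column value."""
    bucket = (key & 0x7FFFFFFF) % num_buckets  # assumed hash: the integer itself
    return f"{bucket:06d}_0"                   # assumed naming, e.g. 000001_0


def layout(table_dir, keys, num_buckets):
    """Group keys by the bucket file they would be written to."""
    files = {}
    for key in keys:
        path = f"{table_dir}/{bucket_file(key, num_buckets)}"
        files.setdefault(path, []).append(key)
    return files


# Hypothetical table bucketed into 3 buckets on an integer id column.
files = layout("/user/hive/warehouse/users", [1, 2, 3, 4, 5, 6], 3)
for path in sorted(files):
    print(path, files[path])
# /user/hive/warehouse/users/000000_0 [3, 6]
# /user/hive/warehouse/users/000001_0 [1, 4]
# /user/hive/warehouse/users/000002_0 [2, 5]
```

Rows with the same bucketing-column value always land in the same file, which is the property bucketed joins rely on.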
What is the Hive metastore?
The Hive metastore is a centralized repository that stores metadata about Hive tables, such as schema, partitions, column types, and other attributes.
What is beeline?
Beeline is a command-line interface (CLI) tool for connecting to HiveServer2. It is used for running Hive queries and managing Hive sessions.
What is Hive?
Hive is a data warehouse tool built on top of Hadoop. It is used to query and analyze large data sets using HQL (Hive Query Language), a SQL-like language. Essentially, it is an abstraction over MapReduce: queries are translated into distributed jobs, so users do not have to write MapReduce programs by hand.