Revature Hive Flashcards

1
Q

Where is the default location of Hive’s data in HDFS?

A

By default, Hive stores its data in HDFS under the ‘/user/hive/warehouse’ directory. The location is controlled by the hive.metastore.warehouse.dir property.

2
Q

What is an External table?

A

An External table in Hive is a table for which Hive manages only the metadata. The data stays at a location you specify, usually outside Hive’s default warehouse directory. Dropping the table removes the metadata but leaves the underlying files in place.

3
Q

What is a managed table?

A

A Managed table (also known as an internal table) in Hive is one where both the metadata and the data are managed by Hive. When the table is dropped, both the table and the data are deleted.

4
Q

What is a Hive partition?

A

A Hive partition is a way to divide a table into segments based on the value of a column, enabling more efficient querying by restricting the scan to relevant partitions.

5
Q

Provide an example of a good column or set of columns to partition on.

A

Good partition columns have relatively few distinct values and appear often in query filters. Examples include ‘year’ and ‘month’ (or ‘date’) for time-based data, ‘region’ for geographical data, or ‘department’.

6
Q

What’s the benefit of partitioning?

A

Partitioning helps in optimizing query performance by reducing the amount of data read during queries and also makes data management easier.
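
The effect on data read can be sketched in a few lines of Python. This is only an illustration of partition pruning, not how Hive implements it; the partition values and file names below are hypothetical.

```python
# Hypothetical catalog: (year, month) partition -> data files in that partition.
partitions = {
    ("2023", "12"): ["000000_0"],
    ("2024", "01"): ["000000_0", "000001_0"],
    ("2024", "02"): ["000000_0"],
}

def files_to_scan(partitions, year=None, month=None):
    """Return the (partition, file) pairs a query must read for the given filters."""
    selected = []
    for (p_year, p_month), files in sorted(partitions.items()):
        if year is not None and p_year != year:
            continue  # partition pruned: none of its files are read
        if month is not None and p_month != month:
            continue  # partition pruned: none of its files are read
        selected.extend(((p_year, p_month), f) for f in files)
    return selected

full_scan = files_to_scan(partitions)                             # no filter
pruned_scan = files_to_scan(partitions, year="2024", month="01")  # filter on partition columns
print(len(full_scan), len(pruned_scan))
```

A filter on the partition columns lets the query skip whole directories, which is where the saving comes from.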

7
Q

What does a partitioned table look like in HDFS?

A

In HDFS, a partitioned table is a directory containing one subdirectory per partition value, named in the form column=value (nested when there are several partition columns). The data files for each partition are stored in these subdirectories.
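
As a rough sketch of that layout, the Python below builds the directory path for one partition. It assumes a hypothetical table named sales in the default database, partitioned by year and month, under the default warehouse directory.

```python
from posixpath import join

WAREHOUSE = "/user/hive/warehouse"  # default warehouse directory

def partition_path(table, spec):
    """Build the directory for one partition.

    spec is an ordered list of (column, value) pairs; each pair becomes
    one nested column=value subdirectory under the table directory.
    """
    parts = [f"{col}={val}" for col, val in spec]
    return join(WAREHOUSE, table, *parts)

# 'sales', 'year', and 'month' are hypothetical names used for illustration.
path = partition_path("sales", [("year", "2024"), ("month", "01")])
print(path)  # /user/hive/warehouse/sales/year=2024/month=01
```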

8
Q

What is a Hive bucket?

A

A Hive bucket is a way of dividing data into manageable files or groups within a table using a hash function on a column. This helps to optimize queries for specific operations like joins.
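
The idea of "hash the column, then take it modulo the number of buckets" can be sketched as follows. This is a simplified illustration: it assumes an integer column whose value is used directly as its hash, whereas the hash function Hive actually applies depends on the column type and Hive version.

```python
def bucket_for(key_hash, num_buckets):
    """Map a hash value to a bucket index in the range [0, num_buckets)."""
    # Mask to a non-negative 31-bit value before taking the modulus.
    return (key_hash & 0x7FFFFFFF) % num_buckets

NUM_BUCKETS = 4
user_ids = [101, 102, 103, 104, 105, 106, 107, 108]  # hypothetical column values

buckets = {b: [] for b in range(NUM_BUCKETS)}
for uid in user_ids:
    # Assumption: the integer value itself serves as the hash.
    buckets[bucket_for(uid, NUM_BUCKETS)].append(uid)

print(buckets)
```

Because the same value always lands in the same bucket, two tables bucketed the same way on the join column can be joined bucket by bucket.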

9
Q

What does it mean to have data skew and why does this matter when bucketing?

A

Data skew occurs when certain values in a bucketed column are heavily concentrated. Because rows with the same value always hash to the same bucket, this causes uneven data distribution across buckets. Performance suffers because the tasks that process the oversized buckets take much longer than the rest.
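
A small Python sketch makes the imbalance visible. The country codes and row counts are made up, and zlib.crc32 is only a stand-in deterministic hash, not the function Hive uses.

```python
from collections import Counter
import zlib

def bucket_for(value, num_buckets):
    """Assign a string value to a bucket using a stand-in hash (crc32)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

# Hypothetical skewed column: 90% of the rows share a single value.
rows = ["US"] * 90 + ["CA"] * 5 + ["MX"] * 5

counts = Counter(bucket_for(country, 4) for country in rows)
largest = max(counts.values())
print(dict(counts), "largest bucket:", largest)
```

Whatever hash is used, all 90 "US" rows land in one bucket, so that bucket holds at least 90% of the data.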

10
Q

What does a bucketed table look like in HDFS?

A

A bucketed table in HDFS has one file per bucket within the table’s directory (or within each partition directory, if the table is also partitioned). Each file holds the rows whose bucketing-column hash maps to that bucket.

11
Q

What is the Hive metastore?

A

The Hive metastore is a centralized repository that stores metadata about Hive tables, such as schema, partitions, column types, and other attributes.

12
Q

What is beeline?

A

Beeline is a command-line interface (CLI) tool for connecting to HiveServer2. It is used for running Hive queries and managing Hive sessions.

13
Q

What is Hive?

A

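Hive is a data warehouse tool built on top of Hadoop. It is used to query and analyze large data sets using HQL (Hive Query Language), a SQL-like language. Essentially, it is an abstraction over MapReduce: queries written in HQL are translated into distributed jobs, so users do not have to write those programs by hand.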
