Hive Flashcards

1
Q

What is Hive?

A

Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for processing and querying data.

2
Q

Where is the default location of Hive’s data in HDFS?

A

/user/hive/warehouse

3
Q

What is an External table?

A

An external table is a table whose data is stored outside the Hive warehouse directory; Hive does not manage its storage, so dropping the table removes only its metadata and leaves the data in place.

4
Q

What is a managed table?

A

Managed tables are Hive-owned tables: the entire lifecycle of the table's data is managed and controlled by Hive.

5
Q

What is a Hive partition?

Example?

A

A Hive partition is a way to organize a large table into smaller logical tables based on the values of one or more columns. Say we have a US census table containing zipcode, city, state, and other columns. Partitioning on state splits the table into roughly 50 partitions, so a query that searches for a zipcode within a state (state='CA' AND zipcode='92704') runs faster: Hive only needs to scan the state=CA partition directory.

6
Q

What’s the benefit of partitioning?

A

It helps organize the data in a logical fashion, and when we query a partitioned table by its partition column, Hive can skip all but the relevant subdirectories and files.

7
Q

What does a partitioned table look like in HDFS?

A

Each partition is stored in a separate subdirectory of the table's directory.
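The directory layout is what makes partition pruning work. The sketch below is an illustration only, in plain Python: the paths and census schema are hypothetical, mirroring the example on the earlier card, and a list comprehension stands in for Hive's directory pruning.

```python
# Hypothetical HDFS paths for a census table partitioned by state.
CENSUS_FILES = [
    "/user/hive/warehouse/census/state=CA/part-00000",
    "/user/hive/warehouse/census/state=NY/part-00000",
    "/user/hive/warehouse/census/state=TX/part-00000",
]

def prune(paths, state):
    """Keep only files under the matching state= partition directory,
    skipping every other partition entirely."""
    return [p for p in paths if f"/state={state}/" in p]

# A query with WHERE state='CA' only has to scan this one directory.
scanned = prune(CENSUS_FILES, "CA")
```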

8
Q

What is a Hive bucket?

A

A Hive bucket divides the data into a more manageable form, resulting in multiple files: rows are assigned to a fixed number of buckets by hashing a chosen column. Bucketing is used for efficient querying (e.g., sampling and bucketed joins).

9
Q

What does it mean to have data skew and why does this matter when bucketing?

A

Data skew means you have an uneven distribution of data. This matters when bucketing because if the bucketing column does not split the rows evenly, some buckets end up much larger than others, and queries over those buckets take longer to compute.

10
Q

What does a bucketed table look like in HDFS?

A

Each bucket is stored as a file within the table's (or partition's) directory.
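Which file a row lands in is determined by a hash of the bucketing column. The sketch below is an illustration only: Hive's actual hash function differs, and a simple character-sum modulus stands in for it here.

```python
NUM_BUCKETS = 4  # fixed at table-creation time (CLUSTERED BY ... INTO 4 BUCKETS)

def bucket_for(zipcode: str) -> int:
    # Deterministic: the same key always lands in the same bucket file,
    # which is what makes bucketed joins and sampling possible.
    return sum(ord(c) for c in zipcode) % NUM_BUCKETS
```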

11
Q

What is the Hive metastore?

A

The Hive metastore (HMS) is a service that stores the metadata for both types of tables (managed and external).

12
Q

What is beeline?

A

Beeline is a thin client that uses the Hive JDBC driver but executes queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.

13
Q

What does Cluster Computing refer to?

A

Cluster computing refers to a system of two or more computers, often called nodes, that work together to execute applications and perform other tasks.

14
Q

What is a Working Set?

A

A working set is the set of data an application actively works with, i.e., the data it reuses across multiple operations.

15
Q

What does RDD stand for?

A

Resilient Distributed Datasets. An RDD is an immutable, distributed collection of objects.

16
Q

What does it mean when we say an RDD is a collection of objects partitioned across a set of machines?

A

RDD partitions are referred to as "splits". Each partition in the RDD serves as the chunk of data for one task. The resilience of RDDs is enabled by their lineage: each RDD contains the instructions for recomputing itself from prior data, which lets Spark recompute individual partitions if they are lost. If our data processing reads from disk, applies three transformations (three maps), and then writes to disk, the RDD records that lineage. If some partitions are lost after the second transformation, Spark has enough information in the lineage to recompute them from the prior steps stored in memory (if they exist) or from disk.

17
Q

Why do we say that MapReduce has an acyclic data flow?

A

It completes each step of the process without ever going back to a previous step.

18
Q

Explain the deficiency in using Hive for interactive analysis on datasets. How does Spark alleviate this problem?

A

Hive has to run MapReduce jobs, reading from and writing to disk between steps. Spark can be up to 100x faster because it can keep the working set in memory and operate on it directly.

19
Q

What is the lineage of an RDD?

A

The lineage keeps track of all the transformations that have to be applied to produce that RDD, including the location from which it has to read the data.

20
Q

What are the 4 ways provided to construct an RDD?

A

1) Using the parallelize method on a Scala collection

2) Applying a transformation (which outputs the transformed RDD)

3) Reading a file (loading a file creates an RDD)

4) Persisting an RDD (persist returns a new RDD with the new persistence level)

21
Q

What does it mean to transform an RDD?

A

A transformation produces a new RDD from an existing RDD by applying a function to its data. Transformations are lazy: on their own they do not cause execution.
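The laziness and the lineage from the earlier card can be sketched together. The toy class below is an illustration only, not the Spark API: transformations merely record themselves in a lineage, and only an action (collect) replays that lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions execute."""

    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage  # recorded transformations, not yet run

    def map(self, fn):          # transformation: returns a new ToyRDD
        return ToyRDD(self._data, self.lineage + (("map", fn),))

    def filter(self, pred):     # transformation: also lazy
        return ToyRDD(self._data, self.lineage + (("filter", pred),))

    def collect(self):          # action: replay the recorded lineage now
        out = self._data
        for op, fn in self.lineage:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

# Nothing is computed until collect() is called.
rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
```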

22
Q

What does it mean to cache an RDD?

A

It means storing the RDD in memory on the worker nodes so it can be reused without being recomputed.

23
Q

What does it mean to perform a parallel operation on an RDD?

A

RDD operations are executed in parallel on each partition. Tasks are executed on the Worker Nodes where the data is stored.

24
Q

Why does Spark need special tools for shared variables, instead of just declaring, for instance, var counter=0?

A

Because of closures: each task receives its own copy of the variables referenced in its closure and updates that copy independently on its worker node, so the updates are never propagated back to the driver. A plain var counter = 0 would only ever be seen, and left unchanged, by the driver.
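The copy-per-task behavior can be mimicked in plain Python. This is an illustration only: ordinary function calls stand in for Spark's driver/executor split, and the names are hypothetical.

```python
# Driver-side variable; each "task" below gets its own copy of it.
driver_counter = 0

def run_task(partition, counter):
    # This increments a task-local copy only, never the driver's variable.
    counter += len(partition)
    return counter

task_results = [run_task(p, driver_counter) for p in [[1, 2], [3, 4, 5]]]
# driver_counter is still 0: every task incremented its own copy,
# which is why Spark needs accumulators for this pattern.
```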

25
Q

What is a broadcast variable?

A

Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors.

26
Q

What is an accumulator?

A

Accumulators are shared variables that are only "added" to through an associative and commutative operation. They are used to implement counters (similar to MapReduce counters) or sums.

27
Q

What are some transformations available on an RDD?

A

map, filter, reduceByKey
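The semantics of these three transformations can be sketched over plain Python lists of (key, value) pairs. This is an illustration of what each does, not the Spark API, and the data is hypothetical.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3)]

mapped   = [(k, v * 2) for k, v in pairs]        # map: apply a function to every element
filtered = [(k, v) for k, v in pairs if v > 1]   # filter: keep elements matching a predicate

# reduceByKey: merge all values for each key with an associative function (here, +)
sums = defaultdict(int)
for k, v in pairs:
    sums[k] += v
reduced = dict(sums)
```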

28
Q

What are some actions available on an RDD?

A

count, collect, take, top

29
Q

What is a shuffle in Spark?

A

A shuffle occurs when data is rearranged between partitions. It is required when a transformation needs information from other partitions, such as combining all the values for a key (e.g., reduceByKey).
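A minimal sketch of the idea, in plain Python: two input partitions are redistributed into two output partitions by hashing each key, so that every record for a given key lands in the same output partition. The hash and data are stand-ins; real Spark exchanges these records over the network.

```python
input_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
NUM_OUT = 2

shuffled = [[] for _ in range(NUM_OUT)]
for part in input_partitions:
    for k, v in part:
        # Route each record by a hash of its key (network exchange in real Spark).
        shuffled[sum(map(ord, k)) % NUM_OUT].append((k, v))
# After the shuffle, all values for "a" sit in one partition and can be reduced locally.
```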

30
Q

How can we see the lineage of an RDD?

A

Call toDebugString on the RDD, which returns a description of the RDD and its dependencies.