Hive Flashcards

1
Q

What is Hive?

A

Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for processing and querying data.

2
Q

Where is the default location of Hive’s data in HDFS?

A

/user/hive/warehouse

3
Q

What is an External table?

A

An external table is a table whose data is stored outside the Hive warehouse directory; Hive does not manage its storage, so dropping the table removes only its metadata and leaves the data in place.

4
Q

What is a managed table?

A

Managed tables are Hive-owned tables: the entire lifecycle of the table's data is managed and controlled by Hive.

5
Q

What is a Hive partition?

Example?

A

A Hive partition is a way to organize a large table into smaller logical tables based on the values of one or more columns. Say we have a US census table containing zipcode, city, state, and other columns. Partitioning on state splits the table into roughly 50 partitions, so a query that searches for a zipcode within a state (state='CA' AND zipcode='92704') runs faster: Hive only needs to scan the state=CA partition directory.

6
Q

What’s the benefit of partitioning?

A

It helps organize the data in a logical fashion, and when we query a partitioned table by its partition column, Hive can skip all but the relevant subdirectories and files.

7
Q

What does a partitioned table look like in HDFS?

A

Each partition is stored in a separate subdirectory of the table's directory.
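The directory layout is what makes partition pruning work. The sketch below is an illustration only, in plain Python: the paths and census schema are hypothetical, mirroring the example on the earlier card, and a list comprehension stands in for Hive's directory pruning.

```python
# Hypothetical HDFS paths for a census table partitioned by state.
CENSUS_FILES = [
    "/user/hive/warehouse/census/state=CA/part-00000",
    "/user/hive/warehouse/census/state=NY/part-00000",
    "/user/hive/warehouse/census/state=TX/part-00000",
]

def prune(paths, state):
    """Keep only files under the matching state= partition directory,
    skipping every other partition entirely."""
    return [p for p in paths if f"/state={state}/" in p]

# A query with WHERE state='CA' only has to scan this one directory.
scanned = prune(CENSUS_FILES, "CA")
```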

8
Q

What is a Hive bucket?

A

A Hive bucket divides the data into a more manageable form, resulting in multiple files: rows are assigned to a fixed number of buckets by hashing a chosen column. Bucketing is used for efficient querying (e.g., sampling and bucketed joins).

9
Q

What does it mean to have data skew and why does this matter when bucketing?

A

Data skew means you have an uneven distribution of data. This matters when bucketing because if the bucketing column does not split the rows evenly, some buckets end up much larger than others, and queries over those buckets take longer to compute.

10
Q

What does a bucketed table look like in HDFS?

A

Each bucket is stored as a file within the table's (or partition's) directory.
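Which file a row lands in is determined by a hash of the bucketing column. The sketch below is an illustration only: Hive's actual hash function differs, and a simple character-sum modulus stands in for it here.

```python
NUM_BUCKETS = 4  # fixed at table-creation time (CLUSTERED BY ... INTO 4 BUCKETS)

def bucket_for(zipcode: str) -> int:
    # Deterministic: the same key always lands in the same bucket file,
    # which is what makes bucketed joins and sampling possible.
    return sum(ord(c) for c in zipcode) % NUM_BUCKETS
```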

11
Q

What is the Hive metastore?

A

The Hive metastore (HMS) is a service that stores the metadata for both types of tables (managed and external).

12
Q

What is beeline?

A

Beeline is a thin client that uses the Hive JDBC driver but executes queries through HiveServer2, which allows multiple concurrent client connections and supports authentication.

13
Q

What does Cluster Computing refer to?

A

Cluster computing refers to a system of two or more computers, often called nodes, that work together to execute applications and perform other tasks.

14
Q

What is a Working Set?

A

A working set is the set of data an application actively works with, i.e., the data it reuses across multiple operations.

15
Q

What does RDD stand for?

A

Resilient Distributed Datasets. An RDD is an immutable, distributed collection of objects.

16
Q

What does it mean when we say an RDD is a collection of objects partitioned across a set of machines?

A

RDD partitions are referred to as "splits". Each partition in the RDD serves as the chunk of data for one task. The resilience of RDDs is enabled by their lineage: each RDD contains the instructions for recomputing itself from prior data, which lets Spark recompute individual partitions if they are lost. If our data processing reads from disk, applies three transformations (three maps), and then writes to disk, the RDD records that lineage. If some partitions are lost after the second transformation, Spark has enough information in the lineage to recompute them from the prior steps stored in memory (if they exist) or from disk.

17
Q

Why do we say that MapReduce has an acyclic data flow?

A

It completes each step of the process without ever going back to a previous step.

18
Q

Explain the deficiency in using Hive for interactive analysis on datasets. How does Spark alleviate this problem?

A

Hive has to run MapReduce jobs, reading from and writing to disk between steps. Spark can be up to 100x faster because it can keep the working set in memory and operate on it directly.

19
Q

What is the lineage of an RDD?

A

The lineage keeps track of all the transformations that have to be applied to produce that RDD, including the location from which it has to read the data.

20
Q

What are the 4 ways provided to construct an RDD?

A

1) Using the parallelize method on a Scala collection

2) Applying a transformation (which outputs the transformed RDD)

3) Reading a file (loading a file creates an RDD)

4) Persisting an RDD (persist returns a new RDD with the new persistence level)

21
Q

What does it mean to transform an RDD?

A

A transformation produces a new RDD from an existing RDD by applying a function to its data. Transformations are lazy: on their own they do not cause execution.
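The laziness and the lineage from the earlier card can be sketched together. The toy class below is an illustration only, not the Spark API: transformations merely record themselves in a lineage, and only an action (collect) replays that lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions execute."""

    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage  # recorded transformations, not yet run

    def map(self, fn):          # transformation: returns a new ToyRDD
        return ToyRDD(self._data, self.lineage + (("map", fn),))

    def filter(self, pred):     # transformation: also lazy
        return ToyRDD(self._data, self.lineage + (("filter", pred),))

    def collect(self):          # action: replay the recorded lineage now
        out = self._data
        for op, fn in self.lineage:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

# Nothing is computed until collect() is called.
rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
```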

22
Q

What does it mean to cache an RDD?

A

It means storing the RDD in memory on the worker nodes so it can be reused without being recomputed.

23
Q

What does it mean to perform a parallel operation on an RDD?

A

RDD operations are executed in parallel on each partition. Tasks are executed on the Worker Nodes where the data is stored.

24
Q

Why does Spark need special tools for shared variables, instead of just declaring, for instance, var counter=0?

A

Because of closures: each task receives its own copy of the variables referenced in its closure and updates that copy independently on its worker node, so the updates are never propagated back to the driver. A plain var counter = 0 would only ever be seen, and left unchanged, by the driver.
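The copy-per-task behavior can be mimicked in plain Python. This is an illustration only: ordinary function calls stand in for Spark's driver/executor split, and the names are hypothetical.

```python
# Driver-side variable; each "task" below gets its own copy of it.
driver_counter = 0

def run_task(partition, counter):
    # This increments a task-local copy only, never the driver's variable.
    counter += len(partition)
    return counter

task_results = [run_task(p, driver_counter) for p in [[1, 2], [3, 4, 5]]]
# driver_counter is still 0: every task incremented its own copy,
# which is why Spark needs accumulators for this pattern.
```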

25
Q

What is a broadcast variable?

A

Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors.

26
Q

What is an accumulator?

A

Accumulators are shared variables that are only "added" to through an associative and commutative operation. They are used to implement counters (similar to MapReduce counters) or sums.

27
Q

What are some transformations available on an RDD?

A

map, filter, reduceByKey
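The semantics of these three transformations can be sketched over plain Python lists of (key, value) pairs. This is an illustration of what each does, not the Spark API, and the data is hypothetical.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3)]

mapped   = [(k, v * 2) for k, v in pairs]        # map: apply a function to every element
filtered = [(k, v) for k, v in pairs if v > 1]   # filter: keep elements matching a predicate

# reduceByKey: merge all values for each key with an associative function (here, +)
sums = defaultdict(int)
for k, v in pairs:
    sums[k] += v
reduced = dict(sums)
```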

28
Q

What are some actions available on an RDD?

A

count, collect, take, top

29
Q

What is a shuffle in Spark?

A

A shuffle occurs when data is rearranged between partitions. It is required when a transformation needs information from other partitions, such as combining all the values for a key (e.g., reduceByKey).
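A minimal sketch of the idea, in plain Python: two input partitions are redistributed into two output partitions by hashing each key, so that every record for a given key lands in the same output partition. The hash and data are stand-ins; real Spark exchanges these records over the network.

```python
input_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
NUM_OUT = 2

shuffled = [[] for _ in range(NUM_OUT)]
for part in input_partitions:
    for k, v in part:
        # Route each record by a hash of its key (network exchange in real Spark).
        shuffled[sum(map(ord, k)) % NUM_OUT].append((k, v))
# After the shuffle, all values for "a" sit in one partition and can be reduced locally.
```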

30
Q

How can we see the lineage of an RDD?

A

Call toDebugString on the RDD, which returns a description of the RDD and its dependencies.