Hadoop and Pig Flashcards
What specific concept is HDFS designed to support?
HDFS is designed to support high streaming read performance and follows the concept of “write once and read many times.” It doesn’t encourage frequent updates on files.
How does HDFS ensure fault tolerance in its data storage?
HDFS ensures fault tolerance by storing data in fixed-size blocks on distributed nodes called DataNodes and replicating those blocks to handle potential failures.
How does Hadoop achieve its purpose?
Hadoop scales from a single server to thousands of machines, making it practical to process very large datasets. It is designed to detect and handle failures so that the service stays available even when individual machines fail, delivering reliability on top of commodity hardware.
What is the significance of heartbeats in HDFS, and how does the NameNode use them to manage DataNodes?
Heartbeats in HDFS are signals sent by DataNodes to the NameNode. The NameNode uses heartbeats to track individual DataNodes and manage the overall health of the system.
What happens if the NameNode doesn’t receive a response from a specific DataNode in HDFS?
If the NameNode stops receiving heartbeats from a DataNode, it marks that DataNode as failed and re-replicates the blocks it held onto other DataNodes to maintain system integrity and fault tolerance.
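As a rough sketch (assuming the Hadoop client libraries are on the classpath; in a real cluster these values normally live in hdfs-site.xml), the heartbeat cadence and the NameNode's liveness check are controlled by two properties:

```java
import org.apache.hadoop.conf.Configuration;

public class HeartbeatTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // DataNodes send a heartbeat to the NameNode every 3 seconds by default.
        conf.setLong("dfs.heartbeat.interval", 3);

        // The NameNode re-checks DataNode liveness on this interval (milliseconds).
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000);

        // With the defaults, a silent DataNode is declared dead after roughly
        // 2 * recheck-interval + 10 * heartbeat interval (about 10.5 minutes).
        System.out.println("Heartbeat interval: "
                + conf.getLong("dfs.heartbeat.interval", 3) + " s");
    }
}
```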
What is the role of the NameNode in the File System Namespace?
The NameNode manages and maintains the FileSystem Namespace, keeping track of the directory tree, file metadata, and other namespace properties.
What is the default size of data blocks in HDFS, and why is it designed to be much larger than the standard file block size?
The default size of data blocks in HDFS is 128 MB, significantly larger than the standard file system block size of 512 bytes. The large block size keeps the number of blocks per file, and therefore the metadata the NameNode must track, small, which suits HDFS's streaming read pattern; fault tolerance and availability come from replicating those blocks.
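As a minimal sketch (assuming the Hadoop client libraries are on the classpath), the block size is exposed as the dfs.blocksize property; in a real cluster it is normally set in hdfs-site.xml rather than in code:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // dfs.blocksize controls the HDFS block size for newly written files.
        // 134217728 bytes = 128 MB, the default mentioned above.
        conf.setLong("dfs.blocksize", 134217728L);

        System.out.println("Configured block size: "
                + conf.getLong("dfs.blocksize", 134217728L) + " bytes");
    }
}
```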
What potential issue arises from using small blocks in HDFS, and what does this issue result in?
Using small blocks in HDFS produces a very large number of blocks, causing considerable interaction between the NameNode and DataNodes. This interaction creates overhead for the entire process.
What are the two main components of an HDFS cluster, and what are their respective roles?
An HDFS cluster includes a single NameNode (managing metadata) and multiple DataNodes (managing storage)
How does a file get stored in an HDFS cluster, and what role does the NameNode play in this process?
Files in HDFS are split into blocks, stored in DataNodes; the NameNode maps blocks to DataNodes.
How does HDFS handle data block storage, and what triggers communication between DataNodes and the NameNode?
During a write, data is first staged locally on the client; once a full block accumulates, the client contacts the NameNode to learn which DataNode should store the next block.
What is the role of the replication factor in HDFS, and when is it determined?
The replication factor, set during file creation, ensures fault tolerance by placing copies of each block on different DataNodes. If not specified, the default is three.
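A minimal sketch of setting the replication factor at file-creation time through the HDFS Java API; the path and data are hypothetical, and in practice the dfs.replication property supplies the default of three:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration.
        Path path = new Path("/user/example/data.txt");

        // Create the file with an explicit replication factor of 3
        // (the HDFS default) and a 128 MB block size.
        FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();

        // The replication factor of an existing file can also be changed later.
        fs.setReplication(path, (short) 2);
        fs.close();
    }
}
```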
How does Hadoop’s rack awareness enhance fault tolerance, and what’s a drawback associated with it?
Rack awareness improves fault tolerance by distributing replicas across racks, so the failure of an entire rack does not lose every copy of a block. The drawback is higher I/O cost, because blocks must be transferred between racks.
What happens when a DataNode fails health checks in HDFS, and how does the system handle block replication?
When a DataNode fails its health checks, the NameNode removes it from the write pipeline and re-replicates the affected blocks onto different DataNodes.
What model does HDFS adopt and what is its single point of failure?
HDFS adopts a master/slave model in which the master (the NameNode) is a single point of failure.
Give a solution to recovering data when a NameNode fails in HDFS
Use two NameNodes, one active and one standby. The standby node takes over as the active node if the active NameNode fails.
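A hedged sketch of the standard HDFS high-availability settings expressed through the Java Configuration API; the nameservice id, NameNode ids, and hostnames are hypothetical, and in a real deployment these properties live in hdfs-site.xml and core-site.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeHaSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Hypothetical nameservice with two NameNodes, nn1 (active) and nn2 (standby).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Automatic failover promotes the standby when the active NameNode dies;
        // a ZooKeeper quorum coordinates the election.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
                "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
    }
}
```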
What is the performance trade-off in MapReduce
Increasing the number of computing nodes speeds up the computation in a distributed system. However, the improvement comes at the cost of a larger volume of intermediate data that must be exchanged among the nodes.
When not to use MapReduce
- Small datasets
- Need to process data in real time
- Processing data streams
Give a brief description of each of the Pig Components
- The Parser processes Pig Scripts, ensuring correct syntax, performing type checks, and generating a directed acyclic graph (DAG) to represent Pig Latin statements and their logical flow.
- The Optimizer refines the logical plan (DAG) by executing optimizations like projection and pushdown, enhancing the efficiency of subsequent processing.
- The Compiler translates the optimized logical plan into a series of MapReduce jobs, facilitating the execution of data processing tasks.
- The Execution Engine submits and manages the execution of MapReduce jobs on Hadoop, orchestrating the processing of data to produce the desired results.
What are some of the advantages of using Pig over MapReduce
- High-level language
- Ease of use
- Significant reduction in code length (see the embedded-Pig sketch after this list)
- No need for compilation by the user; Pig translates scripts into MapReduce jobs internally
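To make the code-length point concrete, here is a hedged sketch of a word count written as Pig Latin statements submitted through Pig's Java PigServer API; the input and output paths are hypothetical. Compare its length with the full MapReduce job shown later in these cards.

```java
import org.apache.pig.PigServer;

public class PigWordCountSketch {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; "mapreduce" would target a Hadoop cluster instead.
        PigServer pig = new PigServer("local");

        // A handful of Pig Latin statements replace a hand-written MapReduce job.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Triggers compilation of the logical plan into MapReduce jobs and runs them.
        pig.store("counts", "wordcount-output");
        pig.shutdown();
    }
}
```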
What does Pig’s infrastructure layer consist of, and how does it leverage existing large-scale parallel implementations?
Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, leveraging existing large-scale parallel implementations such as Hadoop.
What is MapReduce
A programming paradigm for processing big datasets in a distributed fashion
What are steps in MapReduce
- Prepare the input to Map by selecting a key
- Execute Map in each node and generate output based on another key
- Shuffle the output from Map to the Reduce nodes
- Execute Reduce
- Produce the final output (a word-count sketch in Java follows this list)
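These steps map directly onto Hadoop's Java API. Below is the classic word-count job as a minimal sketch: the Mapper emits (word, 1) pairs keyed by word, the framework shuffles and groups them by key, and the Reducer sums the counts to produce the final output.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle groups values by word, sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```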
What is the salient (key) property of Pig
The structure of Pig programs is amenable to substantial parallelization, which makes them well suited to handling very large datasets in a distributed computing environment.
What is the application workflow in YARN
- Client submits an application
- The Resource Manager allocates a container to start the Application Master
- The Application Master registers itself with the Resource Manager
- The Application Master negotiates containers from the Resource Manager
- The Application Master notifies the Node Manager to launch containers
- Application code is executed in the containers
- The client contacts the Resource Manager/Application Master to monitor the application’s status
- Once processing is complete, the Application Master unregisters with the Resource Manager (a client-side submission sketch follows this list)
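A hedged sketch of the client side of this workflow using YARN's YarnClient API; the application name is hypothetical, and a real submission would also have to build the Application Master's ContainerLaunchContext (command, resources, environment) before calling submitApplication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The client talks to the Resource Manager through YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("example-app"); // hypothetical name
        // ... the Application Master launch context (command, resources, environment)
        // would be filled in here before a real submission ...

        // Submit; the Resource Manager allocates a container for the Application Master.
        ApplicationId appId = yarnClient.submitApplication(appContext);

        // The client polls the Resource Manager for the application's status.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println("Application state: " + report.getYarnApplicationState());

        yarnClient.stop();
    }
}
```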