Hadoop and Pig Flashcards

1
Q

What specific concept is HDFS designed to support?

A

HDFS is designed for high streaming read performance and follows the “write once, read many times” model: files are typically written once and read repeatedly. It does not support in-place updates to existing files.
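As an illustration of this streaming-read access pattern, here is a minimal sketch using the HDFS Java API (the file path and cluster configuration are assumptions, not from the source):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class StreamingReadSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);
          Path file = new Path("/data/events.log");      // hypothetical path
          // open() returns a sequential stream: HDFS is optimized for reading
          // a file from start to finish, not for random in-place updates.
          try (BufferedReader reader =
                   new BufferedReader(new InputStreamReader(fs.open(file)))) {
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          }
      }
  }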

2
Q

How does HDFS ensure fault tolerance in its data storage?

A

By splitting files into fixed-size blocks stored on distributed nodes called DataNodes, and replicating each block so the data survives individual node failures.

3
Q

How does Hadoop achieve its purpose?

A

It scales from a single server to thousands of machines, making it practical to process very large datasets in parallel. Rather than relying on hardware for reliability, it is designed to stay available even when individual machines fail, delivering dependable service on commodity hardware.

4
Q

What is the significance of heartbeats in HDFS, and how does the NameNode use them to manage DataNodes?

A

Heartbeats in HDFS are periodic signals sent by DataNodes to the NameNode to confirm that they are alive and functioning. The NameNode uses them to track the status of individual DataNodes and to manage the overall health of the cluster.
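The heartbeat cadence is configurable. A minimal sketch of the relevant hdfs-site.xml properties set through the Java Configuration API (the values shown are the usual defaults, stated here as an assumption):

  import org.apache.hadoop.conf.Configuration;

  public class HeartbeatConfigSketch {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.setLong("dfs.heartbeat.interval", 3);                         // DataNode heartbeat every 3 s
          conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300_000); // NameNode recheck, in ms
          // A DataNode with no heartbeat is declared dead after roughly
          // 2 * recheck-interval + 10 * heartbeat-interval (~10.5 minutes by default).
          System.out.println("heartbeat interval: " + conf.get("dfs.heartbeat.interval") + " s");
      }
  }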

5
Q

What happens if the NameNode doesn’t receive a response from a specific DataNode in HDFS?

A

The NameNode marks that DataNode as failed or faulty and re-replicates its blocks to other healthy DataNodes, preserving system integrity and fault tolerance.

6
Q

What is the role of the NameNode in the File System Namespace?

A

The NameNode manages and maintains the file system namespace, keeping track of the directory tree and file metadata such as permissions and the mapping of files to blocks

7
Q

What is the default size of data blocks in HDFS, and why is it designed to be much larger than the standard file block size?

A

The default size of a data block in HDFS is 128 MB, far larger than the 512-byte block size of a standard file system. The large size keeps the amount of block metadata the NameNode must track manageable and makes seek time a small fraction of transfer time; fault tolerance and availability are then provided by replicating these blocks.
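A minimal sketch (the 256 MB override is an arbitrary example) of inspecting and overriding the block size through the Java API; dfs.blocksize is the standard HDFS property:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockSizeSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Override the default 128 MB block size for files created with this conf.
          conf.setLong("dfs.blocksize", 256L * 1024 * 1024);  // 256 MB
          FileSystem fs = FileSystem.get(conf);
          // Report the default block size the cluster would use for a new file.
          System.out.println(fs.getDefaultBlockSize(new Path("/")) + " bytes");
      }
  }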

8
Q

What potential issue arises from using small blocks in HDFS, and what does this issue result in?

A

Small blocks mean a very large number of blocks per file, which causes considerable interaction between the NameNode and the DataNodes. This constant interaction creates overhead for the entire process.

9
Q

What are the two main components of an HDFS cluster, and what are their respective roles?

A

An HDFS cluster consists of a single NameNode, which manages the file system metadata, and multiple DataNodes, which store the actual data blocks

10
Q

How does a file get stored in an HDFS cluster, and what role does the NameNode play in this process?

A

A file is split into blocks that are stored across DataNodes; the NameNode records the mapping of each block to the DataNodes that hold it.

11
Q

How does HDFS handle data block storage, and what triggers communication between DataNodes and the NameNode?

A

Data is first written locally and accumulated into a block; when the current block is full, the client contacts the NameNode to learn which DataNode should store the next block.

12
Q

What is the role of the replication factor in HDFS, and when is it determined?

A

The replication factor, set at file-creation time, determines how many copies of each block are kept on different DataNodes for fault tolerance. If not specified, it defaults to three.
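A minimal sketch (the path, replication values, and written record are assumptions) of setting the replication factor at creation time and changing it afterwards via the Java API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ReplicationSketch {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          Path file = new Path("/data/report.csv");  // hypothetical path
          // Set the replication factor explicitly at creation time
          // (arguments: overwrite, buffer size, replication, block size).
          try (FSDataOutputStream out =
                   fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024)) {
              out.writeUTF("example record");
          }
          // The replication factor can also be changed after the fact.
          fs.setReplication(file, (short) 3);
      }
  }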

13
Q

How does Hadoop’s rack awareness enhance fault tolerance, and what’s a drawback associated with it?

A

By placing replicas on different racks, so that the loss of an entire rack does not destroy every copy of a block. The drawback is higher I/O cost, since blocks must be transferred across racks.

14
Q

What happens when a DataNode fails health checks in HDFS, and how does the system handle block replication?

A

The NameNode removes the failed DataNode from the write pipeline and re-replicates its blocks to other healthy DataNodes.

15
Q

What model does HDFS adopt and what is its single point of failure?

A

A master/slave model, in which the master (the NameNode) is a single point of failure

16
Q

Give a solution for recovering when a NameNode fails in HDFS

A

Use two NameNodes, one active and one on standby. The standby node takes over as the active node if the active NameNode fails
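A minimal sketch (the nameservice and host names are hypothetical) of the standard HDFS HA properties behind such an active/standby pair, expressed through the Java Configuration API:

  import org.apache.hadoop.conf.Configuration;

  public class HaConfigSketch {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.set("dfs.nameservices", "mycluster");                        // logical name for the pair
          conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");                // the two NameNodes
          conf.set("dfs.namenode.rpc-address.mycluster.nn1", "host1:8020"); // first candidate
          conf.set("dfs.namenode.rpc-address.mycluster.nn2", "host2:8020"); // second candidate
          conf.setBoolean("dfs.ha.automatic-failover.enabled", true);       // failover via ZooKeeper
          System.out.println(conf.get("dfs.nameservices"));
      }
  }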

17
Q

What is the performance trade-off in MapReduce

A

Increasing the number of computing nodes speeds up computation in a distributed system. However, the improvement comes at the cost of a larger volume of intermediate data that must be exchanged among the nodes.
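As a toy model of this trade-off (illustrative only, not from the source): suppose a job has total work $W$ divided evenly across $n$ nodes, and the volume of intermediate data exchanged grows roughly linearly with $n$ at cost $c$ per node. Then the running time is

  $$T(n) \approx \frac{W}{n} + c\,n$$

which is minimized at $n^{*} = \sqrt{W/c}$: up to that point extra nodes speed the job up, while beyond it the exchange term dominates and adding nodes slows the job down.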

18
Q

When not to use MapReduce

A

Small datasets

Real-time processing requirements

Continuous data streams

19
Q

Give a brief description of each of the Pig Components

A
  1. The Parser processes Pig Scripts, ensuring correct syntax, performing type checks, and generating a directed acyclic graph (DAG) to represent Pig Latin statements and their logical flow.
  2. The Optimizer refines the logical plan (DAG) by executing optimizations like projection and pushdown, enhancing the efficiency of subsequent processing.
  3. The Compiler translates the optimized logical plan into a series of MapReduce jobs, facilitating the execution of data processing tasks.
  4. The Execution Engine submits and manages the execution of MapReduce jobs on Hadoop, orchestrating the processing of data to produce the desired results.
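A minimal sketch (the file name and schema are assumptions) that drives this pipeline from Java via PigServer: each registered statement passes through the Parser, and the Optimizer, Compiler, and Execution Engine run when results are requested.

  import java.util.Iterator;
  import org.apache.pig.PigServer;
  import org.apache.pig.data.Tuple;

  public class PigPipelineSketch {
      public static void main(String[] args) throws Exception {
          PigServer pig = new PigServer("local");  // or "mapreduce" on a cluster
          // The Parser checks these statements and builds the logical plan (DAG).
          pig.registerQuery("logs = LOAD 'access.log' AS (user:chararray, bytes:long);");
          pig.registerQuery("grouped = GROUP logs BY user;");
          pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");
          // openIterator triggers the Optimizer, Compiler, and Execution Engine.
          Iterator<Tuple> it = pig.openIterator("totals");
          while (it.hasNext()) {
              System.out.println(it.next());
          }
      }
  }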
20
Q

What are some of the advantages of using Pig over MapReduce

A

High-level language

Ease of use

Significant reduction in code length

No need for compilation

21
Q

What does Pig’s infrastructure layer consist of, and how does it leverage existing large-scale parallel implementations?

A

Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, leveraging existing large-scale parallel implementations such as Hadoop.

22
Q

What is MapReduce

A

A programming paradigm for processing big datasets in a distributed fashion

23
Q

What are the steps in MapReduce?

A
  1. Prepare the input to Map by selecting a key
  2. Execute Map in each node and generate output based on another key
  3. Shuffle the output from Map to the Reduce nodes
  4. Execute Reduce
  5. Produce the final output
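A minimal sketch of these steps using the classic word-count example in Hadoop's Java MapReduce API (input and output paths are assumed to arrive as command-line arguments):

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
      // Steps 1-2: Map runs on each node and emits (word, 1) keyed by word.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Steps 3-4: after the shuffle, Reduce receives all counts for one word and sums them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) sum += val.get();
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // input to Map
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // step 5: final output
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }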
24
Q

What is the salient (key) property of Pig

A

The structure of Pig programs is amenable to substantial parallelization, which makes them well suited to handling very large datasets in a distributed computing environment.

25
Q

What is the application workflow in YARN

A
  1. The client submits an application
  2. The Resource Manager allocates a container to start the Application Master
  3. The Application Master registers itself with the Resource Manager
  4. The Application Master negotiates containers from the Resource Manager
  5. The Application Master notifies the Node Manager to launch the containers
  6. The application code is executed in the containers
  7. The client contacts the Resource Manager/Application Master to monitor the application’s status
  8. Once processing is complete, the Application Master unregisters with the Resource Manager
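A minimal client-side sketch of steps 1 and 7 using the YarnClient Java API (the application name, launch command, and container size are placeholder assumptions; steps 3 to 6 and 8 happen inside the Application Master, which this sketch does not implement):

  import java.util.Collections;
  import org.apache.hadoop.yarn.api.records.ApplicationId;
  import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
  import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.api.records.YarnApplicationState;
  import org.apache.hadoop.yarn.client.api.YarnClient;
  import org.apache.hadoop.yarn.client.api.YarnClientApplication;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;
  import org.apache.hadoop.yarn.util.Records;

  public class YarnSubmitSketch {
      public static void main(String[] args) throws Exception {
          // Step 1: the client connects to the Resource Manager and submits an application.
          YarnClient yarnClient = YarnClient.createYarnClient();
          yarnClient.init(new YarnConfiguration());
          yarnClient.start();

          YarnClientApplication app = yarnClient.createApplication();
          ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
          ctx.setApplicationName("demo-app");  // hypothetical name

          // Step 2: describe the container in which the Application Master will run.
          ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
          amContainer.setCommands(Collections.singletonList("/bin/sleep 30"));  // placeholder command
          ctx.setAMContainerSpec(amContainer);
          ctx.setResource(Resource.newInstance(256, 1));  // 256 MB, 1 vcore

          ApplicationId appId = ctx.getApplicationId();
          yarnClient.submitApplication(ctx);

          // Step 7: the client polls the Resource Manager for the application's status.
          YarnApplicationState state;
          do {
              Thread.sleep(1000);
              state = yarnClient.getApplicationReport(appId).getYarnApplicationState();
          } while (state != YarnApplicationState.FINISHED
                  && state != YarnApplicationState.FAILED
                  && state != YarnApplicationState.KILLED);
          yarnClient.stop();
      }
  }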