Revature Hadoop 2 Flashcards

1
Q

Explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]

A

The four type parameters are, in order: input key, input value, output key, output value. The Mapper reads (LongWritable, Text) pairs, typically the byte offset and contents of each input line, and emits intermediate (Text, IntWritable) pairs. The Reducer consumes those intermediate pairs, grouped by key, and emits the final (Text, IntWritable) results.
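
A minimal word-count sketch in Scala (the square-bracket generics above are Scala syntax) showing both signatures in use; the class names TokenMapper and SumReducer are illustrative:

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Mapper, Reducer}

    // Input key: byte offset of the line; input value: the line itself.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
          word.set(t)
          ctx.write(word, one)               // emit intermediate (Text, IntWritable)
        }
    }

    // Input types (Text, IntWritable) match the mapper's output types.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum)) // emit final (Text, IntWritable)
      }
    }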

2
Q

What needs to be true about the types contained in the above generics?

A

All four types must be Hadoop serializable types: values implement Writable, and keys implement WritableComparable so they can be sorted. In addition, the Reducer's input key/value types must match the Mapper's output key/value types (here, Text and IntWritable). See the driver sketch below.
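
A driver sketch (reusing the illustrative TokenMapper and SumReducer from card 1) showing where these types are declared; if the map output types differed from the final output types, they would need their own declarations:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(getClass)
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        // Final output types; these also stand in for the map output types
        // because mapper and reducer emit the same pair here.
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        // If they differed, the map output types would be set explicitly:
        // job.setMapOutputKeyClass(...); job.setMapOutputValueClass(...)
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }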

3
Q

What are the 3 Vs of big data?

A

Volume, Velocity, Variety.

4
Q

What are some examples of structured data? Unstructured data?

A

Structured: Relational databases, CSV files. Unstructured: Images, videos, emails.

5
Q

What is a daemon?

A

A background process that runs continuously to provide services or perform tasks (e.g., NameNode, DataNode in Hadoop).

6
Q

What is data locality and why is it important?

A

Data locality means moving the computation to the node (or at least the rack) where the data already resides, rather than shipping large amounts of data across the network to the computation. Minimizing network transfer significantly improves performance and efficiency.

7
Q

How many blocks will a 200MB file occupy in HDFS, assuming the default HDFS block size for Hadoop v2+?

A

2 blocks. The default block size in Hadoop v2+ is 128MB, so a 200MB file splits into one full 128MB block and one 72MB block.

8
Q

What is the default number of replicas for each block?

A

3 (the replication factor, set cluster-wide by dfs.replication in hdfs-site.xml).
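
Per-file replication can also be changed through the FileSystem API. A small sketch, with an illustrative path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // Raise or lower the replication factor for one file (path is illustrative).
    fs.setReplication(new Path("/data/example.txt"), 3.toShort)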

9
Q

How are these replicas typically distributed across the cluster? What is rack awareness?

A

Replicas are spread across nodes and racks for fault tolerance. Under the default placement policy, the first replica is written to the writer's local node, the second to a node on a different rack, and the third to a different node on that same remote rack. Rack awareness is the NameNode's knowledge of which DataNode lives in which rack, which lets it place replicas so the failure of an entire rack cannot destroy every copy of a block.
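
Rack awareness is supplied by the administrator, typically by pointing Hadoop at a topology script via net.topology.script.file.name (normally set in core-site.xml). A sketch using the Configuration API, with an assumed script path:

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // The script maps a host to a rack path such as "/dc1/rack1";
    // the file location below is an assumption for illustration.
    conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh")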

10
Q

What is the job of the NameNode? What about the DataNode?

A

The NameNode manages the filesystem namespace and metadata: the directory tree, file-to-block mapping, and block locations. DataNodes store the actual data blocks and serve read/write requests from clients.
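
This split is visible from the client API: block metadata (offsets, hosts) comes from the NameNode, while the bytes themselves are read from DataNodes. A sketch with an illustrative path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/data/example.txt"))
    // The NameNode answers this metadata query; no block data moves.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
      println(s"offset=${b.getOffset} hosts=${b.getHosts.mkString(",")}")
    }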

11
Q

How many NameNodes exist on a cluster?

A

Typically one active NameNode per cluster. High-availability setups add a standby NameNode, and a federated cluster runs several independent NameNodes, each owning a slice of the namespace.

12
Q

How are DataNodes fault tolerant?

A

Each block is replicated across multiple DataNodes (3 by default). DataNodes send regular heartbeats to the NameNode; when one stops responding, the NameNode re-replicates its blocks from the surviving copies.

13
Q

How does a Standby NameNode make the NameNode fault tolerant?

A

It keeps a synchronized copy of metadata from the active NameNode and can take over if the active NameNode fails.

14
Q

What purpose does a Secondary NameNode serve?

A

It periodically merges the NameNode's fsimage with its edit log, producing a fresh checkpoint and keeping the edit log small, which reduces NameNode restart time. Despite the name, it is not a standby and cannot take over on failure.

15
Q

How might we scale an HDFS cluster past a few thousand machines?

A

HDFS Federation: multiple independent NameNodes, each managing a portion of the namespace, let a cluster scale to tens of thousands of machines.

16
Q

In a typical Hadoop cluster, what’s the relationship between HDFS data nodes and YARN node managers?

A

HDFS DataNodes and YARN NodeManagers typically coexist on the same nodes, enabling data locality for optimized processing.

17
Q

When does the combine phase run, and where does each combine task run?

A

The combine phase, when a combiner is configured, runs after the map phase; each combine task runs on the same node as the map task whose output it aggregates, shrinking the data before it is shuffled to the reducers. Hadoop may invoke the combiner zero or more times per map task.
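
In the driver, the combiner is opt-in. Because word-count's summing is associative and commutative, the reducer class itself can double as the combiner (SumReducer is the illustrative class from card 1):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance(new Configuration(), "word count")
    // The combiner may run zero or more times per map task, so it must
    // not change the final result when applied repeatedly.
    job.setCombinerClass(classOf[SumReducer])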

18
Q

Know the input and output of the shuffle + sort phase.

A

Input: the intermediate key-value pairs emitted by the mappers. Output: those pairs partitioned, sorted, and grouped by key, so each reducer receives (key, list of all values for that key).

19
Q

What does the NodeManager do?

A

Manages the resources (CPU, memory) of a single worker node, launches and monitors containers for task execution, and reports node status to the ResourceManager.

20
Q

What about the ResourceManager?

A

Arbitrates resources across the whole cluster and schedules applications onto it. It has two main components: the Scheduler and the ApplicationsManager.

21
Q

Which responsibilities does the Scheduler have?

A

Allocates cluster resources (containers) to running applications according to a policy such as FIFO, Capacity, or Fair scheduling. It is a pure scheduler: it does not monitor applications or restart failed tasks.

22
Q

What about the ApplicationsManager?

A

Accepts job submissions, negotiates the first container in which each application's ApplicationMaster runs, and restarts the ApplicationMaster container on failure.

23
Q

What is an ApplicationMaster? How many of them are there per job?

A

The ApplicationMaster manages the lifecycle of a single application (job): it negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor tasks. There is exactly one ApplicationMaster per job.