Revature Hadoop 2 Flashcards

1
Q

Explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]

A

The four type parameters are, in order: input key, input value, output key, output value. The Mapper reads (LongWritable, Text) pairs, typically the byte offset and contents of each input line, and emits intermediate (Text, IntWritable) pairs. The Reducer consumes those intermediate pairs, grouped by key, and emits the final (Text, IntWritable) results.
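
A minimal word-count sketch in Scala (the square-bracket generics above are Scala syntax) showing both signatures in use; the class names TokenMapper and SumReducer are illustrative:

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Mapper, Reducer}

    // Input key: byte offset of the line; input value: the line itself.
    class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
          word.set(t)
          ctx.write(word, one)               // emit intermediate (Text, IntWritable)
        }
    }

    // Input types (Text, IntWritable) match the mapper's output types.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum)) // emit final (Text, IntWritable)
      }
    }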

2
Q

What needs to be true about the types contained in the above generics?

A

All four types must be Hadoop serializable types: values implement Writable, and keys implement WritableComparable so they can be sorted. In addition, the Reducer's input key/value types must match the Mapper's output key/value types (here, Text and IntWritable). See the driver sketch below.
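
A driver sketch (reusing the illustrative TokenMapper and SumReducer from card 1) showing where these types are declared; if the map output types differed from the final output types, they would need their own declarations:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(getClass)
        job.setMapperClass(classOf[TokenMapper])
        job.setReducerClass(classOf[SumReducer])
        // Final output types; these also stand in for the map output types
        // because mapper and reducer emit the same pair here.
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        // If they differed, the map output types would be set explicitly:
        // job.setMapOutputKeyClass(...); job.setMapOutputValueClass(...)
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }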

3
Q

What are the 3 Vs of big data?

A

Volume, Velocity, Variety.

4
Q

What are some examples of structured data? Unstructured data?

A

Structured: Relational databases, CSV files. Unstructured: Images, videos, emails.

5
Q

What is a daemon?

A

A background process that runs continuously to provide services or perform tasks (e.g., NameNode, DataNode in Hadoop).

6
Q

What is data locality and why is it important?

A

Data locality means moving the computation to the node (or at least the rack) where the data already resides, rather than shipping large amounts of data across the network to the computation. Minimizing network transfer significantly improves performance and efficiency.

7
Q

How many blocks will a 200MB file occupy in HDFS, assuming the default HDFS block size for Hadoop v2+?

A

2 blocks. The default block size in Hadoop v2+ is 128MB, so a 200MB file splits into one full 128MB block and one 72MB block.

8
Q

What is the default number of replicas for each block?

A

3 (the replication factor, set cluster-wide by dfs.replication in hdfs-site.xml).
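
Per-file replication can also be changed through the FileSystem API. A small sketch, with an illustrative path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // Raise or lower the replication factor for one file (path is illustrative).
    fs.setReplication(new Path("/data/example.txt"), 3.toShort)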

9
Q

How are these replicas typically distributed across the cluster? What is rack awareness?

A

Replicas are spread across nodes and racks for fault tolerance. Under the default placement policy, the first replica is written to the writer's local node, the second to a node on a different rack, and the third to a different node on that same remote rack. Rack awareness is the NameNode's knowledge of which DataNode lives in which rack, which lets it place replicas so the failure of an entire rack cannot destroy every copy of a block.
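
Rack awareness is supplied by the administrator, typically by pointing Hadoop at a topology script via net.topology.script.file.name (normally set in core-site.xml). A sketch using the Configuration API, with an assumed script path:

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    // The script maps a host to a rack path such as "/dc1/rack1";
    // the file location below is an assumption for illustration.
    conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh")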

10
Q

What is the job of the NameNode? What about the DataNode?

A

The NameNode manages the filesystem namespace and metadata: the directory tree, file-to-block mapping, and block locations. DataNodes store the actual data blocks and serve read/write requests from clients.
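
This split is visible from the client API: block metadata (offsets, hosts) comes from the NameNode, while the bytes themselves are read from DataNodes. A sketch with an illustrative path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/data/example.txt"))
    // The NameNode answers this metadata query; no block data moves.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
      println(s"offset=${b.getOffset} hosts=${b.getHosts.mkString(",")}")
    }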

11
Q

How many NameNodes exist on a cluster?

A

Typically one active NameNode per cluster. High-availability setups add a standby NameNode, and a federated cluster runs several independent NameNodes, each owning a slice of the namespace.

12
Q

How are DataNodes fault tolerant?

A

Each block is replicated across multiple DataNodes (3 by default). DataNodes send regular heartbeats to the NameNode; when one stops responding, the NameNode re-replicates its blocks from the surviving copies.

13
Q

How does a Standby NameNode make the NameNode fault tolerant?

A

It keeps a synchronized copy of metadata from the active NameNode and can take over if the active NameNode fails.

14
Q

What purpose does a Secondary NameNode serve?

A

It periodically merges the NameNode's fsimage with its edit log, producing a fresh checkpoint and keeping the edit log small, which reduces NameNode restart time. Despite the name, it is not a standby and cannot take over on failure.

15
Q

How might we scale an HDFS cluster past a few thousand machines?

A

HDFS Federation: multiple independent NameNodes, each managing a portion of the namespace, let a cluster scale to tens of thousands of machines.

16
Q

In a typical Hadoop cluster, what’s the relationship between HDFS data nodes and YARN node managers?

A

HDFS DataNodes and YARN NodeManagers typically coexist on the same nodes, enabling data locality for optimized processing.

17
Q

When does the combine phase run, and where does each combine task run?

A

The combine phase, when a combiner is configured, runs after the map phase; each combine task runs on the same node as the map task whose output it aggregates, shrinking the data before it is shuffled to the reducers. Hadoop may invoke the combiner zero or more times per map task.
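
In the driver, the combiner is opt-in. Because word-count's summing is associative and commutative, the reducer class itself can double as the combiner (SumReducer is the illustrative class from card 1):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance(new Configuration(), "word count")
    // The combiner may run zero or more times per map task, so it must
    // not change the final result when applied repeatedly.
    job.setCombinerClass(classOf[SumReducer])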

18
Q

Know the input and output of the shuffle + sort phase.

A

Input: the intermediate key-value pairs emitted by the mappers. Output: those pairs partitioned, sorted, and grouped by key, so each reducer receives (key, list of all values for that key).

19
Q

What does the NodeManager do?

A

Manages the resources (CPU, memory) of a single worker node, launches and monitors containers for task execution, and reports node status to the ResourceManager.

20
Q

What about the ResourceManager?

A

Arbitrates resources across the whole cluster and schedules applications onto it. It has two main components: the Scheduler and the ApplicationsManager.

21
Q

Which responsibilities does the Scheduler have?

A

Allocates cluster resources (containers) to running applications according to a policy such as FIFO, Capacity, or Fair scheduling. It is a pure scheduler: it does not monitor applications or restart failed tasks.

22
Q

What about the ApplicationsManager?

A

Accepts job submissions, negotiates the first container in which each application's ApplicationMaster runs, and restarts the ApplicationMaster container on failure.

23
Q

What is an ApplicationMaster? How many of them are there per job?

A

The ApplicationMaster manages the lifecycle of a single application (job): it negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor tasks. There is exactly one ApplicationMaster per job.