Hadoop - Sheet1 Flashcards

1
Q

What is Hadoop?

A

Apache Hadoop is a free, open-source, Java-based software framework used to store, manage, and process large-scale data sets across clusters of commodity hardware

2
Q

Why is Hadoop useful?

A

Hadoop is fault tolerant: when a node is lost, the system redirects work to another node holding a replica of the data and resumes processing. Hadoop is also schema-less and can absorb data of all types, sources, and structures, allowing for deeper analysis

3
Q

What are the four modules that make up the Apache Hadoop framework?

A
  • Hadoop Common, which contains the common utilities and libraries required by Hadoop’s other modules
  • Hadoop YARN, the framework’s platform for resource management and job scheduling
  • Hadoop Distributed File System (HDFS), which stores data on commodity machines
  • Hadoop MapReduce, a programming model used to process large-scale data sets
4
Q

What are “slaves” and “masters” in Hadoop?

A

In Hadoop, the slaves file lists the hosts that run datanode and task tracker processes. The masters file lists the hosts that run secondary namenode servers.

5
Q

What is a Namenode?

A

The Namenode sits at the center of a Hadoop distributed file system cluster. It manages the file system’s metadata and keeps track of the datanodes, but does not store file data itself.

6
Q

How many Namenodes can run on a single Hadoop cluster?

A

In a classic (non-HA) deployment, only one Namenode process can run on a single Hadoop cluster, and the file system goes offline if that Namenode goes down. (Later Hadoop releases add a standby Namenode for high availability.)

7
Q

What is a datanode?

A

Unlike the Namenode, a datanode actually stores data within the Hadoop distributed file system. Each datanode runs in its own Java virtual machine process.

8
Q

How many datanodes can run on a single Hadoop cluster?

A

Each Hadoop slave node runs only one datanode process, so a cluster can have as many datanodes as it has slave nodes.

9
Q

What is job tracker in Hadoop?

A

Job tracker is used to submit and track jobs in MapReduce.

10
Q

How many job tracker processes can run on a single Hadoop cluster?

A

Like the Namenode, there can be only one job tracker process running on a single Hadoop cluster. The job tracker runs in its own Java virtual machine process. If the job tracker goes down, all currently active jobs stop.

11
Q

What sorts of actions does the job tracker process perform?

A
  • Client applications submit jobs to the job tracker.
  • The job tracker determines the location of the data by communicating with the Namenode.
  • The job tracker locates task tracker nodes that have open slots near the data.
  • The job tracker submits the job to those task tracker nodes.
  • The job tracker monitors the task tracker nodes for signs of activity; if a task tracker stops reporting, the job tracker transfers its work to a different task tracker node.
  • The job tracker receives a notification from a task tracker if a job has failed. From there, the job tracker might resubmit the job elsewhere, as described above, or it might blacklist either the job or the task tracker.
12
Q

How does job tracker schedule a job for the task tracker?

A

When a client application submits a job to the job tracker, the job tracker searches for a task tracker with an empty slot on the same server as the datanode holding the input data, and schedules the task there to preserve data locality.

13
Q

What does the mapred.job.tracker property do?

A

mapred.job.tracker is not a command but a configuration property: it specifies the host and port at which the job tracker process runs, so that clients and task trackers can locate it.
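
For illustration only, this property is typically set in mapred-site.xml; the host and port below are placeholders, not values from the source:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- Placeholder host:port; tells clients and task trackers where the job tracker runs -->
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>
```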

14
Q

What is “jps”?

A

jps (JVM Process Status) is a standard JDK command that lists running Java processes; in Hadoop it is used to check whether your task tracker, job tracker, datanode, and Namenode daemons are running.

15
Q

What is a “map” in Hadoop?

A

In Hadoop, a map is the first phase of a MapReduce job. A map reads data from an input location and outputs key-value pairs according to the input type.
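
Because Hadoop Streaming lets the map step be written in any language that consumes records and emits key-value pairs, the idea can be sketched in plain Python. This is a hypothetical word-count mapper, not code from the source:

```python
def map_words(lines):
    """Map phase sketch: read input records (lines of text) and emit
    (key, value) pairs. For word counting, each word becomes a key
    paired with the value 1."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)
```

For example, `list(map_words(["Hello world", "hello"]))` yields one `(word, 1)` pair per word; Hadoop would then shuffle and sort these pairs by key before handing them to a reducer.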

16
Q

What is a “reducer” in Hadoop?

A

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
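
The reduce side can be sketched the same way. This hypothetical Python reducer sums the values for each key; the `sorted` call simulates the shuffle/sort that Hadoop performs between the map and reduce phases:

```python
from itertools import groupby
from operator import itemgetter

def reduce_counts(pairs):
    """Reduce phase sketch: group the mapper's (key, value) pairs by key
    and sum the values to produce the final output."""
    # Hadoop delivers mapper output sorted by key; sort here to mimic that.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))
```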

17
Q

What are the parameters of mappers and reducers?

A

In the canonical WordCount example, the four type parameters for the mapper are:

  • LongWritable (input key)
  • Text (input value)
  • Text (intermediate output key)
  • IntWritable (intermediate output value)

The four type parameters for the reducer are:

  • Text (intermediate output key)
  • IntWritable (intermediate output value)
  • Text (final output key)
  • IntWritable (final output value)
18
Q

What is the difference between Input Split and an HDFS Block?

A

InputSplit and HDFS Block both refer to the division of data, but InputSplit handles the logical division while HDFS Block handles the physical division.
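
The distinction can be illustrated with a toy Python sketch (hypothetical sizes, and a simplification: real InputSplits record offsets and lengths rather than copying bytes, and the record reader handles boundary records). Physical blocks cut the file at fixed byte offsets regardless of record boundaries, while logical splits are adjusted so no record is cut in half:

```python
def physical_blocks(data: bytes, block_size: int):
    """HDFS-block-style physical division: fixed-size byte ranges,
    blind to record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def logical_splits(data: bytes, block_size: int, delimiter: bytes = b"\n"):
    """InputSplit-style logical division: each split is extended to the
    end of the current record so no record straddles two splits."""
    splits, start = [], 0
    while start < len(data):
        end = min(start + block_size, len(data))
        # If the cut falls mid-record, push it to the next delimiter.
        if end < len(data) and data[end - 1:end] != delimiter:
            nl = data.find(delimiter, end)
            end = len(data) if nl == -1 else nl + 1
        splits.append(data[start:end])
        start = end
    return splits
```

With `data = b"aa\nbb\ncc\n"` and a block size of 4, the physical blocks slice the second record in half, while the logical splits each contain only whole records.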