Hadoop Midterm Flashcards

1
Q

Hadoop

A

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by MapReduce. There are
other parts to Hadoop, but these capabilities are its kernel.

2
Q

MapReduce

A

MapReduce is a batch query processor, and the
ability to run an ad hoc query against your whole dataset and get the results in a
reasonable time is transformative. It changes the way you think about data and
unlocks data that was previously archived on tape or disk.

3
Q

Why can’t we use databases with lots of disks to do large-scale batch analysis?
Why is MapReduce needed?

A

The answer to these questions comes from another trend in disk drives: seek time
is improving more slowly than transfer rate. Seeking is the process of moving the
disk’s head to a particular place on the disk to read or write data. It characterizes
the latency of a disk operation, whereas the transfer rate corresponds to a disk’s
bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write
large portions of the dataset than streaming through it, which operates at the
transfer rate.
An RDBMS
is good for point queries or updates, where the dataset has been indexed to deliver
low-latency retrieval and update times of a relatively small amount of data.
MapReduce suits applications where the data is written once and read many times,
whereas a relational database is good for datasets that are continually updated.
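To put rough, assumed numbers on this (about 10 ms per seek and 100 MB/s transfer rate): streaming 1 GB sequentially takes roughly 10 seconds, but reading the same 1 GB as 10,000 scattered 100 KB pieces adds 10,000 × 10 ms = 100 seconds of seeking on top of the transfer time, so a seek-dominated access pattern is an order of magnitude slower.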

4
Q

MapReduce is a linearly scalable programming model

A

The programmer writes two
functions—a map function and a reduce function—each of which defines a
mapping from one set of key-value pairs to another. These functions are oblivious
to the size of the data or the cluster that they are operating on, so they can be used
unchanged for a small dataset and for a massive one. More important, if you
double the size of the input data, a job will take twice as long to run. But if you also
double the size of the cluster, a job will run as fast as the original one. This is not
generally true of SQL queries.

5
Q

how MapReduce works

A

MapReduce works by breaking the processing into two phases: the map phase and
the reduce phase. Each phase has key-value pairs as input and output, the types of
which may be chosen by the programmer. The programmer also specifies two
functions: the map function and the reduce function.
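As a hedged illustration (a hypothetical word-count job, not an example from the book), the key-value pairs might flow like this:

map input:      (0, "the cat sat")  (12, "the dog sat")                 <- (byte offset, line of text)
map output:     ("the", 1) ("cat", 1) ("sat", 1) ("the", 1) ("dog", 1) ("sat", 1)
reduce input:   ("cat", [1]) ("dog", [1]) ("sat", [1, 1]) ("the", [1, 1])   <- grouped and sorted by key
reduce output:  ("cat", 1) ("dog", 1) ("sat", 2) ("the", 2)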

6
Q

Java MapReduce

A

Having run through how the MapReduce program works, the next step is to
express it in code. We need three things: a map function, a reduce function, and
some code to run the job. The map function is represented by the Mapper class,
which declares an abstract map() method. Example 2-3 shows the implementation
of our map method.
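Example 2-3 itself is not reproduced on this card. As a minimal sketch, assuming a word-count-style job with illustrative names, a Mapper subclass in the org.apache.hadoop.mapreduce API looks like this; the matching Reducer and the driver code that configures and submits the Job follow the same pattern:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output key: a word; output value: a count of 1 for each occurrence.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);  // emit one (word, 1) pair per token
      }
    }
  }
}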

7
Q

Data Flow

A

A MapReduce job is a unit of work that the client wants
to be performed: it consists of the input data, the MapReduce program, and
configuration information. Hadoop runs the job by dividing it into tasks, of which
there are two types: map tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker
and a number of tasktrackers. The jobtracker coordinates all the jobs run on the
system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send
progress reports to the jobtracker, which keeps a record of the overall progress of
each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

8
Q

splits

A

Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the
user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to
the time to process the whole input. So if we are processing the splits in parallel,
the processing is better load-balanced when the splits are small, since a faster
machine will be able to process proportionally more splits over the course of the
job than a slower machine. Even if the machines are identical, failed processes or
other jobs running concurrently make load balancing desirable, and the quality of
the load balancing increases as the splits become more fine-grained.

9
Q

split size

A

On the other hand, if splits are too small, the overhead of managing the splits and
of map task creation begins to dominate the total job execution time. For most
jobs, a good split size tends to be the size of an HDFS block, 64 MB by default,
although this can be changed for the cluster (for all newly created files) or specified
when each file is created.
Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization because it doesn’t use valuable
cluster bandwidth.
Sometimes, however, all three nodes hosting the HDFS block
replicas for a map task’s input split are running other map tasks, so the job
scheduler will look for a free map slot on a node in the same rack as one of the
blocks. Very occasionally even this is not possible, so an off-rack node is used,
which results in an inter-rack network transfer. The three possibilities are illustrated
in Figure 2-2.
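As a sketch only (the property name dfs.block.size is from the Hadoop 1.x era this card describes, and the file path is hypothetical), the two options look roughly like this in Java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default block size for files created through this configuration (128 MB instead of 64 MB).
    // Cluster-wide, the same property would be set in the cluster's configuration files.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);

    FileSystem fs = FileSystem.get(conf);
    // Per-file block size via create(path, overwrite, bufferSize, replication, blockSize).
    FSDataOutputStream out =
        fs.create(new Path("/data/big.log"), true, 4096, (short) 3, 128L * 1024 * 1024);
    out.close();
  }
}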

10
Q

optimal split size

A

It should now be clear why the optimal split size is the same as the block size: it is
the largest size of input that can be guaranteed to be stored on a single node. If the
split spanned two blocks, it would be unlikely that any HDFS node stored both
blocks, so some of the split would have to be transferred across the network to the
node running the map task, which is clearly less efficient than running the whole
map task using local data.

11
Q

where are maptask outputs

A

Map tasks write their output to the local disk, not to HDFS. Why is this? Map
output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete, the map output can be thrown away. So
storing it in HDFS with replication would be overkill. If the node running the map
task fails before the map output has been consumed by the reduce task, then
Hadoop will automatically rerun the map task on another node to re-create the
map output.

12
Q

where are reduce tasks written?

A

Reduce tasks don’t have the advantage of data locality; the input to a single reduce
task is normally the output from all mappers. In the present example, we have a
single reduce task that is fed by all of the map tasks. Therefore, the sorted map
outputs have to be transferred across the network to the node where the reduce
task is running, where they are merged and then passed to the user-defined reduce
function. The output of the reduce is normally stored in HDFS for reliability. As
explained in Chapter 3, for each HDFS block of the reduce output, the first replica
is stored on the local node, with other replicas being stored on off-rack nodes.
Thus, writing the reduce output does consume network bandwidth, but only as
much as a normal HDFS write pipeline consumes.
The whole data flow with a single reduce task is illustrated in Figure 2-3. The
dotted boxes indicate nodes, the light arrows show data transfers on a node, and
the heavy arrows show data transfers between nodes

13
Q

combiner

A

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it
pays to minimize the data transferred between map and reduce tasks. Hadoop
allows the user to specify a combiner function to be run on the map output, and the
combiner function’s output forms the input to the reduce function. Because the
combiner function is an optimization, Hadoop does not provide a guarantee of how
many times it will call it for a particular map output record, if at all.
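A minimal sketch of where the combiner is plugged in, assuming a word-count-style job with hypothetical WordCountMapper and WordCountReducer classes; because summing counts is commutative and associative, the reducer class can double as the combiner here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");   // book-era constructor
    job.setJarByClass(WordCountDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);  // combiner runs on map output, same function as the reducer
    job.setReducerClass(WordCountReducer.class);   // WordCountReducer: assumed Reducer<Text, IntWritable, Text, IntWritable> that sums counts
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}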

14
Q

Distributed file systems

A

Filesystems that manage the storage across a network of machines are called
distributed filesystems. Since they are network-based, all the complications of
network programming kick in, thus making distributed filesystems more complex
than regular disk filesystems. For example, one of the biggest challenges is making
the filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.

15
Q

hdfs concepts

A

HDFS is built around the idea that the most efficient data processing pattern is
a write-once, read-many-times pattern. A dataset is typically generated or
copied from source, and then various analyses are performed on that dataset
over time. Each analysis will involve a large proportion, if not all, of the
dataset, so the time to read the whole dataset is more important than the
latency in reading the first record.

16
Q

HDFS Concepts

Blocks

A

HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by
default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike a filesystem for a
single disk, a file in HDFS that is smaller than a single block does not occupy a full
block’s worth of underlying storage. When unqualified, the term “block” in this
book refers to a block in HDFS.
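One way to see a file's blocks directly is to ask the FileSystem API for their locations; the following is an illustrative sketch with a hypothetical path:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/big.log"));  // hypothetical file
    // One BlockLocation per block: its offset, length, and the datanodes holding replicas.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
    }
  }
}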

17
Q

WHY IS A BLOCK IN HDFS SO LARGE?

A

HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making
a block large enough, the time to transfer the data from the disk can be significantly longer than the time to
seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the
disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make
the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is
actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be
revised upward as transfer speeds grow with new generations of disk drives.
This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally operate on one block
at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they
could otherwise.
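Worked through with those figures: transferring a 100 MB block at 100 MB/s takes 1 second, and 1% of 1 second is 10 ms, which is exactly the assumed seek time, so a block of roughly 100 MB keeps the seek overhead down to about 1% of the read.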

18
Q

Types of nodes on hdfs cluster

A

Namenode and datanode.
An HDFS cluster has two types of nodes operating in a master-worker pattern: a
namenode (the master) and a number of datanodes (workers).

19
Q

what do name nodes do

A

The namenode
manages the filesystem namespace. It maintains the filesystem tree and the
metadata for all the files and directories in the tree. This information is stored
persistently on the local disk in the form of two files: the namespace image and the
edit log.

20
Q

compare hash values

A

md5 file1.txt file2.txt
(Prints one MD5 digest per file so the two values can be compared; on most Linux systems the equivalent command is md5sum file1.txt file2.txt.)