Ch. 1 - Hadoop: The Definitive Guide Flashcards

1
Q

What are the two problems with reading and writing data across many disks?

A

1. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available.
2. The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.

2
Q

What are the two parts of the computation in Hadoop?

A

The map and the reduce; it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability. In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, Hadoop is affordable since it runs on commodity hardware and is open source.

3
Q

What does YARN stand for?

A

Yet Another Resource Negotiator

4
Q

What does YARN do?

A

YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.

5
Q

What is HBase?

A

A key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.

6
Q

What kind of processing is MapReduce designed for?

A

Batch Processing

7
Q

How does MapReduce compare to an RDBMS?

A

MapReduce: data size in petabytes; access is batch; updates are write once, read many times; structure is schema-on-read; integrity is low; scaling is linear.

8
Q

How does an RDBMS compare to MapReduce?

A

RDBMS: data size in gigabytes; access is interactive and batch; data is read and written many times, with ACID transactions; structure is schema-on-write; integrity is high; scaling is nonlinear.

9
Q

What type of data works well with an RDBMS?

A

Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema.

10
Q

What type of data works well with Hadoop?

A

Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time, so-called schema-on-read. This provides flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it is just a file copy.

11
Q

Why is relational data normalized, and why is that a problem for Hadoop?

A

Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for Hadoop processing because it makes reading a record
a nonlocal operation, and one of the central assumptions that Hadoop makes is that it
is possible to perform (high-speed) streaming reads and writes.

12
Q

How does MapReduce scale with the size of the data?

A

MapReduce—and the other processing models in Hadoop—scale linearly with the size
of the data. Data is partitioned, and the functional primitives (like map and reduce) can
work in parallel on separate partitions. This means that if you double the size of the
input data, a job will run twice as slowly. But if you also double the size of the cluster, a
job will run as fast as the original one. This is not generally true of SQL queries.

13
Q

data locality

A

Hadoop tries to co-locate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modelling network topology. Notice that this arrangement does not preclude high-CPU analyses in Hadoop.

14
Q

What is MPI?

A

The Message Passing Interface, an application programming interface (API) used in high-performance computing (HPC). Broadly, the approach in HPC is to distribute the work across a cluster of machines that access a shared filesystem, hosted by a storage area network (SAN).

15
Q

How does processing differ between MPI and Hadoop?

A

MPI gives great control to the programmer, but requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.

16
Q

shared-nothing architecture

A

MapReduce spares the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.

17
Q

Where does the map function’s output go?

A

The output from the map function is processed by the MapReduce framework
before being sent to the reduce function.

18
Q

What are the Mapper class type parameters?

A

The input key, input value, output key, and output value types of the map function.

19
Q

What types does Hadoop use?

A

Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer). The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.
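
Here is a minimal sketch of such a mapper, assuming a weather-style dataset of fixed-width text records; the class name MaxTemperatureMapper and the column offsets are illustrative rather than taken from these cards.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Type parameters: input key (line offset in the file), input value (line of
  // text), output key (year), output value (temperature).

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    // Illustrative fixed-width offsets; real data would also need handling for
    // signs and missing readings.
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
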
20
Q

What are the four type parameters of the reduce function?

A

Again, four formal type parameters are used to specify the input and output types,
this time for the reduce function. The input types of the reduce function must
match the output types of the map function: Text and IntWritable. And in this
case, the output types of the reduce function are Text and IntWritable, for a
year and its maximum temperature, which we find by iterating through the
temperatures and comparing each with a record of the highest found so far.
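
A minimal sketch of the corresponding reducer, continuing the hypothetical max-temperature example above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Input types (Text, IntWritable) match the map output types; output is the
  // year and its maximum temperature.

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    // Iterate through the temperatures for this year, keeping the highest so far.
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}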

21
Q

What is a Job object?

A
A Job object forms the specification of the job and gives you control over how the
job is run. When we run this job on a Hadoop cluster, we will package the code
into a JAR file (which Hadoop will distribute around the cluster). Rather than
explicitly specify the name of the JAR file, we can pass a class in the Job’s
setJarByClass() method, which Hadoop will use to locate the relevant JAR file
by looking for the JAR file containing this class.
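
A minimal driver sketch along these lines, assuming the hypothetical MaxTemperatureMapper and MaxTemperatureReducer above; the job name and argument handling are illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();              // the specification of the job
    job.setJarByClass(MaxTemperature.class);  // Hadoop locates the JAR containing this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path from the command line

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
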
22
Q

What is a MapReduce job?

A

A MapReduce job is a unit of work that the client wants
to be performed: it consists of the input data, the MapReduce program, and
configuration information. Hadoop runs the job by dividing it into tasks, of which
there are two types: map tasks and reduce tasks.

23
Q

What are the two types of nodes that control the job execution process?

A

A jobtracker and a number of tasktrackers.

24
Q

What is a jobtracker?

A

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.

25
Q

What is a tasktracker?

A

Tasktrackers run tasks and send
progress reports to the jobtracker, which keeps a record of the overall progress of
each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

26
Q

What is a split?

A

Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits. Hadoop creates one map task for each split, which runs the
user-defined map function for each record in the split.

27
Q

What is a good size for a split?

A

If splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.

28
Q

What is the data locality optimization?

A

Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization because it doesn’t use valuable
cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block
replicas for a map task’s input split are running other map tasks, so the job
scheduler will look for a free map slot on a node in the same rack as one of the
blocks. Very occasionally even this is not possible, so an off-rack node is used,
which results in an inter-rack network transfer.

29
Q

Why is the optimal split size the same as the block size?

A

It should now be clear why the optimal split size is the same as the block size: it is
the largest size of input that can be guaranteed to be stored on a single node. If the
split spanned two blocks, it would be unlikely that any HDFS node stored both
blocks, so some of the split would have to be transferred across the network to the
node running the map task, which is clearly less efficient than running the whole
map task using local data.

30
Q

Why do map tasks write their output locally?

A

Map tasks write their output to the local disk, not to HDFS. Why is this? Map
output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete, the map output can be thrown away. So
storing it in HDFS with replication would be overkill. If the node running the map
task fails before the map output has been consumed by the reduce task, then
Hadoop will automatically rerun the map task on another node to re-create the
map output.

31
Q

Where is the output of reduce tasks stored?

A

Reduce tasks don’t have the advantage of data locality; the input to a single reduce
task is normally the output from all mappers. In the present example, we have a
single reduce task that is fed by all of the map tasks. Therefore, the sorted map
outputs have to be transferred across the network to the node where the reduce
task is running, where they are merged and then passed to the user-defined reduce
function. The output of the reduce is normally stored in HDFS for reliability.

32
Q

Is the number of reduce tasks governed by the size of the input?

A

No. The number of reduce tasks is not governed by the size of the input, but instead is specified independently. In The Default MapReduce Job, you will see how to choose the number of reduce tasks for a given job.
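
As a minimal sketch, the reducer count could be set explicitly in a driver such as the one above; the value 2 here is arbitrary and purely illustrative.

import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void configure(Job job) {
    // The number of reducers is a job setting, independent of the input size;
    // setting it to 0 would give a map-only job with no shuffle.
    job.setNumReduceTasks(2);
  }
}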

33
Q

What is the data flow with multiple reduce tasks?

A

The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4 (MapReduce data flow with multiple reduce tasks). This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort. Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle because the processing can be carried out entirely in parallel.

34
Q

What limits many MapReduce jobs, and what is a combiner function?

A

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it
pays to minimize the data transferred between map and reduce tasks. Hadoop
allows the user to specify a combiner function to be run on the map output, and the
combiner function’s output forms the input to the reduce function. Because the
combiner function is an optimization, Hadoop does not provide a guarantee of how
many times it will call it for a particular map output record, if at all. In other
words, calling the combiner function zero, one, or many times should produce the
same output from the reducer.
The contract for the combiner function constrains the type of function that may be
used.
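
As a minimal sketch, continuing the hypothetical max-temperature example: because taking a maximum is associative and commutative, the reducer class can double as the combiner, so running it zero, one, or many times on map output leaves the final result unchanged.

import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
  public static void configure(Job job) {
    // The combiner runs on map output before the shuffle to cut network traffic;
    // Hadoop gives no guarantee about how many times (if at all) it is invoked.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
  }
}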