Ch. 1 - Hadoop: The Definitive Guide Flashcards

(34 cards)

1
Q

Two problems with reading and writing data across many disks

A

1. The first problem is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available.
2. The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging.

2
Q

two parts of the Hadoop MapReduce computation

A

the map and the reduce, and it’s the interface between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability. In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, Hadoop is affordable since it runs on commodity hardware and is open source.

3
Q

What does YARN stand for?

A

Yet Another Resource Negotiator

4
Q

What does YARN do?

A

YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.

5
Q

What is HBase?

A

a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access of individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.

6
Q

What kind of processing does MapReduce use?

A

Batch Processing

7
Q

MapReduce compared to RDBMS

A

Data size: petabytes. Access: batch. Updates: write once, read many times. Structure: schema-on-read. Integrity: low. Scaling: linear.

8
Q

RDBMS

A

Data size: gigabytes. Access: interactive and batch. Updates: read and write many times, with ACID transactions. Structure: schema-on-write. Integrity: high. Scaling: nonlinear.

9
Q

What type of data works with an RDBMS?

A

Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema.

10
Q

What type of data works with Hadoop?

A

Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time, so-called schema-on-read. This provides flexibility, and avoids the costly data loading phase of an RDBMS, since in Hadoop it is just a file copy.

11
Q

Why is Relational data normalized?

A

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.

12
Q

MapReduce scales linearly

A

MapReduce—and the other processing models in Hadoop—scale linearly with the size of the data. Data is partitioned, and the functional primitives (like map and reduce) can work in parallel on separate partitions. This means that if you double the size of the input data, a job will run twice as slowly. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.

13
Q

data locality

A

Hadoop tries to co-locate the data with the compute node, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modelling network topology. Notice that this arrangement does not preclude high-CPU analyses in Hadoop.

14
Q

What is MPI?

A

The Message Passing Interface, an API used in high-performance computing (HPC). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem hosted by a Storage Area Network (SAN).

15
Q

Processing in MPI vs. Hadoop

A

MPI gives great control to the programmer, but requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.

16
Q

shared-nothing architecture

A

MapReduce spares the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another.

17
Q

where does map function output go?

A

The output from the map function is processed by the MapReduce framework before being sent to the reduce function.

18
Q

what are the mapper class parameters

A

input key, input value, output key, and output value types of the map function
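
As a minimal sketch (using the max-temperature example that the next cards describe; the class name and concrete type choices here are illustrative, not the only possibility), the four parameters appear as the generic type arguments of the Mapper class:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The four type parameters are, in order: input key, input value,
// output key, and output value of the map function.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  // map() implementation omitted here; see the next card.
}
```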

19
Q

what are the types that hadoop uses

A
Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.
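
A sketch of such a map() method for the max-temperature example, assuming a fixed-width input line; the column offsets below are illustrative placeholders rather than a definitive record layout:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Convert the Text line into a Java String, then pull out the
    // columns of interest with substring(). The offsets are placeholders
    // for whatever fixed-width record layout the input actually uses.
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92).trim());

    // Emit (year, temperature) using Hadoop's serializable types.
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
```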
20
Q

4 parameters for reduce function?

A

Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.
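
A matching sketch of the reducer described here (same max-temperature example; the class name is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Iterate through the temperatures for this year, keeping a record
    // of the highest value found so far.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    // Output (year, maximum temperature).
    context.write(key, new IntWritable(maxValue));
  }
}
```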

21
Q

job object

A
A Job object forms the specification of the job and gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR file, we can pass a class in the Job’s setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.
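
A sketch of a driver that builds such a Job. The input and output paths come from the command line; the mapper and reducer class names follow the sketches in the previous cards and are assumptions, not something this card specifies:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Locate the JAR by a class it contains, rather than naming the JAR file.
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Submit the job and wait for it to finish; exit nonzero on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once packaged and on the cluster, a driver like this would typically be launched with the hadoop command, passing the input and output paths as its two arguments.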
22
Q

what is a mapreduce job?

A

A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

23
Q

what are the two types of nodes that control the job execution process?

A

a jobtracker and a number of tasktrackers

24
Q

what is a jobtracker?

A

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.

25
Q

what is a tasktracker?

A

Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

26
Q

what is a split?

A

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.

27
Q

what is a good size for a split?

A

Small splits give better load balancing, but if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the cluster (for all newly created files) or specified when each file is created.

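To make the relationship to the block size concrete, here is a simplified sketch (an assumption about how a FileInputFormat-style input format derives the split size, not the exact library source): the split size defaults to the block size unless a configured minimum or maximum overrides it.

```java
public class SplitSizeSketch {
  // Simplified view: the split size is the block size, clamped by any
  // configured minimum and maximum split sizes.
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 64L * 1024 * 1024;  // 64 MB, the default mentioned above
    long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
    System.out.println(splitSize == blockSize);  // true: split == block by default
  }
}
```
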
28
Q

what is the data locality optimization?

A

Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization because it doesn’t use valuable cluster bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.

29
Q

why is the optimal split size the same as the block size?

A

It is the largest size of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.

30
Q

why do map tasks write their output locally?

A

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it’s processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.

31
Q

where does reduce task input come from, and where is the output stored?

A

Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. In the present example, we have a single reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability.

32
Q

is the number of reduce tasks governed by the size of the input?

A

No. The number of reduce tasks is not governed by the size of the input, but instead is specified independently. In The Default MapReduce Job, you will see how to choose the number of reduce tasks for a given job.

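For example, continuing the Job driver sketch from card 21 (the choice of one reducer here is just an illustration):

```java
// The number of reduce tasks is set explicitly on the Job; it is not
// derived from the input size. Zero reducers is also legal (see card 33).
job.setNumReduceTasks(1);
```
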
33
Q

data flow for reduce tasks

A

The data flow for the general case of multiple reduce tasks is illustrated in the book’s Figure 2-4 (MapReduce data flow with multiple reduce tasks). This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in Shuffle and Sort. Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle because the processing can be carried out entirely in parallel.

34
Q

what limits mapreduce jobs?

A

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner function’s output forms the input to the reduce function. Because the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. The contract for the combiner function constrains the type of function that may be used.
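
As a sketch, reusing the hypothetical max-temperature classes from the earlier cards: a function like "max" satisfies this contract, because applying it zero, one, or many times along the way does not change the final result, so the reducer class itself can be registered as the combiner on the Job from card 21:

```java
// Continuing the Job driver sketch from card 21. The combiner runs on map
// output before the shuffle; here the reducer doubles as the combiner
// because "max" satisfies the combiner contract described above.
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
```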