Ch. 1 - Hadoop the Definitive Guide Flashcards
Two problems with data
1. The first problem to solve is hardware failure: as soon as you start using many pieces of
hardware, the chance that one will fail is fairly high. A common way of avoiding data
loss is through replication: redundant copies of the data are kept by the system so that
in the event of failure, there is another copy available.
2. The second problem is that most analysis tasks need to be able to combine the data in
some way, and data read from one disk may need to be combined with the data from
any of the other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging.
The two parts of a MapReduce computation
the map and the reduce, and it’s the interface
between the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability.
In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and
analysis. What’s more, Hadoop is affordable since it runs on commodity hardware and
is open source.
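A minimal sketch of the two parts in Hadoop's Java MapReduce API (word count is used here purely as an illustration, not the book's example; class names are assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map: emit (word, 1) for every word in an input line
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce: sum the counts for each word; the shuffle between map and
  // reduce is where values with the same key get "mixed" together
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}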
What does YARN Stand for?
Yet Another Resource Negotiator
What does YARN do?
YARN is a cluster
resource management system, which allows any distributed program (not just MapReduce)
to run on data in a Hadoop cluster
What is HBase?
a key-value store that uses
HDFS for its underlying storage. HBase provides both online read/write access of individual
rows and batch operations for reading and writing data in bulk, making it a
good solution for building applications on.
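A sketch of online read/write access to an individual row with the HBase Java client (the table name "users", column family "info", and row key are assumptions; the table must already exist with that column family):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // write a single row
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
      table.put(put);
      // read the same row back
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    }
  }
}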
What kind of processing is MapReduce designed for?
Batch Processing
MapReduce compared to an RDBMS
Data size: petabytes. Access: batch. Updates: write once, read many times. Transactions: none. Structure: schema-on-read. Integrity: low. Scaling: linear.
RDBMS
Data size: gigabytes. Access: interactive and batch. Updates: read and write many times. Transactions: ACID. Structure: schema-on-write. Integrity: high. Scaling: nonlinear.
What type of data works best with an RDBMS?
Structured data is data that is organized into entities
that have a defined format, such as XML documents or database tables that conform to
a particular predefined schema
What types of data work well with Hadoop?
Semi-structured data,
on the other hand, is looser, and though there may be a schema, it is often ignored,
so it may be used only as a guide to the structure of the data: for example, a spreadsheet,
in which the structure is the grid of cells, although the cells themselves may hold any
form of data. Unstructured data does not have any particular internal structure: for
example, plain text or image data. Hadoop works well on unstructured or semistructured
data because it is designed to interpret the data at processing time, so called
schema-on-read. This provides flexibility, and avoids the costly data loading phase of
an RDBMS, since in Hadoop it is just a file copy
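A small sketch of schema-on-read in Java: the file is copied into HDFS as-is, and the field layout (the timestamp/level/message positions here are hypothetical) is applied only when the processing code runs:

public class SchemaOnReadExample {
  // Interpret a raw tab-separated line at processing time; nothing was
  // validated or restructured when the data was written.
  static String[] parse(String line) {
    return line.split("\t", 3);
  }

  public static void main(String[] args) {
    String raw = "2014-01-01T00:00:00\tINFO\tjob started";
    String[] fields = parse(raw);
    System.out.println("level = " + fields[1]);
  }
}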
Why is Relational data normalized?
Relational data is often normalized to retain its integrity and remove redundancy.
Normalization poses problems for Hadoop processing because it makes reading a record
a nonlocal operation, and one of the central assumptions that Hadoop makes is that it
is possible to perform (high-speed) streaming reads and writes.
MapReduce scales linearly
MapReduce—and the other processing models in Hadoop—scale linearly with the size
of the data. Data is partitioned, and the functional primitives (like map and reduce) can
work in parallel on separate partitions. This means that if you double the size of the
input data, a job will run twice as slowly. But if you also double the size of the cluster, a
job will run as fast as the original one. This is not generally true of SQL queries.
data locality
Hadoop tries to co-locate the data with the compute node, so data access is fast because
it is local. This feature, known as data locality, is at the heart of data processing in
Hadoop and is the reason for its good performance. Recognizing that network bandwidth
is the most precious resource in a data center environment (it is easy to saturate
network links by copying data around), Hadoop goes to great lengths to conserve it by
explicitly modelling network topology. Notice that this arrangement does not preclude
high-CPU analyses in Hadoop.
What is MPI?
Message Passing Interface, one of the application program interfaces (APIs) that the
high-performance computing (HPC) and grid computing communities have long used for
large-scale data processing. Broadly, the approach in HPC is to distribute the work across
a cluster of machines, which access a shared filesystem hosted by a Storage Area Network (SAN).
How does processing differ between MPI and Hadoop?
MPI gives great control to the programmer, but requires that they explicitly handle the
mechanics of the data flow, exposed via low-level C routines and constructs such as
sockets, as well as the higher-level algorithm for the analysis. Processing in Hadoop
operates only at the higher level: the programmer thinks in terms of the data model
(such as key-value pairs for MapReduce), while the data flow remains implicit