Basics Flashcards
Hadoop uses _____ as its basic data unit
Key value pairs
Difference between SQL and MapReduce
SQL = declarative (you state what result you want)
MapReduce = you specify the steps to solve the problem
Hadoop is best used as a write _____ read _____ type of data store
Once
Many
Under the MapReduce model, the data processing primitives are called
mappers and reducers
A _____ is a pair of consecutive words
Bigram
MapReduce uses ____ and _________ as its main data primitives
Lists
Key Value Pairs
In the MapReduce framework you write applications by specifying the ______ and the _______
mapper
reducer
In simple terms the mapper is meant to __________ the input into something that the reducer can ___________ over
filter and transform
aggregate
The input format for processing multiple files is usually
list(<String filename, String file_content>)
The input format for processing one large file, such as a log file is
list(<Integer line_number, String log_event>)
Authentication can be based on 1. ______________ or if security is turned on 2._______________
- user.name query parameter (as part of the HTTP query string)
- Kerberos (when security is turned on)
Security features of Hadoop consist of
authentication,
service level authorisation,
authentication for Web consoles and data confidentiality.
The simplest way to do authentication is
using the kinit command of Kerberos.
What is YARN?
Yet Another Resource Negotiator - A cluster management technology.
A large-scale, distributed operating system for big data applications.
Your MapReduce programs then process data, but they usually don’t read any HDFS files directly. Instead they …
Instead they rely on the MapReduce framework to read and parse the HDFS files into individual records (key/value pairs), which are the unit of data MapReduce programs work on.
After distributing the data to different nodes, the only time nodes communicate with each other is at the
“shuffle” step
The ________ serves as the base class for mappers and reducers
MapReduceBase class
To serve as a mapper, a class should
Inherit the MapReduceBase class and implement the Mapper interface
The single method on the mapper interface is
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException // processes an individual key/value pair
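For illustration, a minimal word-count style mapper written against this older org.apache.hadoop.mapred API; the class name is made up, and the input is assumed to be the default byte-offset/line pairs:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  // key = byte offset of the line, value = the line itself (TextInputFormat default)
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE); // emit (token, 1) for every word
    }
  }
}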
What does the output collector do?
Receives the output of the mapping process
What does the reporter do
provides the option to record extra information about the mapper as the task progresses.
What are some useful mapper implementations predefined by Hadoop
IdentityMapper
InverseMapper
RegexMapper
TokenCountMapper
IdentityMapper
Implements Mapper and maps inputs directly to outputs
InverseMapper
Implements Mapper and reverses the key/value pair
RegexMapper
Implements Mapper and generates a (match, 1) pair for every regular expression match
TokenCountMapper
Implements Mapper and generates a (token, 1) pair when the input value is tokenized
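As an illustration of reusing these predefined classes, a sketch of a job driver (old mapred API) that pairs TokenCountMapper with Hadoop's stock LongSumReducer to count tokens; the class name and paths are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class TokenCountJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TokenCountJob.class);
    conf.setJobName("token count");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(TokenCountMapper.class);  // predefined (token, 1) mapper
    conf.setCombinerClass(LongSumReducer.class);  // sum the 1s locally on each mapper
    conf.setReducerClass(LongSumReducer.class);   // and again across all mappers

    FileInputFormat.setInputPaths(conf, new Path("/input/docs"));     // hypothetical paths
    FileOutputFormat.setOutputPath(conf, new Path("/output/counts"));

    JobClient.runJob(conf);
  }
}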
The single method on the reducer interface is
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException
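A matching sketch of a word-count reducer implementing that interface (class name made up):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // aggregate all counts seen for this key
    }
    output.collect(key, new IntWritable(sum));
  }
}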
With multiple reducers, we need some way to determine the appropriate one to send a (key/value) pair outputted by a mapper. The default behavior is to
hash the key to determine the reducer. Hadoop enforces this strategy by use of the HashPartitioner class.
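For comparison, a sketch of a custom partitioner under the same old mapred API; HashPartitioner effectively hashes the key modulo the number of reducers, while this made-up FirstLetterPartitioner routes keys by their first character instead:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) { } // no job configuration needed here

  // Decide which reducer receives this key: here, by its first character
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int firstChar = s.isEmpty() ? 0 : s.charAt(0);
    return (firstChar & Integer.MAX_VALUE) % numPartitions;
  }
}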
One of the fundamental principles of MapReduce’s processing power is the splitting of the input data into chunks. You can process these chunks in parallel using multiple machines. In Hadoop terminology these chunks are called
input splits
The size of each (input) split should be small enough for
a more granular parallelization
If all the input data is in one split, then there is no parallelization.
The size of each split should be small enough for a more granular parallelization. (If all the input data is in one split, then there is no parallelization.) On the other hand, each split shouldn’t be so small that
the overhead of starting and stopping the processing of a split becomes a large fraction of execution time.
The principle of dividing input data (which often can be one single massive file) into splits for parallel processing explains some of the design decisions behind Hadoop’s generic FileSystem as well as HDFS in particular. For example, Hadoop’s FileSystem provides the class FSDataInputStream for file reading rather than using Java’s java.io.DataInputStream. FSDataInputStream extends DataInputStream with …
random read access, a feature that MapReduce requires because a machine may be assigned to process a split that sits right in the middle of an input file. Without random access, it would be extremely inefficient to have to read the file from the beginning until you reach the location of the split.
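As a small illustration of that random access, a sketch using Hadoop's FileSystem API; the path and offset are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Open a (hypothetical) large file and jump straight to a split boundary
    FSDataInputStream in = fs.open(new Path("/logs/huge.log"));
    in.seek(128L * 1024 * 1024); // e.g. start reading 128 MB into the file
    int firstByteOfSplit = in.read();
    System.out.println("first byte at offset: " + firstByteOfSplit);
    in.close();
  }
}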
What is the difference between HDFS blocks and input splits?
Input splits are the logical division of records whereas HDFS blocks are a physical division of the input data.
If Hadoop works on key/value pairs, how does it treat the lines of an input file by default?
Each line of the input file is a record. The key value pair consists of the byte offset (key) and the line (value) respectively.
The way an input file is split up and read by Hadoop is defined by
one of the implementations of the InputFormat interface.
__________ is the default input format implementation
TextInputFormat
TextInputFormat is the default InputFormat implementation. It’s useful for
input data that has no definite key value, when you want to get the content one line at a time.
KeyValueTextInputFormat
Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
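A sketch of selecting KeyValueTextInputFormat and overriding that separator property in a job configuration (old mapred API; the paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KeyValueJobSetup {
  public static JobConf configure() {
    JobConf conf = new JobConf(KeyValueJobSetup.class);

    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.set("key.value.separator.in.input.line", ","); // use a comma instead of the default tab

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("/input/csv"));    // hypothetical paths
    FileOutputFormat.setOutputPath(conf, new Path("/output/csv"));
    return conf;
  }
}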
SequenceFileInputFormat
An InputFormat for reading in sequence files. Key and Value are user defined
What is a sequence file?
a Hadoop- specific compressed binary file format
What are sequence files optimised for?
passing data between the output of one MapReduce job to the input of some other MapReduce job.
NLineInputFormat
Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
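Similarly, a brief sketch of selecting NLineInputFormat and setting N through that property (the input path is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static void configure(JobConf conf) {
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10); // every split gets exactly 10 lines
    FileInputFormat.setInputPaths(conf, new Path("/input/lines")); // hypothetical path
  }
}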
The input data to your MapReduce job does not necessarily have to be some external data. In fact it’s often the case that
the input to one MapReduce job is the output of some other MapReduce job.
In creating your own InputFormat class you should subclass the ___________ class, which takes care of ____________
FileInputFormat
file splitting (dividing files into splits)
All of the main InputFormat classes packaged with Hadoop subclass ___________, which implements the ___________ method but leaves _____________ abstract for the subclass to fill out.
FileInputFormat
getSplits()
getRecordReader()
FileInputFormat’s getSplits() implementation tries to divide the input data into roughly the number of splits specified in __________ In practice, a split usually ends up being the size of a block, which defaults to ____ in HDFS.
numSplits
64 MB
In using FileInputFormat you focus on _______________
customising RecordReader
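Tying the last few cards together, a skeletal sketch of a custom InputFormat: it subclasses FileInputFormat, which already provides getSplits(), and supplies only getRecordReader(). Here it simply delegates to the stock LineRecordReader; a real custom format would return its own RecordReader instead (the class name is made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Swap in a custom RecordReader here to change how records are parsed
    return new LineRecordReader(job, (FileSplit) split);
  }
}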