Basics Flashcards
Hadoop uses _____ as its basic data unit
Key value pairs
Difference between SQL and MapReduce
SQL = declarative (you state what result you want)
MapReduce = you specify the steps to solve the problem
Hadoop is best used as a write _____ read _____ type of data store
Once
Many
Under the MapReduce model, the data processing primitives are called
mappers and reducers
A _____ is a pair of consecutive words
Bigram
MapReduce uses ____ and _________ as its main data primitives
Lists
Key Value Pairs
In the MapReduce framework you write applications by specifying the ______ and the _______
mapper
reducer
In simple terms the mapper is meant to __________ the input into something that the reducer can ___________ over
filter and transform
aggregate
The input format for processing multiple files is usually
list(<String filename, String file_content>)
The input format for processing one large file, such as a log file is
list(<Integer line_number, String log_event>)
Authentication can be based on 1. ______________ or if security is turned on 2._______________
- user.name query parameter (as part of the HTTP query string)
- Kerberos (when security is turned on)
Security features of Hadoop consist of
authentication,
service level authorisation,
authentication for Web consoles and data confidentiality.
The simplest way to do authentication is
using the kinit command of Kerberos.
What is YARN?
Yet Another Resource Negotiator - A cluster management technology.
A large-scale, distributed operating system for big data applications.
Your MapReduce programs then process data, but they usually don’t read any HDFS files directly. Instead they …
Instead they rely on the MapReduce framework to read and parse the HDFS files into individual records (key/value pairs), which are the unit of data MapReduce programs work on.
After distributing the data to different nodes, the only time nodes communicate with each other is at the
“shuffle” step
The ________ serves as the base class for mappers and reducers
MapReduceBase class
To serve as a mapper, a class should
Inherit the MapReduceBase class and implement the Mapper interface
The single method on the mapper interface is
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException // processes an individual key/value pair
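For illustration, a minimal word-count style mapper written against this older org.apache.hadoop.mapred API; the class name is made up, and the input is assumed to be the default byte-offset/line pairs:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  // key = byte offset of the line, value = the line itself (TextInputFormat default)
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE); // emit (token, 1) for every word
    }
  }
}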
What does the output collector do?
Receives the output of the mapping process
What does the reporter do
provides the option to record extra information about the mapper as the task progresses.
What are some useful mapper implementations predefined by Hadoop
IdentityMapper
InverseMapper
RegexMapper
TokenCountMapper
IdentityMapper
Implements Mapper and maps inputs directly to outputs
InverseMapper
Implements Mapper and reverses the key/value pair
RegexMapper
Implements Mapper and generates a (match, 1) pair for every regular expression match
TokenCountMapper
Implements Mapper and generates a (token, 1) pair when the input value is tokenized
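As an illustration of reusing these predefined classes, a sketch of a job driver (old mapred API) that pairs TokenCountMapper with Hadoop's stock LongSumReducer to count tokens; the class name and paths are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class TokenCountJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TokenCountJob.class);
    conf.setJobName("token count");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    conf.setMapperClass(TokenCountMapper.class);  // predefined (token, 1) mapper
    conf.setCombinerClass(LongSumReducer.class);  // sum the 1s locally on each mapper
    conf.setReducerClass(LongSumReducer.class);   // and again across all mappers

    FileInputFormat.setInputPaths(conf, new Path("/input/docs"));     // hypothetical paths
    FileOutputFormat.setOutputPath(conf, new Path("/output/counts"));

    JobClient.runJob(conf);
  }
}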
The single method on the reducer interface is
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException
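A matching sketch of a word-count reducer implementing that interface (class name made up):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // aggregate all counts seen for this key
    }
    output.collect(key, new IntWritable(sum));
  }
}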
With multiple reducers, we need some way to determine the appropriate one to send a (key/value) pair outputted by a mapper. The default behavior is to
hash the key to determine the reducer. Hadoop enforces this strategy by use of the HashPartitioner class.
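For comparison, a sketch of a custom partitioner under the same old mapred API; HashPartitioner effectively hashes the key modulo the number of reducers, while this made-up FirstLetterPartitioner routes keys by their first character instead:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) { } // no job configuration needed here

  // Decide which reducer receives this key: here, by its first character
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int firstChar = s.isEmpty() ? 0 : s.charAt(0);
    return (firstChar & Integer.MAX_VALUE) % numPartitions;
  }
}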
One of the fundamental principles of MapReduce’s processing power is the splitting of the input data into chunks. You can process these chunks in parallel using multiple machines. In Hadoop terminology these chunks are called
input splits
The size of each (input) split should be small enough for
a more granular parallelization
If all the input data is in one split, then there is no parallelization.
The size of each split should be small enough for a more granular parallelization. (If all the input data is in one split, then there is no parallelization.) On the other hand, each split shouldn’t be so small that
the overhead of starting and stopping the processing of a split becomes a large fraction of execution time.
The principle of dividing input data (which often can be one single massive file) into splits for parallel processing explains some of the design decisions behind Hadoop’s generic FileSystem as well as HDFS in particular. For example, Hadoop’s FileSystem provides the class FSDataInputStream for file reading rather than using Java’s java.io.DataInputStream. FSDataInputStream extends DataInputStream with …
random read access, a feature that MapReduce requires because a machine may be assigned to process a split that sits right in the middle of an input file. Without random access, it would be extremely inefficient to have to read the file from the beginning until you reach the location of the split.
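As a small illustration of that random access, a sketch using Hadoop's FileSystem API; the path and offset are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Open a (hypothetical) large file and jump straight to a split boundary
    FSDataInputStream in = fs.open(new Path("/logs/huge.log"));
    in.seek(128L * 1024 * 1024); // e.g. start reading 128 MB into the file
    int firstByteOfSplit = in.read();
    System.out.println("first byte at offset: " + firstByteOfSplit);
    in.close();
  }
}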
What is the difference between HDFS blocks and input splits?
Input splits are the logical division of records whereas HDFS blocks are a physical division of the input data.
If Hadoop works on key/value pairs, how does it treat the lines of an input file by default?
Each line of the input file is a record. The key value pair consists of the byte offset (key) and the line (value) respectively.
The way an input file is split up and read by Hadoop is defined by
one of the implementations of the InputFormat interface.
__________ is the default input format implementation
TextInputFormat
TextInputFormat is the default InputFormat implementation. It’s useful for
input data that has no definite key value, when you want to get the content one line at a time.
KeyValueTextInputFormat
Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
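A sketch of selecting KeyValueTextInputFormat and overriding that separator property in a job configuration (old mapred API; the paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class KeyValueJobSetup {
  public static JobConf configure() {
    JobConf conf = new JobConf(KeyValueJobSetup.class);

    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.set("key.value.separator.in.input.line", ","); // use a comma instead of the default tab

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path("/input/csv"));    // hypothetical paths
    FileOutputFormat.setOutputPath(conf, new Path("/output/csv"));
    return conf;
  }
}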
SequenceFileInputFormat
An InputFormat for reading in sequence files. Key and Value are user defined
What is a sequence file?
a Hadoop- specific compressed binary file format
What are sequence files optimised for?
passing data between the output of one MapReduce job to the input of some other MapReduce job.
NLineInputFormat
Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
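Similarly, a brief sketch of selecting NLineInputFormat and setting N through that property (the input path is made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static void configure(JobConf conf) {
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10); // every split gets exactly 10 lines
    FileInputFormat.setInputPaths(conf, new Path("/input/lines")); // hypothetical path
  }
}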
The input data to your MapReduce job does not necessarily have to be some external data. In fact it’s often the case that
the input to one MapReduce job is the output of some other MapReduce job.
In creating your own InputFormat class you should subclass the ___________ class, which takes care of ____________
FileInputFormat
file splitting (dividing files into splits)
All of the main InputFormat classes packaged with Hadoop subclass ___________, which implements the ___________ method but leaves _____________ abstract for the subclass to fill out.
FileInputFormat
getSplits()
getRecordReader()
FileInputFormat’s getSplits() implementation tries to divide the input data into roughly the number of splits specified in __________ In practice, a split usually ends up being the size of a block, which defaults to ____ in HDFS.
numSplits
64 MB
In using FileInputFormat you focus on _______________
customising RecordReader
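Tying the last few cards together, a skeletal sketch of a custom InputFormat: it subclasses FileInputFormat, which already provides getSplits(), and supplies only getRecordReader(). Here it simply delegates to the stock LineRecordReader; a real custom format would return its own RecordReader instead (the class name is made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Swap in a custom RecordReader here to change how records are parsed
    return new LineRecordReader(job, (FileSplit) split);
  }
}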