Basics Flashcards

1
Q

Hadoop uses _____ as its basic data unit

A

Key/value pairs

2
Q

Difference between SQL and MapReduce

A

SQL = declarative (you specify the result you want)

MapReduce = procedural (you specify the steps to solve the problem)

3
Q

Hadoop is best used as a write _____ read _____ type of data store

A

Once

Many

4
Q

Under the MapReduce model, the data processing primitives are called

A

mappers and reducers

5
Q

A _____ is a pair of consecutive words

A

Bigram

6
Q

MapReduce uses ____ and _________ as its main data primitives

A

Lists

Key/value pairs

7
Q

In the MapReduce framework you write applications by specifying the ______ and the _______

A

mapper

reducer

8
Q

In simple terms the mapper is meant to __________ the input into something that the reducer can ___________ over

A

filter and transform

aggregate

9
Q

The input format for processing multiple files is usually

A

list(&lt;String filename, String file_content&gt;)

10
Q

The input format for processing one large file, such as a log file, is

A

list(&lt;Integer line_number, String line&gt;)

11
Q

Authentication can be based on 1. ______________ or, if security is turned on, 2. _______________

A
1. the user.name query parameter (as part of the HTTP query string)
2. Kerberos (when security is turned on)
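
As an illustration: assuming this card refers to an HTTP interface such as WebHDFS (where the user.name parameter is defined), a simple-authentication request might look like the following (host, port, path, and user are illustrative):

http://namenode:50070/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice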
12
Q

Security features of Hadoop consist of

A

authentication,
service-level authorisation,
authentication for Web consoles, and
data confidentiality.

13
Q

The simplest way to do authentication is

A

using the kinit command of Kerberos.

14
Q

What is YARN?

A

Yet Another Resource Negotiator: a cluster-management technology.
A large-scale, distributed operating system for big data applications.

15
Q

Your MapReduce programs then process data, but they usually don’t read any HDFS files directly. Instead they …

A

Instead they rely on the MapReduce framework to read and parse the HDFS files into individual records (key/value pairs), which are the unit of data MapReduce programs work on.

16
Q

After distributing the data to different nodes, the only time nodes communicate with each other is at the

A

“shuffle” step

17
Q

The ________ serves as the base class for mappers and reducers

A

MapReduceBase class

18
Q

To serve as a mapper, a class should

A

Inherit the MapReduceBase class and implement the Mapper interface

19
Q

The single method on the Mapper interface is

A
void map(K1 key,
         V1 value,
         OutputCollector<K2, V2> output,
         Reporter reporter
        ) throws IOException
// processes an individual key/value pair
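
Below is a minimal sketch of a complete mapper using the old (org.apache.hadoop.mapred) API; the class name and the word-count logic are illustrative, not part of the card:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Inherits MapReduceBase and implements Mapper, as described above.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // key = byte offset of the line, value = the line itself
        // (the defaults under TextInputFormat)
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE); // emit (token, 1)
        }
    }
}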
20
Q

What does the output collector do?

A

Receives the output of the mapping process

21
Q

What does the reporter do?

A

Provides the option to record extra information about the mapper as the task progresses.

22
Q

What are some useful mapper implementations predefined by Hadoop?

A

IdentityMapper
InverseMapper
RegexMapper
TokenCountMapper

23
Q

IdentityMapper

A

Implements Mapper and maps inputs directly to outputs

24
Q

InverseMapper

A

Implements Mapper and reverses the key/value pair

25
Q

RegexMapper

A

Implements Mapper and generates a (match, 1) pair for every regular expression match

26
Q

TokenCountMapper

A

Implements Mapper and generates a (token, 1) pair when the input value is tokenized
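
As a usage sketch (class and job names are illustrative), a predefined mapper such as TokenCountMapper can be plugged straight into an old-API driver, paired here with Hadoop's predefined LongSumReducer:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class TokenCount {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(TokenCount.class);
        job.setJobName("token count");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // No custom mapper or reducer: reuse Hadoop's predefined classes.
        job.setMapperClass(TokenCountMapper.class); // emits (token, 1)
        job.setReducerClass(LongSumReducer.class);  // sums the 1s per token
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        JobClient.runJob(job);
    }
}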

27
Q

The single method on the Reducer interface is

A
void reduce(K2 key,
            Iterator<V2> values,
            OutputCollector<K3, V3> output,
            Reporter reporter
           ) throws IOException
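
A matching sketch of a complete reducer in the old API (again, the class name and the summing logic are illustrative):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        // Aggregate over every value the mappers emitted for this key.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}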
28
Q

With multiple reducers, we need some way to determine which one should receive a key/value pair output by a mapper. The default behavior is to

A

hash the key to determine the reducer. Hadoop enforces this strategy by use of the HashPartitioner class.
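
The heart of that strategy is a single line. A sketch of the hash-based logic (this mirrors what HashPartitioner's getPartition() does; the class name is illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SketchHashPartitioner<K, V> implements Partitioner<K, V> {
    public void configure(JobConf job) { }

    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the index is non-negative, then take it
        // modulo the number of reducers: equal keys always reach the
        // same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}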

29
Q

One of the fundamental principles of MapReduce’s processing power is the splitting of the input data into chunks. You can process these chunks in parallel using multiple machines. In Hadoop terminology these chunks are called

A

input splits

30
Q

The size of each (input) split should be small enough for

A

a more granular parallelization

If all the input data is in one split, then there is no parallelization.

31
Q

The size of each split should be small enough for a more granular parallelization. (If all the input data is in one split, then there is no parallelization.) On the other hand, each split shouldn’t be so small that

A

the overhead of starting and stopping the processing of a split becomes a large fraction of execution time.

32
Q

The principle of dividing input data (which often can be one single massive file) into splits for parallel processing explains some of the design decisions behind Hadoop’s generic FileSystem as well as HDFS in particular. For example, Hadoop’s FileSystem provides the class FSDataInputStream for file reading rather than using Java’s java.io.DataInputStream. FSDataInputStream extends DataInputStream with …

A

random read access, a feature that MapReduce requires because a machine may be assigned to process a split that sits right in the middle of an input file. Without random access, it would be extremely inefficient to have to read the file from the beginning until you reach the location of the split.

33
Q

What is the difference between HDFS blocks and input splits?

A

Input splits are the logical division of records whereas HDFS blocks are a physical division of the input data.

34
Q

If Hadoop works on key/value pairs, how does it treat the lines of an input file by default?

A

Each line of the input file is a record. The key/value pair consists of the byte offset (key) and the contents of the line (value).
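
For example (file contents illustrative), a file starting with the lines "hello world" and "goodbye" would yield the records (0, "hello world") and (12, "goodbye"), since the first line plus its newline occupies bytes 0 through 11.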

35
Q

The way an input file is split up and read by Hadoop is defined by

A

one of the implementations of the InputFormat interface.

36
Q

__________ is the default input format implementation

A

TextInputFormat

37
Q

TextInputFormat is the default InputFormat implementation. It’s useful for

A

input data that has no definite key value, when you want to get the content one line at a time.

38
Q

KeyValueTextInputFormat

A

Each line in the text files is a record. The first separator character divides each line. Everything before the separator is the key, and everything after is the value. The separator is set by the key.value.separator.in.input.line property, and the default is the tab (\t) character.
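
For example, with the default tab separator, an input line such as the following (contents illustrative, with &lt;tab&gt; standing for the tab character):

17:16:18&lt;tab&gt;http://example.com/index.html

would produce the key "17:16:18" and the value "http://example.com/index.html".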

39
Q

SequenceFileInputFormat

A

An InputFormat for reading in sequence files. Key and value are user-defined.

40
Q

What is a sequence file?

A

A Hadoop-specific compressed binary file format.

41
Q

What are sequence files optimised for?

A

passing data from the output of one MapReduce job to the input of another MapReduce job.

42
Q

NLineInputFormat

A

Same as TextInputFormat, but each split is guaranteed to have exactly N lines. The mapred.line.input.format.linespermap property, which defaults to one, sets N.
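
A quick configuration sketch in the old API (the value 10 is illustrative; this fragment assumes a surrounding driver):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

JobConf job = new JobConf();
job.setInputFormat(NLineInputFormat.class);
// Each split (and hence each mapper) now gets exactly 10 lines.
job.setInt("mapred.line.input.format.linespermap", 10);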

43
Q

The input data to your MapReduce job does not necessarily have to be some external data. In fact it’s often the case that

A

the input to one MapReduce job is the output of some other MapReduce job.

44
Q

In creating your own InputFormat class you should subclass the ___________ class, which takes care of ____________

A
FileInputFormat

file splitting (dividing files into splits)
45
Q

All of the main InputFormat classes packaged with Hadoop subclass ___________, which implements the ___________ method but leaves _____________ abstract for the subclass to fill out.

A

FileInputFormat
getSplits()
getRecordReader()
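
A minimal sketch of such a subclass in the old API (the class name is illustrative; it simply reuses Hadoop's LineRecordReader to show the structure):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {
    // getSplits() is inherited from FileInputFormat; only record
    // reading is left abstract for the subclass to fill out.
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter)
            throws IOException {
        reporter.setStatus(split.toString());
        return new LineRecordReader(job, (FileSplit) split);
    }
}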

46
Q

FileInputFormat’s getSplits() implementation tries to divide the input data into roughly the number of splits specified in __________. In practice, a split usually ends up being the size of a block, which defaults to ____ in HDFS.

A

numsplits

64 MB

47
Q

In using FileInputFormat you focus on _______________

A

customising the RecordReader