Apache Hadoop Flashcards

1
Q

What is Hadoop?

A

open-source software framework for distributed storage and processing of large data sets, using a clustered network of machines

2
Q

What are the key components of Apache Hadoop? (3)

A
  1. HDFS
  2. YARN
  3. MapReduce
3
Q

What is a node?

A

a physical or virtual machine that is part of a Hadoop cluster

4
Q

What is a daemon?

A

a background process

5
Q

State the daemons related to YARN (computing) (3)

A
  1. NodeManager daemon
  2. ResourceManager daemon
  3. JobHistoryServer daemon
6
Q

State the daemons related to HDFS (storage) (3)

A
  1. NameNode daemon
  2. DataNode daemon
  3. SecondaryNameNode daemon
7
Q

Describe the characteristics of the leader in leader-follower architecture (4)

A
  1. Aware of the follower nodes
  2. Receives external requests
  3. Decides which nodes execute what and when
  4. Communicates with follower nodes
8
Q

Describe the characteristics of the follower in leader-follower architecture (2)

A
  1. Acts as a worker node
  2. Executes the tasks the leader assigns to it
9
Q

Which two HDFS node types operate in a leader-follower architecture?

A

Leader: NameNode
Follower(s): DataNode

10
Q

What is HDFS?

A

Shared distributed storage among the nodes of the Hadoop cluster, tailored to MapReduce jobs

11
Q

Where do daemons run?

A

On nodes

12
Q

What is the HDFS responsible for storing?

A

Input and output of MapReduce jobs

13
Q

How is data stored within the HDFS?

A

In blocks

14
Q

What is the default block size?

A

128 MB
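
Note: as a hedged illustration, a minimal Java sketch using the standard org.apache.hadoop client API (assuming a reachable HDFS cluster) that reads the configured default block size, which comes from the dfs.blocksize property:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Default block size used for new files under the given path.
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + (blockSize / (1024 * 1024)) + " MB");
        }
    }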

15
Q

How is the minimum parallelisation unit determined?

A

By the HDFS block size, e.g., each mapper works on one block

16
Q

Why is 128 MB the ideal block size?

A

It balances parallelisation opportunity, which favours smaller blocks (more tasks can run at once), against data processing throughput, which favours larger blocks (seek and task-startup overhead is amortised over more data)

17
Q

How does a file that is smaller than block size occupy the block?

A

It occupies only as much disk space as the actual size of the file, not the entire 128 MB

18
Q

What is the purpose of the NameNode?

A

To manage the filesystem namespace: the filesystem tree and the metadata for all files and directories in the tree

19
Q

What is the purpose of the DataNodes?

A

To store and retrieve blocks when instructed (by clients or the NameNode), and to cache frequently accessed blocks

20
Q

Which node does the DataNode report to?

A

the NameNode

21
Q

Where is the data for the filesystem tree and the related metadata stored?

A

persistently on the local disk in the form of two files: the namespace image and the edit log

22
Q

What does the NameNode know about the files in the HDFS?

A

Which DataNodes hold the blocks for a given file and where those blocks are located (kept in memory rather than persisted; rebuilt from DataNode block reports)
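
Note: illustrative only; a minimal Java sketch, assuming an existing HDFS file at the hypothetical path /data/input.txt, that asks the NameNode for this block-to-DataNode mapping through the standard FileSystem API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical path
            // The NameNode answers from memory; the mapping is rebuilt from
            // DataNode block reports rather than read from disk.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(block); // offset, length, and the DataNode hosts
            }
        }
    }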

23
Q

How many DataNodes are there per cluster?

A

at least one

24
Q

How many NameNodes are there per cluster?

A

only one

25
Q

What is the purpose of the HDFS SecondaryNameNode?

A

To keep a backup copy of the NameNode's metadata by periodically merging the namespace image with the edit log (it communicates periodically with the NameNode)

26
Q

What information does the NameNode keep relating to the blocks?

A

An index table with (all) the locations of each block

27
Q

What would happen if the machine running the NameNode was obliterated?

A

All the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes

28
Q

How many SecondaryNameNodes are there per cluster?

A

only one

29
Q

What is meant by the “move computation to data” principle with HDFS?

A

Blocks are stored on particular machines, and map tasks are scheduled to run on a machine that already holds a copy of their input block, so input data does not have to be moved across the network to the computation

30
Q

Which feature of HDFS achieves the “move computation to data” principle?

A

Block replication (replicas mean several machines hold each block, so a task can usually be scheduled on a machine that already has its data)

31
Q

Why are blocks replicated over the cluster?

A

for fault-tolerance purposes, spreading replicas among different physical locations to improve reliability

32
Q

What is the default number of replicas for each block?

A

3
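
Note: a hedged sketch of where the default comes from and how it can be changed per file; the path is hypothetical, and dfs.replication is the standard property behind the default of 3:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default number of replicas for new blocks (3 out of the box).
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
            FileSystem fs = FileSystem.get(conf);
            // Raise the replica target for one existing (hypothetical) file to 5.
            fs.setReplication(new Path("/data/input.txt"), (short) 5);
        }
    }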

33
Q

What is YARN?

A

Hadoop’s cluster resource management system

34
Q

What is the relationship between a job and a task?

A

a job usually consists of multiple tasks

35
Q

What are the Hadoop computation tasks? (3)

A
  1. Resource management
  2. Job allocation
  3. Job execution/monitoring
36
Q

How is the number of map and reduce tasks estimated? (2)

A

Based on:
1. input dataset
2. job definition (defined by user)
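
Note: to make the "job definition" half concrete, a minimal sketch using the org.apache.hadoop.mapreduce API (the job name is hypothetical); the user fixes the number of reduce tasks explicitly, while the number of map tasks falls out of how the input dataset divides into splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class JobSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word-count"); // hypothetical job name
            // Reducers are part of the user's job definition;
            // mappers are derived from the input splits instead.
            job.setNumReduceTasks(4);
        }
    }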

37
Q

How can you calculate the number of mappers needed?

A

input size/split size
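
Note: a worked example of the formula as plain Java arithmetic; a 10 GB input with the default 128 MB split size needs 80 mappers:

    public class MapperCount {
        // Ceiling division: a final partial split still needs its own mapper.
        static long mappers(long inputBytes, long splitBytes) {
            return (inputBytes + splitBytes - 1) / splitBytes;
        }

        public static void main(String[] args) {
            long input = 10L * 1024 * 1024 * 1024; // 10 GB
            long split = 128L * 1024 * 1024;       // 128 MB (default block size)
            System.out.println(mappers(input, split)); // prints 80
        }
    }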

38
Q

What are the different schedulers available in YARN? (3)

A

  1. FIFO scheduler
  2. Capacity scheduler
  3. Fair scheduler
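
Note: a hedged configuration sketch; the scheduler is pluggable and is selected on the ResourceManager via the yarn.resourcemanager.scheduler.class property, shown here through the YarnConfiguration API:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SchedulerConfig {
        public static void main(String[] args) {
            YarnConfiguration conf = new YarnConfiguration();
            // RM_SCHEDULER is the constant for yarn.resourcemanager.scheduler.class.
            conf.set(YarnConfiguration.RM_SCHEDULER,
                    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
            System.out.println(conf.get(YarnConfiguration.RM_SCHEDULER));
        }
    }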

39
Q

Why is Hadoop not efficient with I/O? (2)

A
  1. Data must be loaded from and written to HDFS
  2. The shuffle and sort phase
Both incur long latency and produce large network traffic