Apache Hadoop Flashcards
What is Hadoop?
open-source software framework for distributed storage and processing of large data sets, using a clustered network of machines
What are the key components of Apache Hadoop? (3)
- HDFS
- YARN
- MapReduce
What is a node?
a physical or virtual machine that is part of a Hadoop cluster
What is a daemon?
a background process
State the daemons related to YARN (computing) (3)
- NodeManager daemon
- ResourceManager daemon
- JobHistoryServer daemon
State the daemons related to HDFS (storage) (3)
- NameNode daemon
- DataNode daemon
- SecondaryNameNode daemon
Describe the characteristics of the leader in leader-follower architecture (4)
- Aware of the follower nodes
- Receives external requests
- Decides which nodes execute what and when
- Communicates with follower nodes
Describe the characteristics of the follower in leader-follower architecture (2)
- Acts as a worker node
- Executes tasks that leader tells it to
Which two nodes operate in a leader-follower architecture?
Leader: NameNode
Follower(s): DataNode
What is HDFS?
Shared distributed storage among the nodes of the Hadoop cluster, tailored to MapReduce jobs
Where do daemons run?
On nodes
What is the HDFS responsible for storing?
Input and output of MapReduce jobs
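A minimal sketch of how a job's input and output paths point at HDFS; the paths and job name are hypothetical, while `Job`, `FileInputFormat`, and `FileOutputFormat` are the standard MapReduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobIoSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job"); // hypothetical job name
        // The job reads its input blocks from HDFS...
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // ...and writes its results back to HDFS.
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    }
}
```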
How is data stored within the HDFS?
In blocks
What is the default block size?
128MB
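As a sketch, the 128MB default (134,217,728 bytes) can be overridden through the `dfs.blocksize` property; clusters normally set it in hdfs-site.xml, but the client-side `Configuration` API below illustrates the same property:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 128 MB in bytes; files written under this configuration
        // are split into blocks of this size.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println(conf.get("dfs.blocksize")); // 134217728
    }
}
```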
How is the minimum parallelisation unit determined?
by the HDFS block size, e.g., each mapper works on one block
Why is 128MB the ideal block size?
it balances parallelisation opportunity (favours smaller blocks) with data processing throughput (favours larger blocks)
How does a file that is smaller than the block size occupy a block?
It occupies only as much disk space as the actual size of the file, not the entire 128MB
What is the purpose of the NameNode?
to manage the filesystem namespace: the filesystem tree and the metadata for all files and directories in the tree
What is the purpose of the DataNodes?
to store and retrieve blocks when instructed, and to implement block caching for frequently accessed blocks
Which node does the DataNode report to?
the NameNode
Where is the data for the filesystem tree and the related metadata stored?
persistently on the NameNode's local disk in the form of two files: the namespace image and the edit log
What does the NameNode know about the files in the HDFS?
Which DataNodes possess the blocks for a given file and where they are located (held in memory, not persisted)
How many DataNodes are there per cluster?
at least one
How many NameNodes are there per cluster?
only one
What is the purpose of the HDFS SecondaryNamenode?
to store a backup copy of the index table (it communicates periodically with the NameNode)
What information does the NameNode keep relating to the blocks?
An index table with (all) the locations of each block
What would happen if the machine running the NameNode was obliterated?
all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes
How many SecondaryNameNodes are there per cluster?
only one
What is meant by the “move computation to data” principle with HDFS?
blocks are stored on particular machines, and map tasks are scheduled to run locally on the machines that already hold their input blocks, so input data does not need to be moved across the network to the computation
Which feature of HDFS achieves the “move computation to data” principle?
Block replication (multiple replicas increase the chance that a task can run on a node that already holds its input block)
Why are blocks replicated over the cluster?
for fault-tolerance purposes, spreading replicas among different physical locations (e.g., different racks) to improve reliability
What is the default number of replicas for each block?
3
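A minimal sketch of how the replication factor is controlled, via the `dfs.replication` property (normally set in hdfs-site.xml; shown here with the `Configuration` API):

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Each block written under this configuration is stored on 3 DataNodes.
        conf.setInt("dfs.replication", 3);
        System.out.println(conf.get("dfs.replication")); // 3
    }
}
```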
What is YARN?
Hadoop’s cluster resource management system
What is the relationship between a job and a task?
a job usually consists of multiple tasks
What are the Hadoop computation tasks? (3)
- Resource management
- Job allocation
- Job execution/monitoring
How is the number of map and reduce tasks estimated? (2)
Based on:
1. the input dataset
2. the job definition (defined by the user)
How can you calculate the number of mappers needed?
input size / split size (the split size is typically the HDFS block size)
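A worked example with illustrative numbers (10 GB of input, default 128 MB splits):

```java
public class MapperCountSketch {
    public static void main(String[] args) {
        long inputSizeMb = 10L * 1024; // 10 GB of input, in MB
        long splitSizeMb = 128;        // split size = default HDFS block size
        // Ceiling division: a partial final split still needs its own mapper.
        long mappers = (inputSizeMb + splitSizeMb - 1) / splitSizeMb;
        System.out.println(mappers);   // 80 mappers
    }
}
```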
What are the different schedulers available in YARN? (3)
- FIFO
- Capacity
- Fair
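As a sketch, the scheduler is selected with the `yarn.resourcemanager.scheduler.class` property; in practice this lives in yarn-site.xml on the ResourceManager, and the `Configuration` call below merely illustrates the property and the Fair Scheduler's class name:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```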
Why is Hadoop not efficient with I/O? (2)
- data must be read from and written to HDFS
- shuffle and sort
Both incur long latency and produce large network traffic