Apache Hadoop Flashcards
What is Hadoop?
open-source software framework for distributed storage and processing of large data sets, using a clustered network of machines
What are the key components of Apache Hadoop? (3)
- HDFS
- YARN
- MapReduce
What is a node?
a physical or virtual machine that is part of a Hadoop cluster
What is a daemon?
a background process
State the daemons related to YARN (computing) (3)
- NodeManager daemon
- ResourceManager daemon
- JobHistoryServer daemon
State the daemons related to HDFS (storage) (3)
- NameNode daemon
- DataNode daemon
- SecondaryNameNode daemon
Describe the characteristics of the leader in leader-follower architecture (4)
- Aware of the follower nodes
- Receives external requests
- Decides which nodes execute what and when
- Communicates with follower nodes
Describe the characteristics of the follower in leader-follower architecture (2)
- Acts as a worker node
- Executes tasks that leader tells it to
Which two nodes operate in a leader-follower architecture?
Leader: NameNode
Follower(s): DataNode
What is HDFS?
Shared distributed storage among the nodes of the Hadoop cluster, tailored to MapReduce jobs
Where do daemons run?
On nodes
What is the HDFS responsible for storing?
Input and output of MapReduce jobs
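A minimal sketch of how a job's input and output paths point at HDFS; the paths and job name are hypothetical, while `Job`, `FileInputFormat`, and `FileOutputFormat` are the standard MapReduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobIoSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job"); // hypothetical job name
        // The job reads its input blocks from HDFS...
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        // ...and writes its results back to HDFS.
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    }
}
```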
How is data stored within the HDFS?
In blocks
What is the default block size?
128MB
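As a sketch, the 128MB default (134,217,728 bytes) can be overridden through the `dfs.blocksize` property; clusters normally set it in hdfs-site.xml, but the client-side `Configuration` API below illustrates the same property:

```java
import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 128 MB in bytes; files written under this configuration
        // are split into blocks of this size.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        System.out.println(conf.get("dfs.blocksize")); // 134217728
    }
}
```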
How is the minimum parallelisation unit determined?
by the HDFS block size, e.g., each mapper works on one block
Why is 128MB the ideal block size?
it balances parallelisation opportunity (favours smaller blocks) with data processing throughput (favours larger blocks)
How does a file that is smaller than the block size occupy a block?
It occupies only as much disk space as the actual size of the file, not the entire 128MB
What is the purpose of the NameNode?
to manage the filesystem namespace: the filesystem tree and the metadata for all files and directories in the tree
What is the purpose of the DataNodes?
to store and retrieve blocks when instructed, and to implement block caching for frequently accessed blocks
Which node does the DataNode report to?
the NameNode
Where is the data for the filesystem tree and the related metadata stored?
persistently on the NameNode's local disk in the form of two files: the namespace image and the edit log
What does the NameNode know about the files in the HDFS?
Which DataNodes possess the blocks for a given file and where they are located (held in memory, not persisted)
How many DataNodes are there per cluster?
at least one
How many NameNodes are there per cluster?
only one
What is the purpose of the HDFS SecondaryNamenode?
to store a backup copy of the index table (it communicates periodically with the NameNode)
What information does the NameNode keep relating to the blocks?
An index table with (all) the locations of each block
What would happen if the machine running the NameNode was obliterated?
all the files on the filesystem would be lost, since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes
How many SecondaryNameNodes are there per cluster?
only one
What is meant by the “move computation to data” principle with HDFS?
blocks are stored on particular machines, and map tasks are scheduled to run locally on the machines that already hold their input blocks, so input data does not need to be moved across the network to the computation
Which feature of HDFS achieves the “move computation to data” principle?
Block replication (multiple replicas increase the chance that a task can run on a node that already holds its input block)
Why are blocks replicated over the cluster?
for fault-tolerance purposes, spreading replicas among different physical locations (e.g., different racks) to improve reliability
What is the default number of replicas for each block?
3
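A minimal sketch of how the replication factor is controlled, via the `dfs.replication` property (normally set in hdfs-site.xml; shown here with the `Configuration` API):

```java
import org.apache.hadoop.conf.Configuration;

public class ReplicationSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Each block written under this configuration is stored on 3 DataNodes.
        conf.setInt("dfs.replication", 3);
        System.out.println(conf.get("dfs.replication")); // 3
    }
}
```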
What is YARN?
Hadoop’s cluster resource management system
What is the relationship between a job and a task?
a job usually consists of multiple tasks
What are the Hadoop computation tasks? (3)
- Resource management
- Job allocation
- Job execution/monitoring
How is the number of map and reduce tasks estimated? (2)
Based on:
1. the input dataset
2. the job definition (defined by the user)
How can you calculate the number of mappers needed?
input size / split size (the split size is typically the HDFS block size)
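A worked example with illustrative numbers (10 GB of input, default 128 MB splits):

```java
public class MapperCountSketch {
    public static void main(String[] args) {
        long inputSizeMb = 10L * 1024; // 10 GB of input, in MB
        long splitSizeMb = 128;        // split size = default HDFS block size
        // Ceiling division: a partial final split still needs its own mapper.
        long mappers = (inputSizeMb + splitSizeMb - 1) / splitSizeMb;
        System.out.println(mappers);   // 80 mappers
    }
}
```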
What are the different schedulers available in YARN? (3)
- FIFO
- Capacity
- Fair
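As a sketch, the scheduler is selected with the `yarn.resourcemanager.scheduler.class` property; in practice this lives in yarn-site.xml on the ResourceManager, and the `Configuration` call below merely illustrates the property and the Fair Scheduler's class name:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        System.out.println(conf.get("yarn.resourcemanager.scheduler.class"));
    }
}
```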
Why is Hadoop not efficient with I/O? (2)
- data must be read from and written to HDFS
- shuffle and sort
Both incur long latency and produce large network traffic