Apache Hadoop Flashcards
What is Hadoop?
An open-source software framework for distributed storage and processing of large data sets across a cluster of machines
What are the key components of Apache Hadoop? (3)
- HDFS
- YARN
- MapReduce
What is a node?
A physical or virtual machine that is part of a Hadoop cluster
What is a daemon?
A background process
State the daemons related to YARN (computing) (3)
- NodeManager daemon
- ResourceManager daemon
- JobHistoryServer daemon (strictly a MapReduce daemon that runs alongside YARN)
State the daemons related to HDFS (storage) (3)
- NameNode daemon
- DataNode daemon
- SecondaryNameNode daemon
Describe the characteristics of the leader in leader-follower architecture (4)
- Aware of the follower nodes
- Receives external requests
- Decides which nodes execute what and when
- Communicates with follower nodes
Describe the characteristics of the follower in leader-follower architecture (2)
- Acts as a worker node
- Executes the tasks the leader assigns to it
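The leader and follower roles described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not Hadoop code: the `Leader` and `Follower` classes, the worker names, and the round-robin assignment policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Worker node: executes whatever the leader assigns to it."""
    name: str
    completed: list = field(default_factory=list)

    def execute(self, task: str) -> str:
        self.completed.append(task)
        return f"{self.name} finished {task}"


@dataclass
class Leader:
    """Knows its followers, receives requests, and decides who runs what."""
    followers: list

    def handle_request(self, tasks: list) -> list:
        results = []
        for i, task in enumerate(tasks):
            # Round-robin assignment: a placeholder policy chosen for
            # simplicity, not the scheduling logic Hadoop actually uses.
            worker = self.followers[i % len(self.followers)]
            results.append(worker.execute(task))
        return results


# An external request arrives at the leader, which distributes the work.
leader = Leader([Follower("worker-1"), Follower("worker-2")])
results = leader.handle_request(["task-a", "task-b", "task-c"])
```

Note that the followers never talk to the outside world directly: every task reaches them through the leader.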
Which two nodes operate in a leader-follower architecture?
Leader: NameNode
Follower(s): DataNode
What is HDFS?
The Hadoop Distributed File System: storage shared among the nodes of the Hadoop cluster, tailored to MapReduce jobs
Where do daemons run?
On nodes
What is HDFS responsible for storing?
The input and output of MapReduce jobs
How is data stored within HDFS?
In blocks
What is the default block size?
128 MB (Hadoop 2.x and later; configurable)
How is the minimum parallelisation unit determined?
By the HDFS block size: by default, each mapper works on one block
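As a rough worked example of the block arithmetic, the sketch below estimates how many blocks a file occupies and therefore how many mappers could run in parallel. It assumes the default 128 MB block size and one mapper per block; the function names and the 1000 MB file size are hypothetical, and real jobs can change the split size through configuration.

```python
import math

BLOCK_SIZE_MB = 128  # default block size; assumed unchanged for this sketch


def num_blocks(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def last_block_size_mb(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> float:
    """Size of the final block, which may be smaller than a full block."""
    remainder = file_size_mb % block_size_mb
    return remainder if remainder else block_size_mb


file_size_mb = 1000                      # hypothetical input file
blocks = num_blocks(file_size_mb)        # 7 full blocks + 1 partial block
mappers = blocks                         # assuming one mapper per block
tail = last_block_size_mb(file_size_mb)  # 1000 - 7 * 128 = 104 MB
```

Under these assumptions a 1000 MB file is split into 8 blocks, so at most 8 mappers can process it at the same time.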