Hadoop Intro - Week 3 Flashcards
Name 3 challenges of big data
Three major challenges of Big Data are storing large volumes of data, processing heterogeneous data, and providing fast access to it.
What is Apache Hadoop?
An open-source software framework for storing and processing Big Data in a parallel and distributed manner on large clusters of commodity hardware.
What is HDFS? What is YARN?
HDFS is Hadoop's storage layer: it stores any kind of data across the cluster.
YARN is Hadoop's resource management layer: it allows the data stored in HDFS to be processed.
Describe HDFS. What is its structure?
HDFS (Hadoop Distributed File System):
○ A distributed FS for storing multi-structure data.
○ Files are stored in data blocks.
○ Data blocks are replicated across DataNodes, which ensures high availability.
○ Uses a master/worker architecture (NameNode as master, DataNodes as workers)
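The block and replication points above can be made concrete with a little arithmetic. This is a minimal sketch, not HDFS code: the function name `block_layout` is hypothetical, and the 128 MB block size and replication factor of 3 are assumed values (both are configurable in a real cluster).

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size; configurable in a real cluster
REPLICATION_FACTOR = 3   # assumed replication factor; also configurable

def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (blocks, last block size in MB, total replicas, raw storage in MB)."""
    if file_size_mb <= 0:
        return 0, 0, 0, 0
    # A file is cut into fixed-size blocks; only the last one may be smaller.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times on different DataNodes.
    total_replicas = num_blocks * replication
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, total_replicas, raw_storage_mb

print(block_layout(500))  # (4, 116, 12, 1500)
```

Under these assumptions a 500 MB file becomes 4 blocks (the last one 116 MB), stored as 12 block replicas that take 1500 MB of raw disk space across the cluster.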
What is a name node?
In HDFS, the NameNode is the master node. It manages the DataNodes.
Performs health checks on the DataNodes, using the heartbeats they send
Keeps track of the DataNodes' storage capacity and usage
What is a data node?
A DataNode stores the actual data blocks
DataNodes are worker nodes
They respond to read and write requests
They send heartbeats to the NameNode
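The heartbeat health check can be sketched as a small bookkeeping class. This is a toy model, not the real NameNode logic: `HeartbeatMonitor`, its methods, and the 30-second timeout are all hypothetical names and values chosen for illustration.

```python
class HeartbeatMonitor:
    """Toy stand-in for the NameNode's view of which DataNodes are alive."""

    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s  # hypothetical timeout, not a real HDFS setting
        self.last_seen = {}         # DataNode id -> time of its last heartbeat

    def heartbeat(self, node_id, now):
        # Called whenever a DataNode reports in.
        self.last_seen[node_id] = now

    def live_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_s)

    def dead_nodes(self, now):
        # A node that has been silent longer than the timeout is treated as dead.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)

monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=25.0)   # dn1 keeps reporting, dn2 goes silent
print(monitor.live_nodes(now=40.0))  # ['dn1']
print(monitor.dead_nodes(now=40.0))  # ['dn2']
```

Time is passed in explicitly as `now` so the example is deterministic. A real master would read a clock and, once a node is considered dead, arrange for its blocks to be re-replicated elsewhere.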
Describe YARN in detail
Yet Another Resource Negotiator:
○ Manages the processing activities carried out on Big Data.
○ It is responsible for the allocation and scheduling of tasks in Hadoop.
○ Uses a Master-Slave/Worker architecture
○ Master node is called Resource Manager, while Slave nodes are called Node Managers
Data is processed in parallel across several worker/slave nodes, and processing results are merged at the master node.
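The split, process in parallel, then merge pattern described above can be illustrated with a word count. This is only an analogy under stated assumptions: local threads stand in for worker nodes, the calling function plays the master, and `count_words` and `parallel_word_count` are hypothetical names, not part of any Hadoop API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Work done on one worker: count the words in its own chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def parallel_word_count(lines, num_workers=3):
    """Split the input, process the chunks in parallel, then merge the results."""
    # Deal the lines out round-robin so each worker gets its own chunk.
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # The merge step: combine the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["big data", "Big clusters", "data nodes store data"]
result = parallel_word_count(lines)
print(result["data"], result["big"])  # 3 2
```

Each worker only ever sees its own chunk, which is why the work can be spread across machines; the partial results stay small and are cheap to combine.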
What is a resource manager?
● Resource Manager
○ Master daemon.
○ Receives processing requests and forwards them to the Node Managers
○ Aggregates results from the Node Managers
○ It is made up of 2 major components: the Scheduler and the Application Manager
What is the Secondary NameNode? What are the 2 components it works with, and what do they do?
● Secondary NameNode
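○ FSImage: A snapshot of the complete file system namespace (files, directories, and block mappings) as of the last checkpoint. It is stored on the disk of the NameNode.
○ EditLog: A log of the most recent changes, i.e. those made since the last FSImage was written. The NameNode applies these changes to its in-memory copy of the namespace and also records them on disk.
○ Checkpointing: The process of combining the FSImage and the EditLog into a new FSImage. It is carried out by the Secondary NameNode. Checkpointing allows faster recovery when the NameNode restarts or fails, because it prevents the EditLog from growing too large.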
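Checkpointing can be sketched as replaying a list of changes onto a saved state. This is a toy model, not the real on-disk format: the FSImage is modeled as a dict from path to block ids, the EditLog as a list of `(operation, path, blocks)` tuples, and `apply_edit` and `checkpoint` are hypothetical helper names.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to a namespace dict (path -> list of block ids)."""
    op, path, blocks = edit
    if op == "create":
        namespace[path] = list(blocks)
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")

def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:          # replay the changes in the order they happened
        apply_edit(new_image, edit)
    return new_image, []           # the log starts over after a checkpoint

fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_3"]),
    ("delete", "/logs/a.txt", []),
]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage)   # {'/logs/b.txt': ['blk_3']}
print(edit_log)  # []
```

The cost of `checkpoint` grows with the length of the log, which is the reason for merging regularly: a short log means little replay work when the state has to be rebuilt.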