Hadoop Intro - Week 3 Flashcards

Question 1

Q

Name 3 challenges of Big Data.

Answer

A

Three major challenges with Big Data are storing large volumes of data, processing heterogeneous data, and accessing the data quickly.

Question 2

Q

What is Apache Hadoop?

Answer

A

Apache Hadoop is an open-source software framework for storing and processing Big Data in a parallel, distributed manner on large clusters of commodity hardware.

Question 3

Q

What is HDFS? What is YARN?

Answer

A

HDFS is Hadoop's storage layer: it stores any kind of data across the cluster.

YARN is Hadoop's resource-management layer: it allows the data stored in HDFS to be processed.

Question 4

Q

Describe HDFS. What is its structure?

Answer

A

HDFS (Hadoop Distributed File System):
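○ A distributed file system for storing multi-structured data.
○ Files are split into data blocks, which are stored on the DataNodes.
○ Data blocks are replicated across DataNodes, which ensures high availability.
○ Uses a master-worker (master/slave) architecture: one NameNode (master) and many DataNodes (workers).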
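
The block and replication ideas above can be sketched in a few lines of Python. This is a toy illustration, not how HDFS is implemented: the 128 MB block size and replication factor of 3 are assumed values, the function names and DataNode names are hypothetical, and the round-robin placement is a stand-in for HDFS's real placement policy.

```python
BLOCK_SIZE_MB = 128   # assumed block size for this sketch
REPLICATION = 3       # assumed replication factor for this sketch

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of the given size occupies."""
    full, rest = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if rest:
        sizes.append(rest)  # the last block only holds the remaining data
    return sizes

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + i) % len(datanodes)] for i in range(replication)
        ]
    return placement

blocks = split_into_blocks(300)  # a 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)     # [128, 128, 44]
print(placement)
```

Because every block lives on three different nodes in this sketch, losing any single DataNode still leaves two copies of each block available.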

Question 5

Q

What is a NameNode?

Answer

A

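In HDFS, the NameNode is the master node. It manages the DataNodes.
It performs health checks on the DataNodes by monitoring their heartbeats.
It keeps track of the file system metadata, such as which blocks are stored on which DataNodes and their sizes.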
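
The heartbeat-based health check can be sketched as follows. This is a minimal illustration with hypothetical names (`HeartbeatMonitor`, `dn1`, `dn2`) and an arbitrary 30-second timeout chosen for the example; it is not the NameNode's actual code or configuration.

```python
class HeartbeatMonitor:
    """Toy stand-in for the NameNode's view of DataNode liveness."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def heartbeat(self, node_id, now):
        """Record that a DataNode reported in at time `now` (in seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

monitor = HeartbeatMonitor(timeout_s=30)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=40))  # ['dn2']
```

A node that stops sending heartbeats is treated as failed, which is the point at which the replicas it held would need to be re-created elsewhere.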

Question 6

Q

What is a DataNode?

Answer

A

DataNodes store the actual data blocks.
DataNodes are worker nodes.
They respond to read and write requests.
They send heartbeats to the NameNode.

Question 7

Q

Describe YARN in detail

Answer

A

Yet Another Resource Negotiator:

○ Manages the processing activities on Big Data.
○ It is responsible for the allocation and scheduling of tasks in Hadoop.
○ Uses a Master-Slave/Worker architecture.
○ The master node is called the ResourceManager, while the slave nodes are called NodeManagers.

Data is processed in parallel across several worker/slave nodes, and processing results are merged at the master node.
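
The "process in parallel, then merge" pattern described above can be sketched with a small word count. This is only an analogy running in one Python process: the thread pool stands in for the worker nodes, the final loop stands in for the merge step, and the function names are hypothetical rather than part of any Hadoop API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Work done by one 'worker': count the words in its own chunk of data."""
    return Counter(chunk.split())

def process_in_parallel(chunks, workers=3):
    """Hand each chunk to a worker, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step: add up the per-chunk counts
    return total

chunks = ["big data big", "data cluster", "big cluster cluster"]
result = process_in_parallel(chunks)
print(result["big"], result["data"], result["cluster"])  # 3 2 3
```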

Question 8

Q

What is the ResourceManager?

Answer

A

● ResourceManager
○ Master daemon.
○ Receives processing requests and passes them on to the NodeManagers.
○ Aggregates results from the NodeManagers.
○ It is made up of 2 major components: the Scheduler and the ApplicationsManager.
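
To make the scheduling role concrete, here is a deliberately simplified sketch of a first-come, first-served allocator. It assumes each request only asks for memory and that each node has a fixed capacity; the `schedule` function and the app and node names are hypothetical, and YARN's real schedulers are considerably more elaborate.

```python
def schedule(requests, node_capacity_mb):
    """Assign each (app, memory_mb) request to the first node with enough free memory."""
    free = dict(node_capacity_mb)  # copy so the caller's dict is left untouched
    assignments = {}
    pending = []
    for app, memory_mb in requests:  # requests are served in arrival order
        for node, available in free.items():
            if available >= memory_mb:
                free[node] = available - memory_mb
                assignments[app] = node
                break
        else:
            pending.append(app)  # no node can host this request right now
    return assignments, pending

assignments, pending = schedule(
    [("app1", 4096), ("app2", 6144), ("app3", 4096), ("app4", 8192)],
    {"nm1": 8192, "nm2": 8192},
)
print(assignments)  # {'app1': 'nm1', 'app2': 'nm2', 'app3': 'nm1'}
print(pending)      # ['app4']
```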

Question 9

Q

What is the Secondary NameNode? What are the 2 files it works with, and what do they do?

Answer

A

● Secondary NameNode
○ A helper for the NameNode that performs checkpointing. It is not a standby replacement for the NameNode.
○ FSImage: a snapshot of the file system metadata as of the last checkpoint. It is stored on the disk of the NameNode.
○ EditLog: a log of the most recent changes, recorded by the NameNode since the last FSImage was written.
○ Checkpointing: the process of combining the FSImage and the EditLog into a new FSImage. It is carried out by the Secondary NameNode. Checkpointing keeps the EditLog from growing too large, which shortens the time the NameNode needs to restart and recover.
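
Checkpointing can be pictured as replaying a list of changes on top of a saved state. The sketch below assumes the FSImage is a plain dictionary from path to block count and the EditLog is a list of `create`/`delete` records; both formats and the `checkpoint` function are hypothetical simplifications, not the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Apply every logged edit to a copy of the saved state and return the new state."""
    new_image = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image

fsimage = {"/logs/a.txt": 3, "/logs/b.txt": 1}  # path -> number of blocks
edit_log = [("create", "/logs/c.txt", 2), ("delete", "/logs/a.txt", None)]

fsimage = checkpoint(fsimage, edit_log)
edit_log = []  # the edits are now folded into the image, so the log starts over
print(fsimage)  # {'/logs/b.txt': 1, '/logs/c.txt': 2}
```

The shorter the log that has to be replayed, the less work is needed to rebuild the current state, which is why regular checkpoints matter.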

(9 cards)