Hadoop Intro - Week 3 Flashcards

1
Q

Name three challenges of Big Data.

A

Three major challenges with Big Data are storing large volumes of data, processing heterogeneous data, and providing fast access to it.

2
Q

What is Apache Hadoop?

A

An open-source software framework for storing and processing Big Data in a parallel, distributed manner on large clusters of commodity hardware.

3
Q

What is HDFS? What is YARN?

A

HDFS is the storage layer: it stores any kind of data across the cluster.

YARN is the resource management layer: it allocates cluster resources so that applications can process the data stored in HDFS.

4
Q

Describe HDFS. What is its structure?

A

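HDFS (Hadoop Distributed File System):
○ A distributed file system for storing multi-structured data.
○ Files are split into fixed-size data blocks (128 MB by default in Hadoop 2 and later).
○ Data blocks are replicated across DataNodes (three copies by default), which provides fault tolerance and high availability.
○ Uses a master/worker architecture: one NameNode (master) and many DataNodes (workers).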
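
The block and replication idea above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function name plan_blocks, the DataNode names, and the round-robin placement are assumptions made for the example (real HDFS uses a rack-aware placement policy). The block size and replication factor are passed in as parameters, with commonly cited values as defaults.

```python
import math

MB = 1024 * 1024

def plan_blocks(file_size, datanodes, block_size=128 * MB, replication=3):
    """Split a file into blocks and pick `replication` DataNodes per block.

    Hypothetical helper: placement is simple round-robin, which is an
    assumption for illustration and not the real HDFS placement policy.
    Returns a list of (block_index, block_length, [datanode, ...]).
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        # Choose `replication` distinct nodes, starting at a rotating offset.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan

if __name__ == "__main__":
    for block in plan_blocks(300 * MB, ["dn1", "dn2", "dn3", "dn4"]):
        print(block)
```

A 300 MB file with these settings yields three blocks (two full blocks and one shorter final block), each stored on three different DataNodes, so losing any single DataNode leaves every block readable.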

5
Q

What is a NameNode?

A

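In HDFS, the NameNode is the master node; it manages the DataNodes.
It monitors the health of the DataNodes through the heartbeats they send.
It keeps the file system metadata: which blocks make up each file, where each block is stored, and how much storage each DataNode uses.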

6
Q

What is a DataNode?

A

A DataNode stores the actual data blocks.
DataNodes are the worker nodes of HDFS.
They respond to read and write requests.
They send periodic heartbeats to the NameNode.
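
The heartbeat mechanism can be sketched as a small bookkeeping class on the NameNode side. This is a minimal illustration under stated assumptions: the class name HeartbeatMonitor, the 30-second timeout, and the explicit `now` timestamps are all hypothetical and chosen to keep the example deterministic; they are not HDFS settings.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # silence longer than this means "dead"
        self.last_seen = {}          # DataNode name -> last heartbeat time

    def heartbeat(self, node, now):
        """Record a heartbeat from `node` received at time `now` (seconds)."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=30)   # hypothetical timeout
    monitor.heartbeat("dn1", now=0)
    monitor.heartbeat("dn2", now=0)
    monitor.heartbeat("dn1", now=25)           # dn2 has gone silent
    print(monitor.dead_nodes(now=40))
```

In this run dn2 is reported as dead because it has been silent for 40 seconds, while dn1 was last heard from 15 seconds ago. In a real cluster the NameNode would then re-replicate the blocks that the dead DataNode held.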

7
Q

Describe YARN in detail.

A

Yet Another Resource Negotiator:

○ Manages the cluster resources used for processing Big Data.
○ It is responsible for the allocation and scheduling of tasks in Hadoop.
○ Uses a master/worker (master/slave) architecture.
○ The master node runs the ResourceManager, while the worker nodes run NodeManagers.

Data is processed in parallel across several worker nodes, in containers launched by the NodeManagers. A per-application ApplicationMaster coordinates the tasks; the ResourceManager allocates resources and does not itself merge the processing results.
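
The allocation role described above can be illustrated with a toy scheduler. This is a sketch under loud assumptions: the function name allocate, the node names, and the first-fit strategy based on memory alone are hypothetical. The schedulers that ship with YARN (for example the Capacity and Fair schedulers) use queues and more elaborate policies.

```python
def allocate(requests_mb, free_mb):
    """Place each container request on the first node with enough free memory.

    requests_mb: list of container sizes in MB.
    free_mb: dict mapping NodeManager name -> free memory in MB.
    Returns one entry per request: the chosen node name, or None if the
    request cannot be satisfied. First-fit is an assumption for illustration.
    """
    free = dict(free_mb)          # copy so the caller's dict is not modified
    placement = []
    for request in requests_mb:
        for node, available in free.items():
            if available >= request:
                free[node] = available - request   # reserve the memory
                placement.append(node)
                break
        else:
            placement.append(None)                 # no node had enough room
    return placement

if __name__ == "__main__":
    nodes = {"nm1": 4096, "nm2": 4096}
    print(allocate([2048, 2048, 4096, 1024], nodes))
```

The first two requests fit on nm1, the third fills nm2, and the last one cannot be placed until some container finishes and releases its memory.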

8
Q

What is the ResourceManager?

A

● ResourceManager
○ Master daemon of YARN.
○ Receives application submissions and allocates containers for them on the NodeManagers.
○ Tracks available resources through the status reports the NodeManagers send.
○ It is made up of two major components: the Scheduler and the ApplicationsManager.

9
Q

What is the Secondary NameNode? What are the two files it works with, and what do they do?

A

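● Secondary NameNode: a helper that performs checkpointing for the NameNode. It works with two files:
○ FsImage: a snapshot of the file system metadata (the namespace) as of the last checkpoint. It is stored on the disk of the NameNode.
○ EditLog: a log of the changes made since the last checkpoint. It is also persisted on disk; the NameNode keeps the current namespace in RAM.
○ Checkpointing: the process of merging the EditLog into the FsImage. It is carried out by the Secondary NameNode. Checkpointing keeps the EditLog from growing too large, which shortens NameNode restarts. The Secondary NameNode is not a standby and does not provide failover.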
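
Checkpointing can be pictured as replaying a list of logged operations onto a saved snapshot. The sketch below is an assumption-heavy illustration: it models the FsImage as a dictionary from file path to block IDs and the EditLog as a list of tuples, and the function name checkpoint and the "create"/"delete" operation names are hypothetical. The real on-disk formats are different.

```python
def checkpoint(fsimage, editlog):
    """Merge the EditLog into the FsImage and return (new_fsimage, new_editlog).

    fsimage: dict mapping file path -> list of block IDs (snapshot).
    editlog: list of ("create", path, blocks) or ("delete", path) tuples.
    Both structures are simplified stand-ins chosen for this example.
    """
    new_image = dict(fsimage)            # leave the old snapshot untouched
    for op, path, *args in editlog:
        if op == "create":
            new_image[path] = args[0]    # record the file and its blocks
        elif op == "delete":
            new_image.pop(path, None)    # drop the file if it exists
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image, []                 # the EditLog starts empty again

if __name__ == "__main__":
    image = {"/logs/a.txt": [1, 2]}
    edits = [("create", "/logs/b.txt", [3]), ("delete", "/logs/a.txt")]
    print(checkpoint(image, edits))
```

After the merge the snapshot already reflects every logged change, so a restart only has to load the new FsImage and replay whatever short EditLog has accumulated since.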
