Hadoop and HDFS Flashcards

Question 1

Q

Hadoop

Answer

A

Apache open-source software framework for reliable, scalable,
distributed computing over massive amounts of data

Question 2

Q

Hadoop consists of 4 subprojects:

Answer

A

MapReduce

Hadoop Distributed File System (HDFS)

YARN

Hadoop Common

Question 3

Q

Hadoop: principles

Answer

A

Handles massive amounts of data:

Structured, unstructured, or semi-structured

Uses Google’s MapReduce and Google File System (GFS)
technologies as its foundation

Uses commodity hardware: relatively inexpensive
computers

Great performance for massively parallel processing
Works best with very large files rather than many small files
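
The MapReduce model named above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Map every line, then shuffle by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data node"])
print(counts)
```

Each call to map_phase depends on one line only, which is why this kind of work can be spread across many inexpensive machines.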

Question 4

Q

Hadoop: scalability

Answer

A

Vertical scalability: limited by disk latency (speed of reads and
writes)

Horizontal scalability: spread the data out across a huge
cluster of machines

Question 5

Q

Hadoop is good for:

Answer

A

processing massive amounts of data through parallelism

handling a variety of data (structured, unstructured,
semi-structured)

using inexpensive commodity hardware

Question 6

Q

Hadoop is not good for:

Answer

A

processing transactions (random access)

when work cannot be parallelized

low latency data access

processing lots of small files

intensive calculations with small amounts of data
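
One way to see why lots of small files are a poor fit is to count what the NameNode has to track. The sketch below assumes a 128 MB block size and roughly 150 bytes of NameNode memory per file or block object; both numbers are assumptions for illustration (the second is a commonly quoted rule of thumb, not a value from these cards), and namenode_objects is a hypothetical helper.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size: 128 MB
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb memory per object

def namenode_objects(file_sizes):
    """Count namespace objects: one per file plus one per block."""
    total = 0
    for size in file_sizes:
        blocks = math.ceil(size / BLOCK_SIZE)
        total += 1 + blocks
    return total

one_gib = 1024 ** 3
one_large_file = [one_gib]                     # 1 GiB stored as a single file
many_small_files = [one_gib // 10_000] * 10_000  # about the same data in 10,000 files

for label, sizes in [("one large file", one_large_file),
                     ("many small files", many_small_files)]:
    objects = namenode_objects(sizes)
    print(label, objects, "objects,", objects * BYTES_PER_OBJECT, "bytes")
```

Under these assumptions the same amount of data costs the NameNode 9 objects as one file and 20,000 objects as small files.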

Question 7

Q

NameNode reads FsImage into memory
DataNodes send block reports

Answer

A

NameNode Startup
Read FsImage
Read EditLog Transactions
Apply Transactions to FsImage
Save FsImage
Clear EditLog
Start Accepting Requests
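
The startup steps above can be sketched as a replay of logged transactions onto a saved snapshot. This is a toy model, not NameNode code: the namespace is a plain dict from path to block list, and the transaction tuples ("create", "add_block", "delete") and function names are invented for illustration.

```python
def apply_transaction(namespace, txn):
    """Apply one logged change to the in-memory namespace."""
    op, path = txn[0], txn[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(txn[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")

def namenode_startup(fsimage, edit_log):
    # Read FsImage: load the last saved snapshot into memory.
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    # Read EditLog transactions and apply them to the snapshot.
    for txn in edit_log:
        apply_transaction(namespace, txn)
    # Save FsImage: write out the merged state as the new snapshot.
    new_fsimage = {path: list(blocks) for path, blocks in namespace.items()}
    # Clear EditLog: its changes are now part of the snapshot.
    edit_log.clear()
    # The NameNode could start accepting requests at this point.
    return namespace, new_fsimage

fsimage = {"/a.txt": ["blk_1"]}
edit_log = [("create", "/b.txt"), ("add_block", "/b.txt", "blk_2"), ("delete", "/a.txt")]
namespace, new_fsimage = namenode_startup(fsimage, edit_log)
print(namespace, edit_log)
```

The same replay is what makes recovery from FsImage plus EditLog possible: the snapshot alone is stale, and the log alone has no starting point.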

Question 8

Q

What to do in case of a NameNode failure?

Answer

A

Recover from the FsImage and EditLog

Question 9

Q

Adding a file to HDFS:

Answer

A

Client asks the NameNode to determine which DataNodes will store the
replicas of each block
NameNode creates the file in the file system namespace
File is added to NameNode memory and the info is persisted in the edits log
Data is written in blocks to DataNodes
o The first DataNode starts a chained copy to two other DataNodes
o If at least one write for each block succeeds, the write is
successful
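
The write path above can be sketched as follows. This is a simplified in-memory model under loud assumptions: the block size is tiny so the example is readable, choose_datanodes uses a made-up round-robin placement (real HDFS placement is rack-aware), and the failed parameter is a hypothetical way to simulate DataNodes that are down.

```python
def split_into_blocks(data, block_size):
    """Cut the file contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def choose_datanodes(block_index, datanodes, replication):
    """Hypothetical round-robin placement, standing in for the NameNode's choice."""
    return [datanodes[(block_index + i) % len(datanodes)] for i in range(replication)]

def write_file(data, datanodes, block_size=4, replication=3, failed=frozenset()):
    storage = {name: [] for name in datanodes}
    for index, block in enumerate(split_into_blocks(data, block_size)):
        pipeline = choose_datanodes(index, datanodes, replication)
        written = 0
        # Chained copy: the block is passed along the pipeline node by node.
        for node in pipeline:
            if node in failed:
                continue  # skip a DataNode that is down
            storage[node].append((index, block))
            written += 1
        # The write succeeds if at least one replica of each block was stored.
        if written == 0:
            raise IOError(f"block {index} could not be written to any DataNode")
    return storage

storage = write_file(b"abcdefghij", ["dn1", "dn2", "dn3", "dn4"])
for node, blocks in storage.items():
    print(node, blocks)
```

Blocks that end up with fewer replicas than requested would still need to be re-replicated later; that step is left out of this sketch.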

Question 10

Q

HDFS

Answer

A

Distributed, scalable, fault tolerant, high throughput
NOT RANDOM ACCESS

Question 11

Q

Master: NameNode

Answer

A

manages the file system namespace and metadata
o FsImage
o Edits Log
o regulates client access to files

Question 12

Q

Slave: DataNode

Answer

A

manages storage attached to the nodes
periodically reports status to the NameNode

(12 cards)