Hadoop and HDFS Flashcards

1
Q

Hadoop

A

Apache open-source software framework for reliable, scalable,
distributed computing over massive amounts of data

2
Q

Hadoop consists of 4 subprojects:

A

MapReduce

Hadoop Distributed File System (HDFS)

YARN

Hadoop Common

3
Q

Hadoop: principles

A

Handles massive amounts of data:

Structured, unstructured, or semi-structured

Uses Google’s MapReduce and Google File System (GFS)
technologies as its foundation

Uses commodity hardware: relatively inexpensive
computers

Great performance for massively parallel processing

Works best with very large files rather than many small files

4
Q

Hadoop: scalability

A

Vertical scalability: bounded by disk latency (speed of reads and
writes) on a single machine

Horizontal scalability: spread the data out across a huge
cluster of machines

5
Q

Hadoop is good for:

A

processing massive amounts of data through parallelism

handling a variety of data (structured, unstructured,
semi-structured)

using inexpensive commodity hardware

6
Q

Hadoop is not good for:

A

processing transactions (random access)

work that cannot be parallelized

low latency data access

processing lots of small files

intensive calculations with small amounts of data
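
A rough sketch of why lots of small files are a poor fit: the NameNode tracks every file and every block, so the number of tracked objects grows with the file count rather than with the data volume. The 128 MiB block size and the helper name `namenode_objects` are assumptions made for illustration only.

```python
# Assumed block size of 128 MiB; the actual value is configurable per cluster.
BLOCK_SIZE = 128 * 1024 * 1024

def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count the files plus blocks that would have to be tracked (hypothetical helper)."""
    # Integer ceiling division: a file of s bytes occupies ceil(s / block_size) blocks.
    blocks = sum(-(-size // block_size) for size in file_sizes)
    return len(file_sizes) + blocks

# One 1 GiB file: 1 file entry + 8 blocks.
one_large = namenode_objects([1024 ** 3])

# 10,000 files of 100 KiB each (slightly less data in total): 10,000 files + 10,000 blocks.
many_small = namenode_objects([100 * 1024] * 10_000)

print(one_large, many_small)
```

Under these assumptions, roughly the same amount of data costs thousands of times more tracked objects when it is stored as small files.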

7
Q

NameNode reads FsImage into memory
DataNodes send block reports

A

NameNode Startup
Read FsImage
Read EditLog Transactions
Apply Transactions to FsImage
Save FsImage
Clear EditLog
Start Accepting Requests
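
The startup steps above can be sketched as a replay of logged transactions onto the last saved snapshot. This is a toy model, not the real NameNode code: the dict-based FsImage, the tuple-based EditLog, and the function name `namenode_startup` are all assumptions made for illustration.

```python
def namenode_startup(fsimage, edit_log):
    """Toy replay of an edit log onto a checkpoint (hypothetical model)."""
    namespace = dict(fsimage)               # read FsImage into memory
    for op, path, meta in edit_log:         # read and apply each EditLog transaction
        if op == "create":
            namespace[path] = meta
        elif op == "delete":
            namespace.pop(path, None)
    new_fsimage = dict(namespace)           # save the merged state as a new FsImage
    edit_log.clear()                        # clear the EditLog
    return namespace, new_fsimage           # ready to start accepting requests

fsimage = {"/data/a.txt": {"replication": 3}}
edits = [
    ("create", "/data/b.txt", {"replication": 3}),
    ("delete", "/data/a.txt", None),
]
namespace, checkpoint = namenode_startup(fsimage, edits)
print(sorted(namespace))
```

After the replay, the in-memory namespace and the new checkpoint agree, and the log is empty, which is why the next startup has less work to do.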

8
Q

What to do if the NameNode fails?

A

Recover from the FsImage
and EditLog

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Adding a file to HDFS:

A
  1. Client asks the NameNode to determine which DataNodes will store the
    replicas of each block
  2. NameNode creates the file in the file system namespace
  3. File is added to NameNode memory, and the change is persisted in the edits log
  4. Data is written in blocks to the DataNodes
    o The first DataNode starts a chained copy to two other DataNodes
    o If at least one write for each block succeeds, the write is
    successful
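
The write path above can be sketched as a small simulation: split the data into blocks, pick three DataNodes per block, and treat the write as successful if every block lands on at least one of them. The tiny block size, the round-robin placement, and all function names are assumptions for illustration; real HDFS placement is rack-aware and the real client API looks nothing like this.

```python
BLOCK_SIZE = 4      # bytes, deliberately tiny for the demo
REPLICATION = 3     # first DataNode plus the chained copy to two others

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def choose_datanodes(datanodes, num_blocks, replication=REPLICATION):
    """Hypothetical round-robin placement; assumes len(datanodes) >= replication."""
    return [
        [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    ]

def write_file(data, datanodes, failed=()):
    """Return (success, storage), where storage maps DataNode -> {block index: bytes}."""
    blocks = split_into_blocks(data)
    placement = choose_datanodes(datanodes, len(blocks))
    storage = {}
    for index, (block, pipeline) in enumerate(zip(blocks, placement)):
        # Chained copy: every healthy DataNode in the pipeline stores the block.
        written = [dn for dn in pipeline if dn not in failed]
        for dn in written:
            storage.setdefault(dn, {})[index] = block
        if not written:
            return False, storage   # no replica of this block was written
    return True, storage

nodes = ["dn1", "dn2", "dn3", "dn4"]
ok, storage = write_file(b"hello hdfs!", nodes, failed={"dn2"})
print(ok, sorted(storage))
```

With one DataNode down, each block still reaches at least one healthy node, so the write counts as successful even though some blocks have fewer than three replicas.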
10
Q

HDFS

A

Distributed, scalable, fault tolerant, high throughput
NOT RANDOM ACCESS

11
Q

Master: NameNode

A

manages the file system namespace and metadata
    o FsImage
    o Edits Log
regulates client access to files

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Slave: DataNode

A

manages storage attached to the nodes
periodically reports status to NameNode
