Hadoop and HDFS Flashcards
Hadoop
Apache open-source software framework for reliable, scalable,
distributed computing over massive amounts of data
Consists of 4 sub projects:
MapReduce
Hadoop Distributed File System (HDFS)
YARN
Hadoop Common
Hadoop: principles
Handles massive amounts of data:
Structured, unstructured, or semi-structured
Uses Google’s MapReduce and Google File System (GFS)
technologies as its foundation
Uses commodity hardware: relatively inexpensive
computers
Great performance for massively parallel processing
Works best with a few very large files rather than many small files
Hadoop: scalability
Vertical scalability: adding resources to a single machine;
limited by disk latency (speed of reads and writes)
Horizontal scalability: spreading the data out across a huge
cluster of machines
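To make the contrast concrete, here is a rough sketch of scan time for one disk versus many disks reading in parallel. The 1 TB dataset and the 100 MB/s disk throughput are illustrative assumptions, not figures from these cards, and the model ignores network and coordination overhead.

```python
TB = 10**12  # bytes
MB = 10**6   # bytes

def scan_seconds(data_bytes: int, disk_bytes_per_sec: int, num_disks: int = 1) -> float:
    """Idealized time to read data_bytes when it is split evenly across num_disks."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return data_bytes / (disk_bytes_per_sec * num_disks)

# Assumed workload: 1 TB of data, each disk reading at 100 MB/s.
single = scan_seconds(1 * TB, 100 * MB)        # one machine, one disk
cluster = scan_seconds(1 * TB, 100 * MB, 100)  # same data spread over 100 disks

print(f"single disk: {single:.0f} s, 100 disks: {cluster:.0f} s")
```

Under these assumptions a faster single disk only shrinks the time per byte, while adding machines divides the total scan time.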
Hadoop is good for:
processing massive amounts of data through parallelism
handling a variety of data (structured, unstructured,
semi-structured)
using inexpensive commodity hardware
Hadoop is not good for:
processing transactions (random access)
when work cannot be parallelized
low latency data access
processing lots of small files
intensive calculations with small amounts of data
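The small-files item follows from the NameNode holding an in-memory entry for every file and every block. The sketch below is a rough count, assuming a 128 MB block size (a common HDFS default; the cards do not state one) and illustrative file sizes.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes

def namenode_objects(file_sizes) -> int:
    """Count namespace objects: one per file plus one per block."""
    files = len(file_sizes)
    blocks = sum(math.ceil(size / BLOCK_SIZE) for size in file_sizes)
    return files + blocks

ONE_GIB = 1024 * 1024 * 1024
one_big_file = [ONE_GIB]                     # 1 GiB stored as a single file
many_small_files = [ONE_GIB // 8192] * 8192  # the same 1 GiB as 8192 files of 128 KiB

print(namenode_objects(one_big_file))
print(namenode_objects(many_small_files))
```

The same amount of data costs the NameNode far more metadata when it arrives as many small files, because every small file still needs its own file entry and its own block entry.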
NameNode reads the FsImage into memory
DataNodes send block reports
NameNode Startup
Read FsImage
Read EditLog Transactions
Apply Transactions to FsImage
Save FsImage
Clear EditLog
Start Accepting Requests
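The startup steps above can be sketched as a replay of logged transactions over a checkpoint. This is a toy model: the dictionary layout and the operation names ("create", "add_block", "delete") are hypothetical stand-ins, not the real FsImage or EditLog formats.

```python
import copy

def apply_edit(namespace: dict, edit: tuple) -> None:
    """Apply one logged transaction to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")

def namenode_startup(fsimage: dict, edit_log: list) -> tuple:
    """Replay the edit log over the checkpoint; return (new_fsimage, cleared_log)."""
    namespace = copy.deepcopy(fsimage)      # read FsImage into memory
    for edit in edit_log:                   # read and apply each transaction
        apply_edit(namespace, edit)
    new_fsimage = copy.deepcopy(namespace)  # save the merged state as a new FsImage
    return new_fsimage, []                  # clear the EditLog, then accept requests

# Hypothetical checkpoint and log, for illustration only.
old_image = {"/logs/a.txt": {"blocks": ["blk_1"]}}
pending = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("delete", "/logs/a.txt"),
]
new_image, cleared_log = namenode_startup(old_image, pending)
print(new_image)
```

Recovery after a NameNode failure follows the same pattern: the last saved FsImage plus the surviving EditLog entries reproduce the namespace.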
What to do if the NameNode fails?
Recover from the FsImage and EditLog
Adding a file to HDFS:
- Client asks the NameNode to determine which DataNodes will store the
replicas of each block
- NameNode creates the file in the file system namespace
- File is added to NameNode memory and the change is persisted in the edits log
- Data is written in blocks to DataNodes
o DataNode starts a chained copy to two other DataNodes
o if at least one write for each block succeeds, the write is
successful
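The write path above can be sketched as a small in-memory simulation. Everything here is an assumption for illustration: the round-robin placement (real HDFS placement is rack-aware), the replication factor of 3, and the function and parameter names.

```python
REPLICATION = 3  # assumed: the first DataNode plus two chained copies

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def choose_datanodes(block_index: int, datanodes: list) -> list:
    """Hypothetical round-robin placement; real HDFS placement is rack-aware."""
    count = min(REPLICATION, len(datanodes))
    return [datanodes[(block_index + k) % len(datanodes)] for k in range(count)]

def write_file(path, data, block_size, datanodes, storage, edit_log, failed=frozenset()):
    """Write data as replicated blocks; return {block_index: [nodes that stored it]}."""
    edit_log.append(("create", path))  # NameNode persists the new file entry
    placement = {}
    for index, block in enumerate(split_into_blocks(data, block_size)):
        written = []
        for node in choose_datanodes(index, datanodes):  # chained copy, node to node
            if node in failed:
                continue  # a failed node is dropped from the chain
            storage.setdefault(node, {})[(path, index)] = block
            written.append(node)
        if not written:  # not even one replica stored: the write fails
            raise IOError(f"block {index} of {path} could not be written")
        placement[index] = written
    return placement

# Hypothetical cluster of four DataNodes, one of which is down.
storage, edit_log = {}, []
placement = write_file("/data/f.bin", b"abcdefghij", 4,
                       ["dn1", "dn2", "dn3", "dn4"], storage, edit_log,
                       failed={"dn2"})
print(placement)
```

In this run the block whose chain includes the failed node ends up with two replicas instead of three, and the write still counts as successful because each block was stored at least once.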
HDFS
Distributed, scalable, fault tolerant, high throughput
NOT RANDOM ACCESS
Master: NameNode
manages the file system namespace and metadata
FsImage
Edits Log
regulates client access to files
Slave: DataNode
manages storage attached to the nodes
periodically reports status to the NameNode
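The periodic status report can be pictured as a heartbeat table kept by the NameNode. This is a minimal sketch with hypothetical names and an arbitrary 30-second timeout; it does not reproduce the actual HDFS intervals or message contents. Timestamps are passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Toy NameNode-side view of which DataNodes have reported recently."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of its latest report

    def heartbeat(self, datanode: str, now: float) -> None:
        """Record a status report from a DataNode."""
        self.last_seen[datanode] = now

    def live_nodes(self, now: float) -> list:
        """DataNodes whose latest report is within the timeout."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen <= self.timeout_seconds)

    def dead_nodes(self, now: float) -> list:
        """DataNodes that have been silent for longer than the timeout."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)

monitor = HeartbeatMonitor(timeout_seconds=30)  # assumed timeout
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)  # dn1 keeps reporting, dn2 goes silent

print(monitor.live_nodes(now=40), monitor.dead_nodes(now=40))
```

A node that stops reporting is treated as unavailable, which is what lets the master notice lost storage and keep the file system fault tolerant.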