Ch 3 - Class 3 Hadoop File System Flashcards
What are the two components of Hadoop?
MapReduce and HDFS.
HDFS
A file system for managing data across machines' hard drives; it sits on top of the native file system on each hard drive.
Command interface
Used to communicate with HDFS and the hard drive.
What do you use to communicate with the server from your hard drive?
WinSCP.
What does the file system deal with?
Very large files; write once, read many times; high throughput.
Data size?
Block size: HDFS is divided into blocks, 64 MB by default, 128 MB in practice.
Can many files be on the same block?
No. Each HDFS block holds data from only one file, and a file smaller than the block size does not occupy a full block of underlying storage.
How do you check the status of the file system's blocks?
% hadoop fsck / -files -blocks   (the first / is the path to check)
Namenode
Manages the filesystem namespace: keeps the namespace image and tracks the blocks of every file and the block locations.
Cluster
Made up of a namenode and datanodes.
Single point of failure
The namenode. Its persistent metadata files are backed up so the filesystem can be recovered if it fails.
The system has two namenodes:
active and standby.
The datanode is known as
The workhorse of the file system: it stores and retrieves blocks and reports back to the namenode.
HDFS high availability
Uses a pair of namenodes in an active-standby configuration.
The standby has the latest edit log entries and an up-to-date block mapping in memory, so it can take over if the active namenode fails.
How do you set replication across the datanodes?
Set dfs.replication=3.
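In hdfs-site.xml that property looks like this (a minimal sketch):

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>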
Pseudo-distributed configuration
fs.default.name=hdfs://localhost/
dfs.replication=1
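As config files, that corresponds to something like the following (a sketch; newer Hadoop versions name the first property fs.defaultFS):

core-site.xml:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>

hdfs-site.xml:
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>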
Where is the default filesystem?
On the master machine, the namenode.
Where is the local filesystem?
On the server (the local disk of the machine where you run the command).
Command to copy a file from the local hard drive to HDFS
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
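Since fs.default.name points the default filesystem at hdfs://localhost/, the scheme and host can be left out, and a relative path is resolved against the user's HDFS home directory (/user/tom here):

  % hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
  % hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt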
For checksums
Use MD5 to check file integrity by comparing hashes.
md5sum uses a hash function that produces a 128-bit hash value, which gives a checksum to verify data integrity.
128 bits = 16 bytes = 32 hex characters.
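For example (file path reused from the earlier card; run it on the original and on the copy and compare the two 32-character digests):

  % md5sum input/docs/quangle.txt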
Split of data
You want each split of the data to fit on one block,
so use the block size for the split size.
HDFS
Just one implementation of the Hadoop filesystem; S3 is another.
Two ways to catch an exception
try/catch or try/finally.
finally
Runs regardless of whether an exception occurred; that is how it differs from catch,
and what makes it a strong method for cleanup.
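A minimal Java sketch of the difference (class name and file handling are illustrative):

  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  public class FinallyDemo {
    public static void main(String[] args) throws IOException {
      InputStream in = null;
      try {
        in = new FileInputStream(args[0]);  // may throw IOException
        System.out.println(in.read());      // print the first byte
      } finally {
        if (in != null) {
          in.close();  // runs whether or not an exception was thrown
        }
      }
    }
  }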
How do you tell whether something is an HDFS shell command or not?
An HDFS shell command starts with hadoop fs,
not hadoop URLCat, etc.
hadoop URLCat runs a Java program, not a built-in filesystem command.
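Side by side (reusing the quangle.txt path from the earlier card, and assuming the URLCat class is on the Hadoop classpath):

  % hadoop fs -cat hdfs://localhost/user/tom/quangle.txt   # HDFS shell command
  % hadoop URLCat hdfs://localhost/user/tom/quangle.txt    # runs a Java class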
FileSystemCat
A Java Hadoop program that reads a file from HDFS as a stream and writes it to standard output.
What makes it a complete Java program?
It has a public class and a main method.
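A sketch of such a program, using the Hadoop FileSystem API (pass an HDFS URI such as hdfs://localhost/user/tom/quangle.txt as the argument):

  import java.io.InputStream;
  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class FileSystemCat {
    public static void main(String[] args) throws Exception {
      String uri = args[0];
      Configuration conf = new Configuration();
      // Pick the FileSystem implementation (HDFS here) that matches the URI
      FileSystem fs = FileSystem.get(URI.create(uri), conf);
      InputStream in = null;
      try {
        in = fs.open(new Path(uri));                    // open the file stream
        IOUtils.copyBytes(in, System.out, 4096, false); // copy it to stdout
      } finally {
        IOUtils.closeStream(in);                        // closed even on error
      }
    }
  }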
Glob characters
Wildcards such as * and ?, used to match sets of file paths, similar in spirit to regular expressions.
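For example (directory layout is illustrative; quote the glob so the local shell does not expand it first):

  % hadoop fs -ls '/logs/2007/*'   # every entry directly under /logs/2007
  % hadoop fs -ls '/logs/200?'     # /logs/2000 through /logs/2009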
Datanode error during a write: what does the client do?
Adds the packets in the ack queue back to the front of the data queue, and
removes the failed datanode from the pipeline.
Namenode: arranges for the under-replicated block to get further replicas.
Failed datanode: deletes the partial block when the node recovers later on.
ack
Sent when the data is received by the datanode, not when it is written to disk.
Way to write in parallel to speed up the process of copying data
distcp: used for copying large amounts of data to and from Hadoop filesystems in parallel.
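Typical usage (host names are illustrative):

  % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

distcp is implemented as a MapReduce job, so the copy is done in parallel by the map tasks.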
RPC
Remote procedure call; how communication with the nodes is carried out.
HAR files
A file archiving facility that packs files into HDFS blocks more efficiently.
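Creating and then listing an archive looks like this (paths are illustrative; archived files are read back through the har:// scheme):

  % hadoop archive -archiveName files.har /my/files /my
  % hadoop fs -ls har:///my/files.har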
-l, -r
-l = long format; -r = recursive, shows all the entries in the subtree.
-p
Preserves file attributes (timestamp, ownership, permissions, etc.).