class 2 Flashcards
file system types
fs (File system)
DFS (distributed file system)
HDFS (Hadoop distributed file system)
drawbacks of ‘FS’
- Storing large amounts of data
- Processing large amounts of data
- Data loss:
- power failure
- network failure
- hardware or software failure
- Auto metadata concepts
DFS:
multi-node concept
all remaining drawbacks are the same as those of FS
cluster
group of machines in a network
Hadoop distributions
Cloudera
Hortonworks
IBM BigInsights
Pivotal HD
MapR
Cloud computing:
Amazon EMR
Google Cloud
Windows Azure
Hadoop framework
Hadoop 1.x -> 2011 (1.0.0 release)
Hadoop 2.x -> YARN -> 2013
Hadoop 1.x
- Name node -> stores metadata
- Data node -> stores the actual data
- Secondary name node -> maintains a backup (periodic checkpoint) of the name node metadata
- Job tracker -> splits the job into tasks and assigns the tasks to task trackers
- Task tracker -> executes the tasks
client request process sequence
- client sends a request to the name node (which stores metadata)
- if it is a new request, the metadata is updated
- if it is an existing request, the location of the actual data (data node) is returned
- secondary name node backs up the name node
- job tracker splits the job into tasks and assigns them to task trackers
- task trackers are started on the nodes where the data resides
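The sequence above can be sketched in a few lines of Python. This is only a toy model for revision, not Hadoop code: the class `NameNode`, the function `assign_tasks`, the file name, the IP addresses, and the tracker names are all made up for illustration.

```python
class NameNode:
    """Toy name node: keeps only metadata, never the file contents."""

    def __init__(self):
        # file name -> list of data node addresses holding the data
        self.metadata = {}

    def handle_request(self, filename, data_nodes=None):
        if filename not in self.metadata:
            # New request: the metadata is updated.
            self.metadata[filename] = list(data_nodes or [])
        # Existing (or just registered) request: return the data locations.
        return self.metadata[filename]


def assign_tasks(locations, task_trackers):
    """Toy job tracker: one task per data location, given to the
    task tracker running on the node where that data resides."""
    return [(node, task_trackers.get(node)) for node in locations]


name_node = NameNode()
name_node.handle_request("sales.csv", ["10.0.0.1", "10.0.0.2"])  # new request
locations = name_node.handle_request("sales.csv")                 # existing request

trackers = {"10.0.0.1": "tt-1", "10.0.0.2": "tt-2"}  # node -> task tracker (hypothetical)
plan = assign_tasks(locations, trackers)
print(plan)  # [('10.0.0.1', 'tt-1'), ('10.0.0.2', 'tt-2')]
```

Note that the name node only answers "where is the data"; the data itself never passes through it in this sketch.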
Name node contains
- metadata
- location of the data
- IP addresses of the data nodes
secondary name node only backs up the name node metadata; it doesn't interact with the job tracker
Hadoop 1.x is also called master-slave architecture
master-slave architecture
job tracker and task trackers maintain RPC communication
RPC - remote procedure call
heartbeat interval by default - 3 seconds
every 3 seconds each task tracker sends a heartbeat to the job tracker, to signal that it is working and not down.
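A minimal sketch of heartbeat tracking, assuming a simplified job tracker that just remembers when it last heard from each task tracker. The 3-second interval comes from the notes; the `EXPIRY` timeout, the class, and the tracker names are hypothetical and are not Hadoop settings. Timestamps are passed in by hand so the example is deterministic.

```python
HEARTBEAT_INTERVAL = 3            # seconds, the default stated in the notes
EXPIRY = 10 * HEARTBEAT_INTERVAL  # hypothetical timeout, chosen for illustration only


class JobTracker:
    """Toy job tracker that records the last heartbeat of each task tracker."""

    def __init__(self):
        self.last_seen = {}  # task tracker id -> time of last heartbeat

    def heartbeat(self, tracker_id, now):
        self.last_seen[tracker_id] = now

    def live_trackers(self, now):
        # A tracker counts as alive if its last heartbeat is recent enough.
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen <= EXPIRY)


jt = JobTracker()
jt.heartbeat("tt-1", now=0)
jt.heartbeat("tt-2", now=0)

# tt-1 keeps sending a heartbeat every 3 seconds; tt-2 goes silent after t=0.
for t in range(HEARTBEAT_INTERVAL, 61, HEARTBEAT_INTERVAL):
    jt.heartbeat("tt-1", now=t)

print(jt.live_trackers(now=60))  # ['tt-1']
```

Once a task tracker drops out of the live list, its tasks can be handed to another task tracker (case 2 below).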
case 1: if a data node is down
since the data is replicated, the task can still be executed using another copy
case 2: if a task tracker is down
the task is assigned to another task tracker
case 3: if the secondary name node is down
job execution is not stopped, only the backup stops
case 4: if the job tracker is down
execution stops - SPOF (single point of failure)
case 5: if the name node is down
the secondary name node only keeps a backup and cannot take over processing, so this is also a SPOF
Hadoop 1.x has two drawbacks: if the name node or the job tracker is down, processing stops
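The five cases can be summarised as a small lookup table. This is a revision aid only, a sketch with made-up names (`OUTCOMES`, `job_survives`), not something Hadoop provides.

```python
# component that is down -> (what happens to the job, why), as listed in the cases
OUTCOMES = {
    "data node": ("continues", "replicated copies of the data are used"),
    "task tracker": ("continues", "the task is assigned to another task tracker"),
    "secondary name node": ("continues", "only the backup stops"),
    "job tracker": ("stops", "single point of failure"),
    "name node": ("stops", "single point of failure"),
}


def job_survives(failed_components):
    """True if the job keeps running when all given components are down."""
    return all(OUTCOMES[c][0] == "continues" for c in failed_components)


# The single points of failure are the components whose loss stops the job.
spofs = sorted(c for c, (status, _) in OUTCOMES.items() if status == "stops")
print(spofs)                                        # ['job tracker', 'name node']
print(job_survives(["data node", "task tracker"]))  # True
print(job_survives(["name node"]))                  # False
```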