class 2 Flashcards

1
Q

file system types

A

FS (file system)

DFS (distributed file system)

HDFS (Hadoop distributed file system)

2
Q

drawbacks of ‘FS’

A
  1. Storing large amounts of data
  2. Processing large amounts of data
  3. Data loss, caused by:
    - power failure
    - network failure
    - hardware or software failure
  4. No automatic metadata handling

DFS:

adds the multi-node concept

all the remaining drawbacks are the same as those of FS

3
Q

cluster

A

group of machines in a network

4
Q

Hadoop distributions

A

Cloudera

Hortonworks

IBM BigInsights

Pivotal HD

MapR

Cloud computing:

Amazon EMR

Google Cloud

Windows Azure

5
Q

Hadoop framework

A

Hadoop 1.x -> 2011

Hadoop 2.x -> YARN -> 2013

6
Q

Hadoop 1.x

A
  1. Name node -> stores the metadata
  2. Data node -> stores the actual data
  3. Secondary name node -> maintains a backup (checkpoint) of the name node's metadata
  4. Job tracker -> splits the job into tasks and assigns the tasks to task trackers
  5. Task tracker -> executes the tasks
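
A minimal sketch of the job tracker / task tracker split described above. `ToyJobTracker`, `ToyTaskTracker`, and the `lines_per_task` parameter are hypothetical names used only for illustration, not Hadoop classes, and the round-robin assignment is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaskTracker:
    """Hypothetical stand-in for a task tracker: it records and runs tasks."""
    name: str
    tasks: list = field(default_factory=list)

    def execute_all(self):
        # "Executing" a task here only means counting the words in its split.
        return [len(split.split()) for split in self.tasks]


class ToyJobTracker:
    """Hypothetical stand-in for the job tracker: splits a job, hands out tasks."""

    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, lines, lines_per_task=2):
        # Split the job (a list of text lines) into fixed-size tasks.
        splits = [
            " ".join(lines[i:i + lines_per_task])
            for i in range(0, len(lines), lines_per_task)
        ]
        # Assign tasks round-robin (an assumption made for this sketch).
        for index, split in enumerate(splits):
            self.trackers[index % len(self.trackers)].tasks.append(split)
        return len(splits)
```

With two task trackers and five input lines, `submit` creates three tasks: the first tracker receives the first and third, the second tracker receives the second.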
7
Q

client request process sequence

A
  1. the client sends a request to the name node (which stores the metadata)
  2. if it is a new request, the metadata is updated
  3. if it is an existing request, the location of the actual data (data node) is provided
  4. the secondary name node backs up the name node
  5. the job tracker splits the job into tasks and assigns them to task trackers
  6. task trackers are started where the data resides
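
The request sequence above can be sketched roughly as follows. `ToyNameNode`, `pick_task_tracker`, and the IP addresses in the usage note are hypothetical; the placement rule (rotate through the data nodes) is an assumption made for the sketch, not how HDFS actually chooses nodes. The replication factor of 3 matches the HDFS default.

```python
class ToyNameNode:
    """Hypothetical name node: keeps only metadata (file -> data node addresses)."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.metadata = {}  # file name -> list of data node addresses

    def handle_request(self, file_name):
        if file_name not in self.metadata:
            # New request: update the metadata with the chosen locations.
            start = len(self.metadata) % len(self.data_nodes)
            rotated = self.data_nodes[start:] + self.data_nodes[:start]
            self.metadata[file_name] = rotated[:self.replication]
            return "new", self.metadata[file_name]
        # Existing request: return where the actual data lives.
        return "existing", self.metadata[file_name]


def pick_task_tracker(locations, free_trackers):
    """Prefer a task tracker on a machine that already holds the data."""
    for address in locations:
        if address in free_trackers:
            return address
    # Fall back to the first free tracker when none is data-local.
    return free_trackers[0] if free_trackers else None
```

For example, if a file is stored on `10.0.0.1`, `10.0.0.2`, and `10.0.0.3` and the free task trackers are `["10.0.0.4", "10.0.0.2"]`, the sketch picks `10.0.0.2` because the data already resides there.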
8
Q

Name node contains

A
  1. metadata
  2. location of the data
  3. ip address of the data nodes

the secondary name node only backs up the name node's metadata; it doesn't interact with the job tracker

Hadoop 1.x is also called a master-slave architecture

9
Q

master slave architecture

A

the job tracker and task trackers maintain RPC communication

RPC - remote procedure call

default heartbeat interval - 3 seconds

every 3 seconds, each task tracker sends a heartbeat to the job tracker to signal that it is working and not down.

case 1: a data node is down

because the data is replicated, the task can still be executed using another copy

case 2: a task tracker is down

the task can be assigned to another task tracker

case 3: the secondary name node is down

job execution does not stop; only the backup stops

case 4: the job tracker is down

execution stops - SPOF (single point of failure)

case 5: the name node is down

the secondary name node is not used for processing, so it cannot take over - SPOF

Hadoop 1.x therefore has two drawbacks: if the name node or the job tracker is down, processing stops
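
A rough sketch of the heartbeat and failure cases above, seen from the job tracker's side. `ToyHeartbeatMonitor`, `job_can_continue`, and `MISSED_BEATS_ALLOWED` are hypothetical names; the rule "a tracker is down after two missed heartbeats" is an assumption for illustration, not a Hadoop setting. Timestamps are passed in explicitly so the example needs no real clock.

```python
HEARTBEAT_INTERVAL = 3    # seconds, the default stated in the card above
MISSED_BEATS_ALLOWED = 2  # hypothetical threshold, not a Hadoop setting


class ToyHeartbeatMonitor:
    """Hypothetical job-tracker-side view of task tracker heartbeats."""

    def __init__(self):
        self.last_seen = {}    # task tracker name -> time of last heartbeat
        self.assignments = {}  # task id -> task tracker name

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now

    def assign(self, task_id, tracker):
        self.assignments[task_id] = tracker

    def live_trackers(self, now):
        limit = HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
        return sorted(t for t, seen in self.last_seen.items() if now - seen <= limit)

    def reassign_lost_tasks(self, now):
        """Case 2: move tasks from silent task trackers to live ones."""
        live = self.live_trackers(now)
        moved = {}
        for task_id, tracker in sorted(self.assignments.items()):
            if tracker not in live and live:
                new_tracker = live[len(moved) % len(live)]
                self.assignments[task_id] = new_tracker
                moved[task_id] = new_tracker
        return moved


def job_can_continue(down):
    """Cases 1-5: only a name node or job tracker failure stops processing."""
    return down not in {"name node", "job tracker"}
```

If `tt1` last reported at time 0 and `tt2` at time 9, then at time 10 only `tt2` counts as live, so any task held by `tt1` is moved to `tt2`.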
