7. Big Data - Hadoop Ecosystem Flashcards

Question

Proximity factor steps

Answer 1

The idea is that the bandwidth available for each of the following scenarios becomes progressively less: Processes on the same node Different nodes on the same rack Nodes on different racks in the same data center Nodes in different data centers[32]

Answer 2

1. Hadoop’s default strategy is to place the **first replica** on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). 2. The **second replica** is placed on a different rack from the first (off-rack), chosen at random. 3. The **third replica** is placed on the same rack as the second, but on a different node chosen at random. 4. **Further replicas** are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack. Overall, this strategy gives a good balance among reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read performance (there’s a choice of two racks to read from), and block distribution across the cluster (clients only write a single block on the local rack).

Answer 3

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

Answer 4

YARN provides its core services via two types of long-running daemon: 1. **resource manager** (one per cluster) to manage the use of resources across the cluster 2. **node managers** running on all the nodes in the cluster to launch and monitor containers.

Answer 5

1. Each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media.

Answer 6

1. NameNode 2. Secondary NameNode 3. Resource Manager 4. History Server

Answer 7

For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the resource manager on a single master machine (as long as at least one copy of the namenode’s metadata is stored on a remote filesystem). However, as the cluster gets larger, there are good reasons to separate them.

Answer 8

If your cluster runs on a single rack, then there is nothing more to do, since this is the default. However, for multirack clusters, you need to: 1. map nodes to racks. This allows Hadoop to prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. HDFS will also be able to place replicas more intelligently to trade off performance and resilience.

Answer 9

1. Each fsimage file contains a serialized form of all the **directory** and **file inodes** in the filesystem. 2. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored.

Answer 10

1. Metadata backups (hdfs dfsadmin -fetchImage fsimage.backup) 2. Backup data 3. Filesystem check (It is advisable to run HDFS’s fsck tool regularly (i.e., daily) on the whole filesystem to proactively look for missing or corrupt blocks.) 4. Filesystem balancer (Run the balancer tool (see Balancer) regularly to keep the filesystem datanodes evenly balanced.)

Answer 11

Do not make the mistake of thinking that HDFS replication is a substitute for making backups. Bugs in HDFS can cause replicas to be lost, and so can hardware failures. Although Hadoop is expressly designed so that hardware failure is very unlikely to result in data loss, the possibility can never be completely ruled out, particularly when combined with software bugs or human error.

7. Big Data - Hadoop Ecosystem Flashcards

(36 cards)