7. Big Data - Hadoop Ecosystem Flashcards

1
Q

DataNode function

A

Store data

2
Q

NameNode function

A
  1. Store metadata
  2. Know which DataNodes each block is located on
3
Q

What was the volume of data in the world in 2013?

A

4.4 x 10^21 bytes (4.4 zettabytes)

4
Q

What is the problem with disks when processing large volumes of data?

A

A 1 TB disk reads at roughly 100 MB/s at most, so reading the whole disk takes more than two and a half hours.
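
The arithmetic behind this card can be checked with a short sketch. It assumes decimal units (1 TB = 10**12 bytes); the 100-disk case is an illustrative assumption, not part of the card, added to show why spreading data over many disks read in parallel helps.

```python
# Back-of-the-envelope estimate; decimal units (1 TB = 10**12 bytes) are assumed.
DISK_SIZE_BYTES = 10**12           # 1 TB
TRANSFER_RATE_BPS = 100 * 10**6    # 100 MB/s


def full_scan_seconds(size_bytes: int, rate_bytes_per_s: int, disks: int = 1) -> float:
    """Seconds to read size_bytes spread evenly over `disks` drives read in parallel."""
    return size_bytes / (rate_bytes_per_s * disks)


one_disk = full_scan_seconds(DISK_SIZE_BYTES, TRANSFER_RATE_BPS)
hundred_disks = full_scan_seconds(DISK_SIZE_BYTES, TRANSFER_RATE_BPS, disks=100)

print(f"1 disk:    {one_disk / 3600:.2f} hours")
print(f"100 disks: {hundred_disks / 60:.2f} minutes")
```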

5
Q

What is Hive?

A

SQL that runs on data stored in HDFS

6
Q

What is Spark?

A

Interactive in-memory processing

7
Q

Solr

A

Search over data stored in HDFS

8
Q

What is the data scientist's first step?

A

Define the question clearly: what you are trying to answer with the data.

9
Q

MapReduce phases

A
  1. Map - filters out what is needed
  2. Reduce - combines the data into a summary
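
The two phases can be illustrated with a minimal in-memory word count. This is a plain-Python sketch, not the Hadoop API: the function names and the explicit `shuffle` step (the grouping by key that the framework performs between the phases) are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single summary value."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data nodes store data"])
print(counts)
```
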
10
Q

MapReduce Jobs (components)

A

Input Data + MapReduce Program + Config Info

11
Q

What are distributed filesystems?

A

Filesystems that manage the storage across a network of machines are called distributed filesystems.

12
Q

What was HDFS designed with in mind?

A
  1. Very large files
  2. Streaming data access
  3. Commodity hardware
13
Q

In which situations is HDFS not a good fit?

A
  1. Low-latency data access
  2. Lots of small files
  3. Multiple writers, arbitrary file modifications
14
Q

What is the default HDFS “block” size?

A

128 MB
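
A small sketch of how a file maps onto blocks of this size. It assumes 128 MB means 128 * 1024 * 1024 bytes, and `split_into_blocks` is a hypothetical helper, not an HDFS call. The last block only holds the remaining bytes, so a file smaller than one block does not occupy a full block of storage.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size from the card


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


one_gb = 1024**3
blocks = split_into_blocks(one_gb + 1)
print(len(blocks), "blocks; last block holds", blocks[-1], "byte(s)")
```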

15
Q

To how many servers is a data “block” typically replicated?

A

Three servers

16
Q

What measures protect the NameNode?

A
  1. Back up the NameNode's metadata files.
  2. Run a “secondary NameNode”.
  3. Deploy in the “Active/Standby” configuration, available since version 2.
17
Q

What are the NameNode startup steps?

A

The namenode must (i) load its namespace image into memory, (ii) replay its edit log, and (iii) receive enough block reports from the datanodes to leave safe mode.
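
The three steps can be sketched as a toy model. Everything here is hypothetical and far simpler than the real NameNode: the image is a dict, the edit log is a list of `(op, path, meta)` tuples, and the 0.999 threshold for leaving safe mode is an assumed value used only for illustration.

```python
def start_namenode(fsimage, edit_log, block_reports, total_blocks, threshold=0.999):
    """Toy model of the three startup steps; not the real NameNode logic."""
    # (i) Load the namespace image (path -> metadata) into memory.
    namespace = dict(fsimage)

    # (ii) Replay the edit log on top of the image.
    for op, path, meta in edit_log:
        if op == "create":
            namespace[path] = meta
        elif op == "delete":
            namespace.pop(path, None)

    # (iii) Stay in safe mode until enough blocks have been reported.
    reported = set()
    for blocks in block_reports:
        reported.update(blocks)
    safe_mode = total_blocks > 0 and len(reported) / total_blocks < threshold
    return namespace, safe_mode


fsimage = {"/a.txt": {"blocks": ["b1"]}}
edits = [("create", "/b.txt", {"blocks": ["b2"]}), ("delete", "/a.txt", None)]
namespace, safe_mode = start_namenode(fsimage, edits, [{"b2"}], total_blocks=1)
print(namespace, safe_mode)
```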

18
Q

How long does the NameNode take to start on a large cluster?

A

On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.

19
Q

Who is responsible for managing failover between the active and standby NameNodes?

A

Failover controller. The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures (using a simple heartbeating mechanism) and trigger a failover should a namenode fail.

20
Q

Which property sets the number of DataNodes a “block” is replicated to?

A

dfs.replication = x

21
Q

What command shows help for the HDFS commands?

A

hadoop fs -help

22
Q

Which property enables HDFS security (permission checking)?

A

dfs.permissions.enabled

23
Q

How does opening a file in HDFS work?

A

The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster’s network; see Network Topology and Hadoop). If the client is itself a datanode (in the case of a MapReduce task, for instance), the client will read from the local datanode if that datanode hosts a copy of the block (see also Figure 2-2 and Short-circuit local reads).

24
Q

What does it mean for two nodes in a local network to be “close” to each other?

A

Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node that a process is running on.
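
The distance rule above can be sketched in a few lines. The `/datacenter/rack/node` path notation and the `distance` function name are assumptions made for illustration; this is not Hadoop's own implementation.

```python
def distance(a: str, b: str) -> int:
    """Tree distance between two locations written as /datacenter/rack/node paths."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    # Count the leading path components the two locations share.
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Sum of each location's distance to the closest common ancestor.
    return (len(pa) - common) + (len(pb) - common)


print(distance("/d1/r1/n1", "/d1/r1/n1"))  # same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # same data center, different rack
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # different data centers
```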

25
Q

Proximity factor steps

A

The idea is that the bandwidth available for each of the following scenarios becomes progressively less:

  1. Processes on the same node
  2. Different nodes on the same rack
  3. Nodes on different racks in the same data center
  4. Nodes in different data centers

26
Q

How does the namenode choose which datanodes to store replicas on?

A
  1. Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).
  2. The second replica is placed on a different rack from the first (off-rack), chosen at random.
  3. The third replica is placed on the same rack as the second, but on a different node chosen at random.
  4. Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.

Overall, this strategy gives a good balance among reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read performance (there’s a choice of two racks to read from), and block distribution across the cluster (clients only write a single block on the local rack).
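
A simplified sketch of the placement steps above. It assumes a hypothetical `topology` dict (rack name to node names) with at least two racks of at least two nodes each, and it leaves out the "too full or too busy" checks and the effort to avoid putting too many extra replicas on one rack.

```python
import random


def place_replicas(client, topology, replication=3, rng=random):
    """Pick nodes for one block. topology maps rack name -> list of node names."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    all_nodes = list(rack_of)

    # 1. First replica: the client's own node if it is in the cluster, else a random node.
    first = client if client in rack_of else rng.choice(all_nodes)
    chosen = [first]

    if replication >= 2:
        # 2. Second replica: a random node on a different rack from the first.
        off_rack = [n for n in all_nodes if rack_of[n] != rack_of[first]]
        second = rng.choice(off_rack)
        chosen.append(second)

    if replication >= 3:
        # 3. Third replica: same rack as the second, but a different node.
        same_rack = [n for n in topology[rack_of[second]] if n != second]
        chosen.append(rng.choice(same_rack))

    # 4. Further replicas: random nodes that have not been used yet.
    while len(chosen) < replication:
        remaining = [n for n in all_nodes if n not in chosen]
        chosen.append(rng.choice(remaining))
    return chosen


topology = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
replicas = place_replicas("n1", topology)
print(replicas)
```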

27
Q

What is YARN?

A

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

28
Q

What services does YARN provide?

A

YARN provides its core services via two types of long-running daemon:

  1. resource manager (one per cluster) to manage the use of resources across the cluster
  2. node managers running on all the nodes in the cluster to launch and monitor containers.
29
Q

How does YARN run applications?

A
30
Q

How does HDFS detect file corruption?

A
  1. Each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media.
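
Verification of this kind relies on comparing a block's data against checksums stored when it was written. A minimal sketch of that idea follows; `zlib.crc32` and the 512-byte chunk size are stand-ins chosen for illustration and may not match HDFS's exact checksum variant or settings.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum (assumed here)


def compute_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return one CRC-32 value per chunk of the block's data."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data: bytes, stored: list[int], chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return the indexes of chunks whose current checksum differs from the stored one."""
    current = compute_checksums(data, chunk_size)
    return [i for i, (now, then) in enumerate(zip(current, stored)) if now != then]


block = bytes(range(256)) * 8        # 2048 bytes -> 4 chunks
stored = compute_checksums(block)    # kept alongside the block at write time

damaged = bytearray(block)
damaged[700] ^= 0x01                 # simulate bit rot: flip one bit in chunk 1
print(find_corrupt_chunks(bytes(damaged), stored))
```
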
31
Q

What are the master daemons in Hadoop?

A
  1. NameNode
  2. Secondary NameNode
  3. Resource Manager
  4. History Server
32
Q

Is it acceptable to run the NameNode and Resource Manager on the same machine?

A

For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the resource manager on a single master machine (as long as at least one copy of the namenode’s metadata is stored on a remote filesystem). However, as the cluster gets larger, there are good reasons to separate them.

33
Q

What do you need to do when your DataNodes are installed in different racks?

A

If your cluster runs on a single rack, then there is nothing more to do, since this is the default. However, for multirack clusters, you need to:

  1. map nodes to racks. This allows Hadoop to prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. HDFS will also be able to place replicas more intelligently to trade off performance and resilience.
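
One common way to provide this mapping is a topology script that Hadoop calls with node addresses and that prints one rack path per address. The sketch below assumes that contract (to my knowledge the script is configured through the `net.topology.script.file.name` property, but verify this for your version); the `RACKS` table and its addresses are hypothetical.

```python
import sys

# Hypothetical address-to-rack table; a real script would use the site's own inventory.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve(addresses):
    """Return one rack path per address, in the same order as the input."""
    return [RACKS.get(addr, DEFAULT_RACK) for addr in addresses]


if __name__ == "__main__":
    # Addresses arrive as command-line arguments; rack paths go to stdout.
    print(" ".join(resolve(sys.argv[1:])))
```
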
34
Q

What is the fsimage file in HDFS?

A
  1. Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem.
  2. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored.
35
Q

List common maintenance procedures

A
  1. Metadata backups (hdfs dfsadmin -fetchImage fsimage.backup)
  2. Backup data
  3. Filesystem check: it is advisable to run HDFS’s fsck tool regularly (i.e., daily) on the whole filesystem to proactively look for missing or corrupt blocks.
  4. Filesystem balancer: run the balancer tool (see Balancer) regularly to keep the filesystem datanodes evenly balanced.
36
Q

Is it necessary to perform backup of DataNode data?

A

Do not make the mistake of thinking that HDFS replication is a substitute for making backups. Bugs in HDFS can cause replicas to be lost, and so can hardware failures. Although Hadoop is expressly designed so that hardware failure is very unlikely to result in data loss, the possibility can never be completely ruled out, particularly when combined with software bugs or human error.