Hadoop Ecosystem: Fundamentals of Distributed Systems (continued) Flashcards

1
Q

What is Hadoop Distributed File System (HDFS)?

A

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop for storing large datasets across clusters of commodity hardware.
It is designed for reliability, scalability, and fault tolerance, with data distributed across multiple nodes in the cluster.

2
Q

What is MapReduce in Hadoop?

A

MapReduce is a programming model and processing engine used by Hadoop for distributed data processing.
It consists of two main phases: the Map phase for processing input data in parallel across multiple nodes, and the Reduce phase for aggregating and processing intermediate results.
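As a rough illustration of the two phases, here is a minimal word-count Mapper and Reducer written against the org.apache.hadoop.mapreduce API; the class and variable names are illustrative, not taken from the card.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in each input line, in parallel across splits.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: aggregate the intermediate counts emitted for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}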

3
Q

What is YARN in Hadoop?

A

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling framework in Hadoop.
It decouples resource management from job scheduling, allowing multiple data processing engines (such as MapReduce, Apache Spark, and Apache Flink) to run on the same cluster.

4
Q

What is the Hadoop ecosystem?

A

The Hadoop ecosystem refers to the collection of open-source projects and tools built around the Hadoop core components, including HDFS, MapReduce, and YARN.
It includes projects for data ingestion, processing, storage, querying, and visualization, such as Apache Hive, Apache HBase, Apache Spark, Apache Pig, and Apache Kafka.

5
Q

What are the key components of the Hadoop architecture?

A

The key components of the Hadoop architecture include HDFS for storage, MapReduce for processing, and YARN for resource management.
Additional components in the Hadoop ecosystem provide functionalities such as SQL querying (Apache Hive), real-time data streaming and messaging (Apache Kafka), NoSQL databases (Apache HBase), and machine learning (Apache Spark MLlib).

6
Q

What are nodes in Hadoop Distributed File System (HDFS)?

A

Nodes in HDFS refer to the individual machines that make up the Hadoop cluster.
Each node typically consists of commodity hardware and serves a specific role in the Hadoop ecosystem.

7
Q

What is the NameNode in HDFS?

A

The NameNode is the master node in HDFS that manages the file system namespace and controls access to files by clients.
It stores metadata about files and directories, such as file permissions, ownership, and block locations.

8
Q

What is a DataNode in HDFS?

A

DataNodes are worker nodes in HDFS responsible for storing and managing the actual data blocks of files.
They store data on the local filesystem and communicate with the NameNode to report storage capacity and block status.

9
Q

What is the Secondary NameNode in HDFS?

A

The Secondary NameNode is a helper node in HDFS that performs periodic checkpoints by merging the NameNode’s FSImage with its Edit Log.
It does not act as a backup for the NameNode but helps in reducing the time taken for NameNode recovery in case of failure.

10
Q

How are data blocks distributed across DataNodes in HDFS?

A

Data blocks are replicated and distributed across multiple DataNodes in the HDFS cluster to ensure fault tolerance and high availability.
The replication factor determines the number of copies of each block, and replicas are spread across different racks so that data survives the loss of an entire rack, while the placement policy still keeps most transfers within a rack to limit network traffic.

11
Q

What is the Edit Log in Hadoop’s HDFS?

A

The Edit Log is a file in Hadoop’s HDFS that records all modifications made to the file system namespace metadata, such as file creations, deletions, and modifications.
It acts as a persistent transaction log, allowing the system to recover the file system’s state in the event of a failure.

12
Q

What is the FSImage in Hadoop’s HDFS?

A

The FSImage is a snapshot of the file system namespace metadata at a particular point in time in Hadoop’s HDFS.
It contains information about the directory structure, file permissions, ownership, and the mapping of files to blocks; block-to-DataNode locations are not persisted in the FSImage but are rebuilt from DataNode block reports.

13
Q

How are Edit Log and FSImage used in checkpointing?

A

In Hadoop’s HDFS, the Edit Log and FSImage are used together in a process called checkpointing.
During checkpointing, the Edit Log is merged into the FSImage to produce a new, up-to-date FSImage, and the Edit Log is then truncated (rolled) so that it only records transactions made after the checkpoint.

14
Q

How are Edit Log and FSImage used in recovery?

A

In the event of a failure, Hadoop’s HDFS uses the FSImage and the Edit Log to recover the file system’s state.
The FSImage provides a consistent snapshot of the file system metadata, while the Edit Log contains a record of all transactions since the last checkpoint.

15
Q

What is the impact of Edit Log and FSImage on HDFS performance?

A

The frequent writing of transactions to the Edit Log can impact HDFS performance due to disk I/O.
However, periodic checkpointing and optimizations in the way transactions are written can help mitigate this impact.

16
Q

What is the NameNode in Hadoop’s HDFS?

A

The NameNode is a critical component of Hadoop’s HDFS: a service responsible for managing the metadata of the file system.
It maintains the namespace hierarchy, file permissions, and the mapping of data blocks to DataNodes.
In a basic deployment the NameNode is a single point of failure; HDFS High Availability addresses this with standby NameNodes and automatic failover.

17
Q

How does a write occur in HDFS?

A

A write in HDFS begins with the client sending a request to the NameNode to create the file and allocate blocks for it; the file data itself is never routed through the NameNode.

18
Q

What happens after the client sends a write request to the NameNode in HDFS?

A

Upon receiving the write request, the NameNode determines the list of DataNodes where the data blocks will be stored.
It allocates a list of suitable DataNodes for the data blocks based on factors such as availability, proximity, and load.

19
Q

How does the client write data to HDFS after block allocation?

A

Once the block allocation is complete, the client opens a write pipeline through the chosen DataNodes, connecting directly to the first DataNode in the pipeline.
It streams the data to that DataNode as a sequence of data packets.

20
Q

What happens to the data blocks after they are written to the primary DataNodes?

A

As each DataNode in the pipeline receives a data packet, it stores it locally and forwards it to the next DataNode, until the block has been copied to as many DataNodes as the configured replication factor requires.
Replication ensures fault tolerance and data availability by storing multiple copies of each block on different nodes in the cluster.

21
Q

How does the client know the write operation was successful in HDFS?

A

After successfully writing all data blocks to the primary and replicated DataNodes, the client receives acknowledgment messages from the DataNodes.
Once the client receives acknowledgments from a sufficient number of DataNodes, it considers the write operation successful.
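A minimal sketch of this write path from the client’s side, using Hadoop’s Java FileSystem API; the path /tmp/example.txt is only an illustrative placeholder, and the configuration is assumed to be picked up from the cluster’s core-site.xml and hdfs-site.xml.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to create the file and allocate blocks; the bytes
        // themselves are streamed through a pipeline of DataNodes, never via the NameNode.
        Path file = new Path("/tmp/example.txt");
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }   // close() waits for the pipeline acknowledgments and finalizes the file
        fs.close();
    }
}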

22
Q

What is YARN in Hadoop 2.x architecture?

A

YARN is a resource management and job scheduling framework introduced in Hadoop 2.x.
It decouples the resource management and job scheduling functionalities of Hadoop MapReduce, allowing multiple data processing frameworks to run on the same Hadoop cluster.

23
Q

What is the role of ResourceManager in Hadoop 2.x architecture?

A

The ResourceManager is the master daemon in YARN responsible for managing and allocating cluster resources to various applications.
It maintains information about available cluster resources, node health, and application resource requests.

24
Q

What is the role of NodeManager in Hadoop 2.x architecture?

A

NodeManagers are worker daemons in YARN responsible for managing resources on individual nodes in the cluster.
They report node resource utilization and health to the ResourceManager and manage the execution of application containers on their respective nodes.

25
Q

What is the ApplicationMaster in Hadoop 2.x architecture?

A

The ApplicationMaster is a per-application, framework-specific process responsible for negotiating resources from the ResourceManager and managing the execution of application tasks.
Each application running on the cluster has its own ApplicationMaster, which coordinates the execution of tasks across multiple containers.

26
Q

What are some data processing frameworks supported by Hadoop 2.x architecture?

A

Hadoop 2.x architecture supports multiple data processing frameworks, including MapReduce, Apache Spark, Apache Flink, Apache Tez, and others.
These frameworks can run on top of YARN, utilizing its resource management capabilities for efficient and flexible data processing.
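As a hedged sketch of how one such framework runs on YARN, the driver below submits a MapReduce job with the standard Job API; on a Hadoop 2.x+ cluster the submission goes to the ResourceManager, which launches a per-job ApplicationMaster that requests containers for the tasks. The class names, and the mapper and reducer from the card-2 sketch, are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On a YARN cluster, waitForCompletion() submits the job to the ResourceManager,
        // which starts an MRAppMaster to coordinate the map and reduce tasks in containers.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);   // mapper from the earlier sketch
        job.setReducerClass(SumReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}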

27
Q

What is ZooKeeper?

A

ZooKeeper is a distributed coordination service that maintains configuration information and naming, and provides distributed synchronization and group services.
It is designed to be highly available, fault-tolerant, and scalable.

28
Q

What are some common use cases for ZooKeeper?

A

ZooKeeper is commonly used for distributed systems coordination, such as leader election, distributed locking, configuration management, and maintaining consistent metadata.
It is also used in distributed messaging systems, distributed databases, and other distributed applications requiring coordination and synchronization.
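A minimal leader-election sketch using ephemeral sequential znodes, one of the standard ZooKeeper recipes; the /election path, connection string, and class name are illustrative assumptions.

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Make sure the election parent znode exists (another candidate may create it first).
        if (zk.exists("/election", false) == null) {
            try {
                zk.create("/election", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) { }
        }

        // Each candidate creates an ephemeral sequential znode such as /election/n_0000000003;
        // the znode disappears automatically if the candidate's session dies.
        String myNode = zk.create("/election/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate holding the znode with the smallest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        System.out.println(myNode.endsWith(children.get(0))
                ? "This process is the leader"
                : "Following " + children.get(0));

        zk.close();
    }
}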

29
Q

How does ZooKeeper achieve its goals?

A

ZooKeeper exposes a hierarchical namespace of data nodes called znodes, which are similar to files and directories in a file system.
It uses a consensus protocol, ZAB (ZooKeeper Atomic Broadcast), to maintain consistency and coordination among distributed nodes.
A ZooKeeper ensemble consists of multiple ZooKeeper servers forming a quorum for fault tolerance and high availability.

30
Q

What is a quorum in ZooKeeper?

A

A quorum in ZooKeeper refers to a majority of nodes in the ZooKeeper ensemble.
A quorum of nodes must agree on changes to the state of the system for operations to proceed, ensuring consistency and fault tolerance.

31
Q

How do applications interact with ZooKeeper?

A

Applications interact with ZooKeeper using its client API, which provides primitives for creating, reading, updating, and deleting ZNodes.
ZooKeeper clients establish connections to one or more ZooKeeper servers in the ensemble and receive notifications about changes to the ZNodes they are interested in.
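A small sketch of the Java client API in action, covering create, read (with a watch), update, and delete; the ensemble addresses and znode path are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an ensemble; the watcher callback receives session and znode events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                event -> System.out.println("event: " + event));

        // Create, read (setting a watch), update, and delete a znode.
        zk.create("/app_config", "v1".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        Stat stat = new Stat();
        byte[] data = zk.getData("/app_config", true, stat);   // true = set a watch
        zk.setData("/app_config", "v2".getBytes(StandardCharsets.UTF_8), stat.getVersion());
        zk.delete("/app_config", -1);                           // -1 = any version

        zk.close();
    }
}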

32
Q

How does ZooKeeper handle authentication?

A

ZooKeeper supports various authentication mechanisms, including Kerberos, digest, and SSL/TLS.
Clients can authenticate to ZooKeeper servers using one of these mechanisms, and znodes protected by ACLs then grant access based on the authenticated identity.

33
Q

What is authorization in ZooKeeper?

A

Authorization in ZooKeeper involves controlling access to ZNodes based on user roles and permissions.
ZooKeeper uses an Access Control List (ACL) model to define permissions for each ZNode, specifying which users or groups have read, write, and administration privileges.

34
Q

How are ACLs used in ZooKeeper?

A

ACLs are attached to each ZNode in ZooKeeper to specify who has access to that node and what operations they are allowed to perform.
ACLs consist of a list of permissions (e.g., read, write, create, delete) associated with individual users or groups.
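A hedged sketch of attaching an ACL at creation time with the digest scheme, using the stock Java client; the username, password, and znode path are made up for illustration.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkAclSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Authenticate this session with the digest scheme (username:password).
        zk.addAuthInfo("digest", "alice:secret".getBytes(StandardCharsets.UTF_8));

        // CREATOR_ALL_ACL grants all permissions only to the authenticated creator;
        // other sessions get a NoAuthException when they try to touch this znode.
        zk.create("/secure_node", "data".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);

        zk.close();
    }
}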

35
Q

How does ZooKeeper ensure data privacy?

A

ZooKeeper supports encryption of data in transit using SSL/TLS protocols.
By enabling encryption, ZooKeeper ensures that data exchanged between clients and servers is encrypted, preventing eavesdropping and tampering.

36
Q

How does ZooKeeper ensure secure communication between clients and servers?

A

ZooKeeper uses SSL/TLS to encrypt communication between clients and servers, providing confidentiality and integrity.
Clients can configure ZooKeeper to use SSL/TLS for secure communication by enabling encryption in the client configuration.

37
Q

What is Active-Passive High Availability architecture?

A

In an Active-Passive HA architecture, there are two or more redundant systems, but only one is active at a time.
The passive systems remain in standby mode, ready to take over if the active system fails.
Failover to a passive system is triggered either manually or by automated failure detection.

38
Q

What is Active-Active High Availability architecture?

A

In an Active-Active HA architecture, multiple redundant systems are active and serving traffic simultaneously.
Traffic is distributed among the active systems, allowing for load balancing and scalability.
Each system is capable of handling the entire workload independently, providing fault tolerance and high availability.

39
Q

What is N+1 Redundancy in High Availability architecture?

A

N+1 Redundancy refers to having one extra redundant system or component beyond what is necessary to handle the expected load.
If one system fails, the spare redundant system takes over, maintaining service availability without interruption.
N+1 redundancy thus provides fault tolerance at the cost of only one extra system.

40
Q

How does Active-Standby High Availability architecture work?

A

In an Active-Standby HA architecture, there is one active system serving traffic and one or more standby systems in standby mode.
The standby systems are ready to take over if the active system fails, but they do not serve traffic under normal conditions.
Failover to the standby system can be manual or automatic, depending on the configuration and requirements.

41
Q

How does Load Balancing contribute to High Availability?

A

Load Balancing distributes incoming traffic across multiple servers or resources to ensure optimal resource utilization and prevent overload.
By distributing traffic among redundant systems, load balancing enhances fault tolerance and improves system reliability and availability.

42
Q

What is erasure coding?

A

Erasure coding is a method of data protection and redundancy in which data is divided into fragments, expanded, and encoded with redundant information.
This redundancy allows the original data to be reconstructed even if some fragments are lost or corrupted.

43
Q

How does erasure coding provide redundancy?

A

Erasure coding generates additional redundant fragments, known as parity fragments, alongside the original data fragments.
These redundant fragments contain information that enables the reconstruction of the original data, even if some fragments are missing or damaged.
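A toy illustration of the idea using a single XOR parity fragment: production systems such as HDFS erasure coding use Reed-Solomon codes over many fragments, but the recovery principle is the same. The fragment contents here are arbitrary examples.

import java.util.Arrays;

// One parity fragment computed as the XOR of two data fragments: if either data
// fragment is lost, it can be rebuilt from the survivor and the parity.
public class XorParityDemo {
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    public static void main(String[] args) {
        byte[] d1 = "HELLO".getBytes();
        byte[] d2 = "WORLD".getBytes();
        byte[] parity = xor(d1, d2);                 // stored alongside d1 and d2

        // Suppose fragment d1 is lost: rebuild it from d2 and the parity fragment.
        byte[] recovered = xor(parity, d2);
        System.out.println(new String(recovered));    // prints HELLO
        System.out.println(Arrays.equals(recovered, d1));
    }
}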

44
Q

How does erasure coding contribute to fault tolerance?

A

Erasure coding enhances fault tolerance by allowing data to be reconstructed from remaining fragments in the event of fragment loss or corruption.
The redundancy introduced by erasure coding reduces the likelihood of data loss or unavailability due to hardware failures or data corruption.

45
Q

How does erasure coding affect storage efficiency?

A

Erasure coding typically requires far less storage overhead than traditional replication-based redundancy schemes for a comparable level of protection.
For example, a Reed-Solomon (6,3) layout stores 6 data fragments plus 3 parity fragments (1.5x overhead) and tolerates the loss of any 3 fragments, whereas 3x replication needs 3x the storage to tolerate the loss of any 2 copies.

46
Q

What are some common use cases for erasure coding?

A

Erasure coding is commonly used in distributed storage systems, cloud storage, object storage, and archival systems where data durability and reliability are essential.
It is particularly well-suited for large-scale storage deployments requiring efficient use of storage resources and high fault tolerance.

47
Q

How does Hadoop Distributed File System (HDFS) distribute data across the cluster?

A

HDFS divides large files into fixed-size blocks, typically 128 MB or 256 MB in size.
Each block is replicated across multiple DataNodes in the cluster to ensure fault tolerance and data availability.

48
Q

What determines the number of replicas created for each data block in HDFS?

A

The replication factor is a configurable parameter in HDFS that determines the number of replicas created for each data block.
The default replication factor is usually 3, meaning that each block is replicated three times across the cluster.
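A small sketch showing how a client can inspect and override the replication factor of a single file with the Java FileSystem API; the file path is an illustrative assumption, and the cluster-wide default comes from the dfs.replication setting.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");    // hypothetical existing file

        // Read the file's current replication factor from its metadata.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Raise the replication factor of this one file to 5; the NameNode will
        // schedule the extra copies on additional DataNodes.
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}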

49
Q

How does HDFS decide which DataNodes to replicate data blocks to?

A

HDFS aims to distribute data blocks across different racks and DataNodes to improve fault tolerance, while its placement policy keeps most transfers within a rack to limit network traffic.
The replication algorithm takes into account the proximity of DataNodes to optimize data placement and ensure data locality.

50
Q

What is Rack Awareness in HDFS?

A

Rack Awareness is a feature of HDFS that allows it to be aware of the network topology, including racks and DataNodes’ locations within the racks.
HDFS uses Rack Awareness to ensure that replicas of data blocks are stored on different racks for fault tolerance and network efficiency.

51
Q

How does HDFS assign data blocks to DataNodes?

A

When a client writes data to HDFS, the NameNode selects a set of DataNodes to store replicas of the data blocks.
The selection considers factors such as DataNode availability, rack proximity, and the desired replication factor.

52
Q

What are racks in Hadoop Distributed File System (HDFS) topology?

A

Racks are physical cabinets of machines, typically sharing a top-of-rack network switch, that house multiple DataNodes in a Hadoop cluster.
Racks are the basic unit of network topology in HDFS, and DataNodes within the same rack are typically connected by high-speed local network links.

53
Q

What are DataNodes in HDFS topology?

A

DataNodes are individual machines in a Hadoop cluster that store data and perform data processing tasks.
DataNodes are organized into racks based on their physical location in the data center.

54
Q

Where does the NameNode reside in HDFS topology?

A

The NameNode, which manages the metadata of the file system, typically runs on a dedicated master node rather than on the DataNodes themselves.
This separation helps keep the metadata service available even when individual DataNode racks go offline.

55
Q

How does HDFS organize its network topology?

A

HDFS organizes its network topology into a hierarchical structure consisting of racks, switches, and DataNodes.
DataNodes within the same rack are considered close, while those in different racks are considered distant.

56
Q

How does HDFS topology contribute to data locality?

A

Hadoop places an emphasis on data locality: HDFS exposes block locations so that processing frameworks can schedule computation tasks on the DataNodes that already hold the data they need.
By processing data locally, the cluster minimizes network traffic and improves performance.
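A brief sketch of how a client or framework can ask HDFS where a file’s blocks live, which is the information schedulers use to place tasks next to their data; the file path is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");   // hypothetical existing file
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}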