Hadoop Ecosystem: Fundamentals of Distributed Systems (continued) Flashcards

1
Q

What is Hadoop Distributed File System (HDFS)?

A

Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop for storing large datasets across clusters of commodity hardware.
It is designed for reliability, scalability, and fault tolerance, with data distributed across multiple nodes in the cluster.

2
Q

What is MapReduce in Hadoop?

A

MapReduce is a programming model and processing engine used by Hadoop for distributed data processing.
It consists of two main phases: the Map phase for processing input data in parallel across multiple nodes, and the Reduce phase for aggregating and processing intermediate results.
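As a rough illustration of the two phases, here is a minimal word-count Mapper and Reducer written against the org.apache.hadoop.mapreduce API; the class and variable names are illustrative, not taken from the card.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in each input line, in parallel across splits.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: aggregate the intermediate counts emitted for each word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}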

3
Q

What is YARN in Hadoop?

A

YARN (Yet Another Resource Negotiator) is the resource management and job scheduling framework in Hadoop.
It decouples resource management from job scheduling, allowing multiple data processing engines (such as MapReduce, Apache Spark, and Apache Flink) to run on the same cluster.

4
Q

What is the Hadoop ecosystem?

A

The Hadoop ecosystem refers to the collection of open-source projects and tools built around the Hadoop core components, including HDFS, MapReduce, and YARN.
It includes projects for data ingestion, processing, storage, querying, and visualization, such as Apache Hive, Apache HBase, Apache Spark, Apache Pig, and Apache Kafka.

5
Q

What are the key components of the Hadoop architecture?

A

The key components of the Hadoop architecture include HDFS for storage, MapReduce for processing, and YARN for resource management.
Additional components in the Hadoop ecosystem provide functionalities such as SQL querying (Apache Hive), real-time data streaming and messaging (Apache Kafka), NoSQL databases (Apache HBase), and machine learning (Apache Spark MLlib).

6
Q

What are nodes in Hadoop Distributed File System (HDFS)?

A

Nodes in HDFS refer to the individual machines that make up the Hadoop cluster.
Each node typically consists of commodity hardware and serves a specific role in the Hadoop ecosystem.

7
Q

What is the NameNode in HDFS?

A

The NameNode is the master node in HDFS that manages the file system namespace and controls access to files by clients.
It stores metadata about files and directories, such as file permissions, ownership, and block locations.

8
Q

What is a DataNode in HDFS?

A

DataNodes are worker nodes in HDFS responsible for storing and managing the actual data blocks of files.
They store data on the local filesystem and communicate with the NameNode to report storage capacity and block status.

9
Q

What is the Secondary NameNode in HDFS?

A

The Secondary NameNode is a helper node in HDFS that performs periodic checkpoints by merging the NameNode’s FSImage with its Edit Log.
It does not act as a backup for the NameNode but helps in reducing the time taken for NameNode recovery in case of failure.

10
Q

How are data blocks distributed across DataNodes in HDFS?

A

Data blocks are replicated and distributed across multiple DataNodes in the HDFS cluster to ensure fault tolerance and high availability.
The replication factor determines the number of copies of each block, and replicas are spread across different racks so that data survives the loss of an entire rack, while the placement policy still keeps most transfers within a rack to limit network traffic.

11
Q

What is the Edit Log in Hadoop’s HDFS?

A

The Edit Log is a file in Hadoop’s HDFS that records all modifications made to the file system namespace metadata, such as file creations, deletions, and modifications.
It acts as a persistent transaction log, allowing the system to recover the file system’s state in the event of a failure.

12
Q

What is the FSImage in Hadoop’s HDFS?

A

The FSImage is a snapshot of the file system namespace metadata at a particular point in time in Hadoop’s HDFS.
It contains information about the directory structure, file permissions, ownership, and the mapping of files to blocks; block-to-DataNode locations are not persisted in the FSImage but are rebuilt from DataNode block reports.

13
Q

How are Edit Log and FSImage used in checkpointing?

A

In Hadoop’s HDFS, the Edit Log and FSImage are used together in a process called checkpointing.
During checkpointing, the Edit Log is merged into the FSImage to produce a new, up-to-date FSImage, and the Edit Log is then truncated (rolled) so that it only records transactions made after the checkpoint.

14
Q

How are Edit Log and FSImage used in recovery?

A

In the event of a failure, Hadoop’s HDFS uses the FSImage and the Edit Log to recover the file system’s state.
The FSImage provides a consistent snapshot of the file system metadata, while the Edit Log contains a record of all transactions since the last checkpoint.

15
Q

What is the impact of Edit Log and FSImage on HDFS performance?

A

The frequent writing of transactions to the Edit Log can impact HDFS performance due to disk I/O.
However, periodic checkpointing and optimizations in the way transactions are written can help mitigate this impact.

16
Q

What is the NameNode in Hadoop’s HDFS?

A

The NameNode is a critical component of Hadoop’s HDFS: a service responsible for managing the metadata of the file system.
It maintains the namespace hierarchy, file permissions, and the mapping of data blocks to DataNodes.
In a basic deployment the NameNode is a single point of failure; HDFS High Availability addresses this with standby NameNodes and automatic failover.

17
Q

How does a write occur in HDFS?

A

A write in HDFS begins with the client sending a request to the NameNode to create the file and allocate blocks for it; the file data itself is never routed through the NameNode.

18
Q

What happens after the client sends a write request to the NameNode in HDFS?

A

Upon receiving the write request, the NameNode determines the list of DataNodes where the data blocks will be stored.
It allocates a list of suitable DataNodes for the data blocks based on factors such as availability, proximity, and load.

19
Q

How does the client write data to HDFS after block allocation?

A

Once the block allocation is complete, the client opens a write pipeline through the chosen DataNodes, connecting directly to the first DataNode in the pipeline.
It streams the data to that DataNode as a sequence of data packets.

20
Q

What happens to the data blocks after they are written to the primary DataNodes?

A

As each DataNode in the pipeline receives a data packet, it stores it locally and forwards it to the next DataNode, until the block has been copied to as many DataNodes as the configured replication factor requires.
Replication ensures fault tolerance and data availability by storing multiple copies of each block on different nodes in the cluster.

21
Q

How does the client know the write operation was successful in HDFS?

A

After successfully writing all data blocks to the primary and replicated DataNodes, the client receives acknowledgment messages from the DataNodes.
Once the client receives acknowledgments from a sufficient number of DataNodes, it considers the write operation successful.
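A minimal sketch of this write path from the client’s side, using Hadoop’s Java FileSystem API; the path /tmp/example.txt is only an illustrative placeholder, and the configuration is assumed to be picked up from the cluster’s core-site.xml and hdfs-site.xml.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to create the file and allocate blocks; the bytes
        // themselves are streamed through a pipeline of DataNodes, never via the NameNode.
        Path file = new Path("/tmp/example.txt");
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }   // close() waits for the pipeline acknowledgments and finalizes the file
        fs.close();
    }
}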

22
Q

What is YARN in Hadoop 2.x architecture?

A

YARN is a resource management and job scheduling framework introduced in Hadoop 2.x.
It decouples the resource management and job scheduling functionalities of Hadoop MapReduce, allowing multiple data processing frameworks to run on the same Hadoop cluster.

23
Q

What is the role of ResourceManager in Hadoop 2.x architecture?

A

The ResourceManager is the master daemon in YARN responsible for managing and allocating cluster resources to various applications.
It maintains information about available cluster resources, node health, and application resource requests.

24
Q

What is the role of NodeManager in Hadoop 2.x architecture?

A

NodeManagers are worker daemons in YARN responsible for managing resources on individual nodes in the cluster.
They report node resource utilization and health to the ResourceManager and manage the execution of application containers on their respective nodes.

25
Q

What is the ApplicationMaster in Hadoop 2.x architecture?

A

The ApplicationMaster is a per-application, framework-specific process responsible for negotiating resources from the ResourceManager and managing the execution of application tasks.
Each application running on the cluster has its own ApplicationMaster, which coordinates the execution of tasks across multiple containers.

26
Q

What are some data processing frameworks supported by Hadoop 2.x architecture?

A

Hadoop 2.x architecture supports multiple data processing frameworks, including MapReduce, Apache Spark, Apache Flink, Apache Tez, and others.
These frameworks can run on top of YARN, utilizing its resource management capabilities for efficient and flexible data processing.
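As a hedged sketch of how one such framework runs on YARN, the driver below submits a MapReduce job with the standard Job API; on a Hadoop 2.x+ cluster the submission goes to the ResourceManager, which launches a per-job ApplicationMaster that requests containers for the tasks. The class names, and the mapper and reducer from the card-2 sketch, are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On a YARN cluster, waitForCompletion() submits the job to the ResourceManager,
        // which starts an MRAppMaster to coordinate the map and reduce tasks in containers.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);   // mapper from the earlier sketch
        job.setReducerClass(SumReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}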

27
Q

What is ZooKeeper?

A

ZooKeeper is a distributed coordination service that maintains configuration information and naming, and provides distributed synchronization and group services.
It is designed to be highly available, fault-tolerant, and scalable.

28
Q

What are some common use cases for ZooKeeper?

A

ZooKeeper is commonly used for distributed systems coordination, such as leader election, distributed locking, configuration management, and maintaining consistent metadata.
It is also used in distributed messaging systems, distributed databases, and other distributed applications requiring coordination and synchronization.
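A minimal leader-election sketch using ephemeral sequential znodes, one of the standard ZooKeeper recipes; the /election path, connection string, and class name are illustrative assumptions.

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Make sure the election parent znode exists (another candidate may create it first).
        if (zk.exists("/election", false) == null) {
            try {
                zk.create("/election", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) { }
        }

        // Each candidate creates an ephemeral sequential znode such as /election/n_0000000003;
        // the znode disappears automatically if the candidate's session dies.
        String myNode = zk.create("/election/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate holding the znode with the smallest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        System.out.println(myNode.endsWith(children.get(0))
                ? "This process is the leader"
                : "Following " + children.get(0));

        zk.close();
    }
}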

29
Q

How does ZooKeeper achieve its goals?

A

ZooKeeper exposes a hierarchical namespace of data nodes called znodes, which are similar to files and directories in a file system.
It uses a consensus protocol, ZAB (ZooKeeper Atomic Broadcast), to maintain consistency and coordination among distributed nodes.
A ZooKeeper ensemble consists of multiple ZooKeeper servers forming a quorum for fault tolerance and high availability.

30
Q

What is a quorum in ZooKeeper?

A

A quorum in ZooKeeper refers to a majority of nodes in the ZooKeeper ensemble.
A quorum of nodes must agree on changes to the state of the system for operations to proceed, ensuring consistency and fault tolerance.

31
Q

How do applications interact with ZooKeeper?

A

Applications interact with ZooKeeper using its client API, which provides primitives for creating, reading, updating, and deleting ZNodes.
ZooKeeper clients establish connections to one or more ZooKeeper servers in the ensemble and receive notifications about changes to the ZNodes they are interested in.
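A small sketch of the Java client API in action, covering create, read (with a watch), update, and delete; the ensemble addresses and znode path are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkClientSketch {
    public static void main(String[] args) throws Exception {
        // Connect to an ensemble; the watcher callback receives session and znode events.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                event -> System.out.println("event: " + event));

        // Create, read (setting a watch), update, and delete a znode.
        zk.create("/app_config", "v1".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        Stat stat = new Stat();
        byte[] data = zk.getData("/app_config", true, stat);   // true = set a watch
        zk.setData("/app_config", "v2".getBytes(StandardCharsets.UTF_8), stat.getVersion());
        zk.delete("/app_config", -1);                           // -1 = any version

        zk.close();
    }
}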

32
Q

How does ZooKeeper handle authentication?

A

ZooKeeper supports various authentication mechanisms, including Kerberos, digest, and SSL/TLS.
Clients can authenticate to ZooKeeper servers using one of these mechanisms, and znodes protected by ACLs then grant access based on the authenticated identity.

33
Q

What is authorization in ZooKeeper?

A

Authorization in ZooKeeper involves controlling access to ZNodes based on user roles and permissions.
ZooKeeper uses an Access Control List (ACL) model to define permissions for each ZNode, specifying which users or groups have read, write, and administration privileges.

34
Q

How are ACLs used in ZooKeeper?

A

ACLs are attached to each ZNode in ZooKeeper to specify who has access to that node and what operations they are allowed to perform.
ACLs consist of a list of permissions (e.g., read, write, create, delete) associated with individual users or groups.
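A hedged sketch of attaching an ACL at creation time with the digest scheme, using the stock Java client; the username, password, and znode path are made up for illustration.

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkAclSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Authenticate this session with the digest scheme (username:password).
        zk.addAuthInfo("digest", "alice:secret".getBytes(StandardCharsets.UTF_8));

        // CREATOR_ALL_ACL grants all permissions only to the authenticated creator;
        // other sessions get a NoAuthException when they try to touch this znode.
        zk.create("/secure_node", "data".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);

        zk.close();
    }
}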

35
Q

How does ZooKeeper ensure data privacy?

A

ZooKeeper supports encryption of data in transit using SSL/TLS protocols.
By enabling encryption, ZooKeeper ensures that data exchanged between clients and servers is encrypted, preventing eavesdropping and tampering.

36
Q

How does ZooKeeper ensure secure communication between clients and servers?

A

ZooKeeper uses SSL/TLS to encrypt communication between clients and servers, providing confidentiality and integrity.
Clients can configure ZooKeeper to use SSL/TLS for secure communication by enabling encryption in the client configuration.

37
Q

What is Active-Passive High Availability architecture?

A

In an Active-Passive HA architecture, there are two or more redundant systems, but only one is active at a time.
The passive systems remain in standby mode, ready to take over if the active system fails.
Failover to a passive system is triggered either manually or by automated failure detection.

38
Q

What is Active-Active High Availability architecture?

A

In an Active-Active HA architecture, multiple redundant systems are active and serving traffic simultaneously.
Traffic is distributed among the active systems, allowing for load balancing and scalability.
Each system is capable of handling the entire workload independently, providing fault tolerance and high availability.

39
Q

What is N+1 Redundancy in High Availability architecture?

A

N+1 Redundancy refers to having one extra redundant system or component beyond what is necessary to handle the expected load.
If one system fails, the spare redundant system takes over, maintaining service availability without interruption.
N+1 redundancy thus provides fault tolerance at the cost of only one extra system.

40
Q

How does Active-Standby High Availability architecture work?

A

In an Active-Standby HA architecture, there is one active system serving traffic and one or more standby systems in standby mode.
The standby systems are ready to take over if the active system fails, but they do not serve traffic under normal conditions.
Failover to the standby system can be manual or automatic, depending on the configuration and requirements.

41
Q

How does Load Balancing contribute to High Availability?

A

Load Balancing distributes incoming traffic across multiple servers or resources to ensure optimal resource utilization and prevent overload.
By distributing traffic among redundant systems, load balancing enhances fault tolerance and improves system reliability and availability.

42
Q

What is erasure coding?

A

Erasure coding is a method of data protection and redundancy in which data is divided into fragments, expanded, and encoded with redundant information.
This redundancy allows the original data to be reconstructed even if some fragments are lost or corrupted.

43
Q

How does erasure coding provide redundancy?

A

Erasure coding generates additional redundant fragments, known as parity fragments, alongside the original data fragments.
These redundant fragments contain information that enables the reconstruction of the original data, even if some fragments are missing or damaged.
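A toy illustration of the idea using a single XOR parity fragment: production systems such as HDFS erasure coding use Reed-Solomon codes over many fragments, but the recovery principle is the same. The fragment contents here are arbitrary examples.

import java.util.Arrays;

// One parity fragment computed as the XOR of two data fragments: if either data
// fragment is lost, it can be rebuilt from the survivor and the parity.
public class XorParityDemo {
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    public static void main(String[] args) {
        byte[] d1 = "HELLO".getBytes();
        byte[] d2 = "WORLD".getBytes();
        byte[] parity = xor(d1, d2);                 // stored alongside d1 and d2

        // Suppose fragment d1 is lost: rebuild it from d2 and the parity fragment.
        byte[] recovered = xor(parity, d2);
        System.out.println(new String(recovered));    // prints HELLO
        System.out.println(Arrays.equals(recovered, d1));
    }
}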

44
Q

How does erasure coding contribute to fault tolerance?

A

Erasure coding enhances fault tolerance by allowing data to be reconstructed from remaining fragments in the event of fragment loss or corruption.
The redundancy introduced by erasure coding reduces the likelihood of data loss or unavailability due to hardware failures or data corruption.

45
Q

How does erasure coding affect storage efficiency?

A

Erasure coding typically requires far less storage overhead than traditional replication-based redundancy schemes for a comparable level of protection.
For example, a Reed-Solomon (6,3) layout stores 6 data fragments plus 3 parity fragments (1.5x overhead) and tolerates the loss of any 3 fragments, whereas 3x replication needs 3x the storage to tolerate the loss of any 2 copies.

46
Q

What are some common use cases for erasure coding?

A

Erasure coding is commonly used in distributed storage systems, cloud storage, object storage, and archival systems where data durability and reliability are essential.
It is particularly well-suited for large-scale storage deployments requiring efficient use of storage resources and high fault tolerance.

47
Q

How does Hadoop Distributed File System (HDFS) distribute data across the cluster?

A

HDFS divides large files into fixed-size blocks, typically 128 MB or 256 MB in size.
Each block is replicated across multiple DataNodes in the cluster to ensure fault tolerance and data availability.

48
Q

What determines the number of replicas created for each data block in HDFS?

A

The replication factor is a configurable parameter in HDFS that determines the number of replicas created for each data block.
The default replication factor is usually 3, meaning that each block is replicated three times across the cluster.
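A small sketch showing how a client can inspect and override the replication factor of a single file with the Java FileSystem API; the file path is an illustrative assumption, and the cluster-wide default comes from the dfs.replication setting.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");    // hypothetical existing file

        // Read the file's current replication factor from its metadata.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Raise the replication factor of this one file to 5; the NameNode will
        // schedule the extra copies on additional DataNodes.
        fs.setReplication(file, (short) 5);
        fs.close();
    }
}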

49
Q

How does HDFS decide which DataNodes to replicate data blocks to?

A

HDFS aims to distribute data blocks across different racks and DataNodes to improve fault tolerance, while its placement policy keeps most transfers within a rack to limit network traffic.
The replication algorithm takes into account the proximity of DataNodes to optimize data placement and ensure data locality.

50
Q

What is Rack Awareness in HDFS?

A

Rack Awareness is a feature of HDFS that allows it to be aware of the network topology, including racks and DataNodes’ locations within the racks.
HDFS uses Rack Awareness to ensure that replicas of data blocks are stored on different racks for fault tolerance and network efficiency.

51
Q

How does HDFS assign data blocks to DataNodes?

A

When a client writes data to HDFS, the NameNode selects a set of DataNodes to store replicas of the data blocks.
The selection considers factors such as DataNode availability, rack proximity, and the desired replication factor.

52
Q

What are racks in Hadoop Distributed File System (HDFS) topology?

A

Racks are physical cabinets of machines, typically sharing a top-of-rack network switch, that house multiple DataNodes in a Hadoop cluster.
Racks are the basic unit of network topology in HDFS, and DataNodes within the same rack are typically connected by high-speed local network links.

53
Q

What are DataNodes in HDFS topology?

A

DataNodes are individual machines in a Hadoop cluster that store data and perform data processing tasks.
DataNodes are organized into racks based on their physical location in the data center.

54
Q

Where does the NameNode reside in HDFS topology?

A

The NameNode, which manages the metadata of the file system, typically runs on a dedicated master node rather than on the DataNodes themselves.
This separation helps keep the metadata service available even when individual DataNode racks go offline.

55
Q

How does HDFS organize its network topology?

A

HDFS organizes its network topology into a hierarchical structure consisting of racks, switches, and DataNodes.
DataNodes within the same rack are considered close, while those in different racks are considered distant.

56
Q

How does HDFS topology contribute to data locality?

A

Hadoop places an emphasis on data locality: HDFS exposes block locations so that processing frameworks can schedule computation tasks on the DataNodes that already hold the data they need.
By processing data locally, the cluster minimizes network traffic and improves performance.
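A brief sketch of how a client or framework can ask HDFS where a file’s blocks live, which is the information schedulers use to place tasks next to their data; the file path is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");   // hypothetical existing file
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}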