batch_applications_flashcards
What is Apache YARN?
Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. It was introduced in Hadoop 2 to separate resource management from the MapReduce engine, so it manages and allocates cluster resources for distributed applications in general, not only MapReduce.
What is the role of the Resource Manager (RM) in YARN?
The RM manages resource allocation across all applications in the cluster, ensures fault tolerance, and uses scheduling policies to manage concurrency.
What is the Node Manager (NM) in YARN?
The NM is a per-node agent responsible for monitoring resource usage of containers and reporting to the RM. It launches and manages containers on each node.
What is a container in YARN?
A container is an abstraction used to run application-specific processes with allocated CPU and memory resources.
What is the Application Master (AMP) in YARN?
The AMP controls the execution of a specific application, manages its lifecycle, and communicates with the RM to request resources and launch tasks.
What are the benefits of using YARN for big data applications?
YARN provides scalability, flexibility for running multiple jobs, improved resource utilization, and fault tolerance.
What is data locality in YARN?
Data locality means moving code to the data instead of moving data to the code, optimizing processing speed by reducing data transfer across nodes.
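A minimal sketch of locality-aware placement, assuming a hypothetical helper `pick_node` and a made-up rack map. It prefers a free node that already holds a replica of the block, then a node on the same rack as a replica, then any free node. Real schedulers weigh more factors than this.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Return (node, locality_level) for a task that reads one block.

    block_replicas: set of node names that store a replica of the block
    free_nodes:     list of node names with spare capacity
    rack_of:        dict mapping node name -> rack name (hypothetical layout)
    """
    # 1. Node-local: run the code on a node that already stores the data.
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"

    # 2. Rack-local: the data crosses only the rack switch.
    replica_racks = {rack_of[r] for r in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Off-rack: any free node; the data must travel across racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1", "n3"}, ["n2", "n3"], rack_of))
```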
What are the three main types of schedulers in YARN?
The FIFO Scheduler, Capacity Scheduler, and Fair Scheduler.
How does the FIFO Scheduler in YARN operate?
It runs jobs in the order they arrive. This is simple, but it is a poor fit for shared clusters: there is no resource balancing, so a large job can occupy the cluster and make every later job wait.
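A toy illustration of why FIFO hurts on a shared cluster. It assumes, for simplicity, that every job arrives at time 0 and uses the whole cluster while it runs; the function name and job durations are made up.

```python
def fifo_finish_times(jobs):
    """jobs: list of (name, duration) in arrival order.

    Returns a dict mapping each job name to the time at which it finishes
    when jobs run strictly one after another.
    """
    clock = 0
    finish = {}
    for name, duration in jobs:
        clock += duration          # the job runs only after all earlier jobs
        finish[name] = clock
    return finish


# A 5-minute job submitted just after a 100-minute job waits for all of it.
print(fifo_finish_times([("big", 100), ("small", 5)]))
```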
What is the Capacity Scheduler in YARN?
A scheduler that divides the cluster into queues and reserves a guaranteed share of resources for each queue, so that jobs submitted to a queue get predictable performance.
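In the Capacity Scheduler the reserved shares are configured as percentages per queue. The sketch below only shows the arithmetic of turning percentages into guaranteed containers; the queue names and the function `guaranteed_capacity` are hypothetical, and the real scheduler adds features such as borrowing idle capacity.

```python
def guaranteed_capacity(total_containers, queue_percent):
    """Split a cluster into per-queue guarantees.

    queue_percent maps queue name -> percentage of the cluster; the
    percentages of sibling queues are expected to add up to 100.
    """
    if sum(queue_percent.values()) != 100:
        raise ValueError("queue percentages must sum to 100")
    return {q: total_containers * p // 100 for q, p in queue_percent.items()}


print(guaranteed_capacity(200, {"prod": 70, "dev": 30}))
```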
What is the Fair Scheduler in YARN?
A scheduler that dynamically balances resources across all running jobs to ensure fairness.
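One common way to make "fairness" precise is max-min fair sharing: split capacity equally, and hand back whatever a small job does not need to the jobs that still want more. The sketch below implements that generic technique; it is an illustration of the idea, not the Fair Scheduler's actual code, and the job names and demands are invented.

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation of `capacity` among jobs.

    demands maps job name -> amount of resource the job wants.
    Returns a dict mapping job name -> amount allocated.
    """
    alloc = {job: 0.0 for job in demands}
    remaining = float(capacity)
    unsatisfied = {job for job, d in demands.items() if d > 0}
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)   # equal split for this round
        for job in list(unsatisfied):
            give = min(share, demands[job] - alloc[job])
            alloc[job] += give
            remaining -= give
            if demands[job] - alloc[job] <= 1e-9:
                unsatisfied.discard(job)       # fully served, frees its surplus
    return alloc


# Job "a" needs little, so "b" and "c" split what is left equally.
print(max_min_fair(100, {"a": 20, "b": 50, "c": 80}))
```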
How does YARN handle job submission and execution?
A client submits the job to the RM. The RM allocates a container in which the AMP starts, and the AMP then requests additional containers from the RM to run the job’s tasks.
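The submission flow can be sketched as a toy simulation. The `ResourceManager` class and `submit_job` function here are hypothetical stand-ins that only count containers; they are not the YARN client API.

```python
class ResourceManager:
    """Toy RM that tracks only the number of free containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, count):
        # Grant as many containers as are free, up to the requested count.
        granted = min(count, self.free)
        self.free -= granted
        return granted


def submit_job(rm, num_tasks):
    """Return a log of the steps taken for one job submission."""
    log = ["client: submit job"]
    # Step 1: the RM allocates one container for the AMP.
    if rm.allocate(1) != 1:
        log.append("rm: no container for AMP, job waits")
        return log
    log.append("rm: container granted, AMP started")
    # Step 2: the AMP asks the RM for containers to run the tasks.
    granted = rm.allocate(num_tasks)
    log.append(f"amp: requested {num_tasks} task containers, got {granted}")
    return log


for line in submit_job(ResourceManager(4), 5):
    print(line)
```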
What is the fault tolerance mechanism in YARN?
The AMP sets failed tasks to idle and reschedules them. If the AMP itself fails, the RM can start a new AMP that uses the job history to recover the state of the job, or restart the job from the beginning if recovery is not possible.
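Task rescheduling can be pictured as a bounded retry loop. This is a minimal sketch with a hypothetical `run_with_retries` helper and an assumed limit of 4 attempts; in a real cluster each retry would typically run in a new container, possibly on another node.

```python
def run_with_retries(task, max_attempts=4):
    """Run `task(attempt)` until it succeeds or attempts run out.

    Returns (result, attempts_used). A RuntimeError raised by the task
    stands in for a failed task attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except RuntimeError:
            continue  # mark the attempt as failed and reschedule the task
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def flaky_task(attempt):
    # Hypothetical task that fails on its first two attempts.
    if attempt < 3:
        raise RuntimeError("simulated task failure")
    return "done"


print(run_with_retries(flaky_task))
```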
What is the main drawback of Hadoop’s MapReduce?
It is slow because of heavy disk I/O: each job writes its output to HDFS, so a chain of jobs rereads data from disk at every step. This makes it unsuitable for interactive processing and iterative algorithms.
What are the advantages of Apache Spark over Hadoop MapReduce?
Spark keeps intermediate data in memory instead of writing it to disk between steps, which makes processing faster and suits iterative algorithms. It also offers a unified framework for batch and real-time processing.