batch_applications_flashcards

Question 1

Q

What is Apache YARN?

Answer

A

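Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. Introduced in Hadoop 2, it separates resource management and scheduling from the MapReduce processing engine, so the same cluster can allocate resources to MapReduce and to other distributed computing frameworks.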

Question 2

Q

What is the role of the Resource Manager (RM) in YARN?

Answer

A

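The RM is the cluster-wide master that allocates resources across all applications in the cluster. It tracks node liveness through Node Manager heartbeats to support fault tolerance, and its pluggable scheduler applies a scheduling policy to decide how concurrent applications share the cluster.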

Question 3

Q

What is the Node Manager (NM) in YARN?

Answer

A

The NM is a per-node agent responsible for monitoring resource usage of containers and reporting to the RM. It launches and manages containers on each node.

Question 4

Q

What is a container in YARN?

Answer

A

A container is an abstraction used to run application-specific processes with allocated CPU and memory resources.

Question 5

Q

What is the Application Master (AM) in YARN?

Answer

A

The AM is a per-application process that controls the execution of one specific application. It manages the application's lifecycle, requests resources from the RM, and works with the NMs to launch tasks in the containers it is granted.

Question 6

Q

What are the benefits of using YARN for big data applications?

Answer

A

YARN provides scalability, flexibility for running multiple jobs, improved resource utilization, and fault tolerance.

Question 7

Q

What is data locality in YARN?

Answer

A

Data locality means moving code to the data instead of moving data to the code, optimizing processing speed by reducing data transfer across nodes.
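
A minimal sketch of the idea, assuming a toy placement function rather than any real YARN API. The function name `pick_node` and the node names are hypothetical. It prefers a free node that already stores a replica of the block and only falls back to a remote node (and a network transfer) when no replica holder is free. Intermediate levels such as rack-local placement are left out to keep it short.

```python
# Toy model of locality-aware task placement; not the YARN API.
def pick_node(block_locations, free_nodes):
    """Return (node, locality), preferring a free node that stores the block."""
    for node in block_locations:
        if node in free_nodes:
            # The code moves to the data: no block transfer is needed.
            return node, "node-local"
    # No replica holder is free: run elsewhere and read the block remotely.
    fallback = sorted(free_nodes)[0]
    return fallback, "remote"


replicas = ["node2", "node5", "node7"]  # hypothetical hosts holding the block
local_choice = pick_node(replicas, {"node1", "node5"})
remote_choice = pick_node(replicas, {"node1", "node3"})
print(local_choice)   # ('node5', 'node-local')
print(remote_choice)  # ('node1', 'remote')
```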

Question 8

Q

What are the three main types of schedulers in YARN?

Answer

A

The FIFO Scheduler, Capacity Scheduler, and Fair Scheduler.

Question 9

Q

How does the FIFO Scheduler in YARN operate?

Answer

A

It processes jobs in the order they arrive, which is simple but not ideal for shared clusters due to lack of resource balancing.
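
A small illustration of why FIFO is a poor fit for shared clusters, assuming a simplified model in which each job occupies the whole cluster until it finishes. The function `fifo_schedule` and the job names are hypothetical and are not part of YARN.

```python
# Toy FIFO model: jobs run one after another in arrival order.
def fifo_schedule(jobs):
    """jobs: list of (name, runtime) in arrival order.

    Returns {name: (start_time, finish_time)}.
    """
    clock = 0
    timeline = {}
    for name, runtime in jobs:
        timeline[name] = (clock, clock + runtime)
        clock += runtime
    return timeline


jobs = [("big_etl", 60), ("small_query", 2)]
timeline = fifo_schedule(jobs)
print(timeline["small_query"])  # (60, 62)
```

The 2-minute query waits 60 minutes behind the large job, which is the lack of balancing the card refers to.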

Question 10

Q

What is the Capacity Scheduler in YARN?

Answer

A

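A scheduler that divides the cluster into queues and reserves a configured share of resources for each one. Jobs submitted to a queue run within that guaranteed capacity, which gives predictable performance (for example, small jobs in their own queue are not blocked by large ones), at the cost of some idle capacity when a queue is empty.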
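
As a rough arithmetic sketch: in the Capacity Scheduler the reservation is configured per queue, as a percentage of the cluster. The helper `queue_guarantees` and the queue names below are hypothetical. It only shows how percentages translate into guaranteed containers, not how the real scheduler is configured.

```python
# Toy calculation of per-queue guarantees; not the Capacity Scheduler itself.
def queue_guarantees(total_containers, queue_percent):
    """Convert per-queue percentages into guaranteed container counts."""
    if sum(queue_percent.values()) != 100:
        raise ValueError("queue percentages must add up to 100")
    return {
        queue: total_containers * percent // 100
        for queue, percent in queue_percent.items()
    }


guarantees = queue_guarantees(200, {"prod": 70, "dev": 30})
print(guarantees)  # {'prod': 140, 'dev': 60}
```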

Question 11

Q

What is the Fair Scheduler in YARN?

Answer

A

A scheduler that dynamically balances resources across all running jobs to ensure fairness.
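
One common way to define "fair" is max-min fairness: every running job gets an equal slice, and whatever a small job does not need is redistributed among the rest. The sketch below assumes that definition and a single resource type (containers). The function `fair_shares` is a hypothetical illustration, not the Fair Scheduler's actual algorithm.

```python
# Toy max-min fair split of containers among running jobs.
def fair_shares(demands, capacity):
    """demands: {job: containers wanted}. Returns {job: containers granted}."""
    shares = {job: 0.0 for job in demands}
    active = {job for job, need in demands.items() if need > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        equal_slice = remaining / len(active)
        for job in list(active):
            # Never grant more than the job still needs.
            grant = min(equal_slice, demands[job] - shares[job])
            shares[job] += grant
            remaining -= grant
            if demands[job] - shares[job] <= 1e-9:
                active.discard(job)  # satisfied; leftover goes to the others
    return shares


shares = fair_shares({"a": 10, "b": 100, "c": 100}, capacity=90)
print(shares)  # {'a': 10.0, 'b': 40.0, 'c': 40.0}
```

A job running alone receives the whole cluster, and its share shrinks as other jobs arrive.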

Question 12

Q

How does YARN handle job submission and execution?

Answer

A

A client submits the job to the RM. The RM allocates a first container in which an NM launches the job's Application Master (AM). The AM then requests additional containers from the RM and uses them to execute the job's tasks.
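
The two-step flow can be sketched with a toy model, assuming a cluster described only by a count of free containers. The class `ResourceManager` and the function `submit_job` are hypothetical stand-ins and do not mirror the real YARN client API.

```python
# Toy model of YARN job submission; names are illustrative only.
class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers
        self.log = []  # (requester, containers granted), in order

    def allocate(self, requester, count):
        granted = min(count, self.free)
        self.free -= granted
        self.log.append((requester, granted))
        return granted


def submit_job(rm, job_name, num_tasks):
    # Step 1: the RM grants one container to host the Application Master.
    if rm.allocate(f"{job_name}:AM", 1) != 1:
        return 0  # cluster is full; the job cannot start yet
    # Step 2: the Application Master asks the RM for task containers.
    return rm.allocate(f"{job_name}:tasks", num_tasks)


rm = ResourceManager(total_containers=5)
granted = submit_job(rm, "wordcount", num_tasks=8)
print(granted)  # 4
print(rm.log)   # [('wordcount:AM', 1), ('wordcount:tasks', 4)]
```

Only 4 of the 8 requested task containers are granted because one of the 5 containers is already used by the Application Master.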

Question 13

Q

What is the fault tolerance mechanism in YARN?

Answer

A

The AM marks failed tasks as idle and reschedules them in new containers. If the AM itself fails, the RM starts a new AM attempt, which can use the job history to recover the state of the job, or restarts the job if recovery is not possible.

Question 14

Q

What is the main drawback of Hadoop’s MapReduce?

Answer

A

It is slow due to disk I/O, writes output to HDFS after each job, and is not suitable for interactive processing or iterative algorithms.

Question 15

Q

What are the advantages of Apache Spark over Hadoop MapReduce?

Answer

A

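Spark keeps intermediate data in memory instead of writing it to disk between steps, which makes data processing faster and suits iterative algorithms and interactive queries. It also provides a unified framework for batch and stream (near-real-time) processing.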

Question 16

Q

What is an RDD in Apache Spark?

Answer

A

A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects partitioned across cluster nodes, supporting operations like transformations and actions.

Question 17

Q

What are transformations and actions in Spark?

Answer

A

Transformations create a new RDD from an existing one, while actions trigger computation and return a result or save output.

Question 18

Q

What is lazy evaluation in Spark?

Answer

A

Spark delays the computation of transformations until an action is called, optimizing execution by applying transformations only when necessary.
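
A pure-Python sketch of the same behaviour, assuming a toy `LazyDataset` class (hypothetical, not the Spark API). `map` and `filter` play the role of transformations and only record what to do. `collect` plays the role of an action and is the only call that runs the recorded steps.

```python
# Toy illustration of lazy evaluation; this is not PySpark.
class LazyDataset:
    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)  # recorded lineage of pending transformations

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self.source, self.ops + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self.source, self.ops + (("filter", predicate),))

    def collect(self):
        # Action: runs every recorded step over the source data.
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []

def square(x):
    calls.append(x)  # record that real work happened
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)
result = ds.collect()
print(calls_before_action)  # []
print(result)               # [0, 4, 16]
```

No call to `square` happens while the pipeline is being built. The work starts only when `collect` is invoked.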

Question 19

Q

What is the Directed Acyclic Graph (DAG) in Spark?

Answer

A

A DAG represents the sequence of computations performed on RDDs, with nodes as RDDs and edges as operations. It is transformed into a physical execution plan for task scheduling.

Question 20

Q

How does Spark’s architecture differ from MapReduce?

Answer

A

Spark uses a driver program and executors that run multiple tasks, leveraging memory for intermediate data storage, whereas MapReduce writes intermediate results to disk.

Question 21

Q

What is YARN’s role in running Spark jobs?

Answer

A

YARN acts as the cluster manager that allocates resources for Spark jobs and monitors their execution, with Spark’s driver handling task assignments to executors.

Question 22

Q

What is data caching in Spark?

Answer

A

Caching stores RDDs in memory, improving performance by avoiding recomputation for repeated access.
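
The effect can be illustrated with a toy class, assuming a dataset whose contents are produced by an expensive function. `CachedDataset` is a hypothetical stand-in, not the Spark API. Without caching, every action recomputes the data. With caching, the first action stores the result and later actions reuse it.

```python
# Toy illustration of caching; this is not PySpark.
class CachedDataset:
    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached = None
        self.runs = 0  # how many times the expensive computation ran

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached  # reuse the stored result
        self.runs += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result


def expensive():
    return [x * x for x in range(4)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()

cached = CachedDataset(expensive).cache()
cached.collect()
cached.collect()

print(uncached.runs)  # 2
print(cached.runs)    # 1
```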

(22 cards)