batch_applications_flashcards

1
Q

What is Apache YARN?

A

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. It was introduced in Hadoop 2 to separate resource management and job scheduling from the MapReduce processing engine, so that MapReduce and other distributed frameworks can share the same cluster.

2
Q

What is the role of the Resource Manager (RM) in YARN?

A

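The RM is the cluster-wide master that allocates resources across all applications in the cluster. It applies the configured scheduling policy to decide which application receives containers when several compete, and it tracks node and application status so that failures can be detected and handled.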

3
Q

What is the Node Manager (NM) in YARN?

A

The NM is a per-node agent responsible for monitoring resource usage of containers and reporting to the RM. It launches and manages containers on each node.

4
Q

What is a container in YARN?

A

A container is an abstraction for a fixed allocation of resources (CPU and memory) on a single node, inside which an application-specific process runs.

5
Q

What is the Application Master (AMP) in YARN?

A

The AMP controls the execution of a specific application, manages its lifecycle, and communicates with the RM to request resources and launch tasks.

6
Q

What are the benefits of using YARN for big data applications?

A

YARN provides scalability, flexibility for running multiple jobs, improved resource utilization, and fault tolerance.

7
Q

What is data locality in YARN?

A

Data locality means moving code to the data instead of moving data to the code, optimizing processing speed by reducing data transfer across nodes.
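
As a rough illustration of the idea, the sketch below chooses a node for a task by preferring one that already stores the task's input block. The node names, the `block_locations` map, and the `pick_node` helper are hypothetical and are not part of any YARN or HDFS API.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; else use any free node."""
    holders = block_locations.get(block_id, set())
    local = [node for node in free_nodes if node in holders]
    if local:
        return local[0], "node-local"
    return free_nodes[0], "remote"  # the block must be copied over the network


# Hypothetical placement of two blocks across three nodes.
block_locations = {"blk_1": {"node-a", "node-c"}, "blk_2": {"node-b"}}

print(pick_node("blk_1", block_locations, ["node-b", "node-c"]))  # ('node-c', 'node-local')
print(pick_node("blk_2", block_locations, ["node-a", "node-c"]))  # ('node-a', 'remote')
```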

8
Q

What are the three main types of schedulers in YARN?

A

The FIFO Scheduler, Capacity Scheduler, and Fair Scheduler.

9
Q

How does the FIFO Scheduler in YARN operate?

A

It processes jobs in the order they arrive. This is simple, but it is not ideal for shared clusters because resources are not balanced between jobs: a large job can hold up every job queued behind it.
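
A toy model can make the drawback concrete. The sketch below assumes each job takes the whole cluster for its full runtime, which is a simplification; the job names and runtimes are made up.

```python
def fifo_schedule(jobs):
    """jobs: list of (name, runtime) in arrival order. Returns {name: (start, finish)}."""
    clock = 0
    timeline = {}
    for name, runtime in jobs:
        timeline[name] = (clock, clock + runtime)
        clock += runtime  # the next job cannot start until this one finishes
    return timeline


timeline = fifo_schedule([("big_etl", 100), ("small_query", 2)])
print(timeline)  # {'big_etl': (0, 100), 'small_query': (100, 102)}
```

The two-unit query waits 100 time units simply because it arrived second.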

10
Q

What is the Capacity Scheduler in YARN?

A

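A scheduler that divides the cluster into queues, each with a reserved share of the cluster's resources. Jobs run within the capacity of the queue they are submitted to, which keeps performance predictable for each queue's users.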
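
The reservation idea can be sketched as a simple split of cluster memory. This assumes capacity is configured per queue as percentages; the `queue_capacities` helper and the queue names are hypothetical, and real configuration lives in the scheduler's own config file rather than in code like this.

```python
def queue_capacities(total_memory_gb, queue_shares):
    """Split cluster memory into reserved per-queue capacities.

    queue_shares maps a queue name to its percentage; shares must sum to 100.
    """
    if sum(queue_shares.values()) != 100:
        raise ValueError("queue shares must sum to 100")
    return {queue: total_memory_gb * pct / 100 for queue, pct in queue_shares.items()}


caps = queue_capacities(200, {"prod": 70, "dev": 30})
print(caps)  # {'prod': 140.0, 'dev': 60.0}
```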

11
Q

What is the Fair Scheduler in YARN?

A

A scheduler that dynamically balances resources across all running jobs to ensure fairness.
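
A minimal sketch of the balancing step, assuming fairness means an equal split of containers among running jobs (real fair scheduling also supports weights and queues). The `fair_shares` helper is hypothetical.

```python
def fair_shares(total_containers, running_jobs):
    """Give every running job an equal share; leftover containers go to the earliest jobs."""
    if not running_jobs:
        return {}
    base, extra = divmod(total_containers, len(running_jobs))
    return {job: base + (1 if i < extra else 0) for i, job in enumerate(running_jobs)}


# Shares are recomputed whenever a job starts or finishes.
print(fair_shares(10, ["job_a"]))                    # {'job_a': 10}
print(fair_shares(10, ["job_a", "job_b"]))           # {'job_a': 5, 'job_b': 5}
print(fair_shares(10, ["job_a", "job_b", "job_c"]))  # {'job_a': 4, 'job_b': 3, 'job_c': 3}
```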

12
Q

How does YARN handle job submission and execution?

A

A client submits the job to the RM. The RM allocates a container in which the AMP is started, and the AMP then requests additional containers from the RM and uses them to run the job's tasks.
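
The flow can be sketched with a toy resource manager. The `ResourceManager` class and `submit_job` function below are illustrative stand-ins, not the YARN client API, and they ignore node placement and scheduling policy.

```python
class ResourceManager:
    """Toy RM that hands out numbered containers until the cluster is full."""

    def __init__(self, total_containers):
        self.free = total_containers
        self.next_id = 1

    def allocate(self, count):
        granted = min(count, self.free)
        ids = list(range(self.next_id, self.next_id + granted))
        self.free -= granted
        self.next_id += granted
        return ids


def submit_job(rm, num_tasks):
    """One container for the application master, then one per task."""
    am_container = rm.allocate(1)             # step 1: the RM starts the application master
    task_containers = rm.allocate(num_tasks)  # step 2: the master asks the RM for more
    return {"am": am_container, "tasks": task_containers}


rm = ResourceManager(total_containers=5)
print(submit_job(rm, num_tasks=3))  # {'am': [1], 'tasks': [2, 3, 4]}
print(rm.free)                      # 1
```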

13
Q

What is the fault tolerance mechanism in YARN?

A

Task failures are handled by the AMP, which sets failed tasks back to idle and reschedules them. If the AMP itself fails, the RM can use the job history to recover the job's state, or restart the job if necessary.
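
The task-level part of this mechanism can be sketched as a retry loop. The `run_with_retries` helper, the `flaky` task runner, and the limit of three attempts are assumptions made for illustration only.

```python
from collections import deque


def run_with_retries(tasks, run_task, max_attempts=3):
    """Run each task; a failed task returns to the idle queue until attempts run out."""
    idle = deque((task, 1) for task in tasks)
    done, failed = [], []
    while idle:
        task, attempt = idle.popleft()
        if run_task(task, attempt):
            done.append(task)
        elif attempt < max_attempts:
            idle.append((task, attempt + 1))  # mark idle again and reschedule
        else:
            failed.append(task)
    return done, failed


# Hypothetical task runner: "t2" fails on its first attempt, "t3" always fails.
def flaky(task, attempt):
    if task == "t2":
        return attempt > 1
    return task != "t3"


print(run_with_retries(["t1", "t2", "t3"], flaky))  # (['t1', 't2'], ['t3'])
```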

14
Q

What is the main drawback of Hadoop’s MapReduce?

A

It is slow due to disk I/O, writes output to HDFS after each job, and is not suitable for interactive processing or iterative algorithms.

15
Q

What are the advantages of Apache Spark over Hadoop MapReduce?

A

Spark keeps intermediate data in memory where possible, which makes processing faster, supports iterative algorithms well, and provides a unified framework for batch and real-time processing.

16
Q

What is an RDD in Apache Spark?

A

A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects partitioned across cluster nodes, supporting operations like transformations and actions.

17
Q

What are transformations and actions in Spark?

A

Transformations create a new RDD from an existing one, while actions trigger computation and return a result or save output.

18
Q

What is lazy evaluation in Spark?

A

Spark delays the computation of transformations until an action is called, optimizing execution by applying transformations only when necessary.
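
The behavior can be imitated in plain Python. The `LazyDataset` class below is a toy stand-in for an RDD, not Spark's API: `map` and `filter` only record a step, and `collect` is the action that runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = tuple(ops)

    def map(self, fn):  # transformation: returns a new dataset, computes nothing
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, fn):  # transformation: also only recorded
        return LazyDataset(self._data, self._ops + (("filter", fn),))

    def collect(self):  # action: only now are the recorded steps applied
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(calls)         # [] because nothing has run yet
print(ds.collect())  # [6, 8]
print(calls)         # [1, 2, 3, 4]
```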

19
Q

What is the Directed Acyclic Graph (DAG) in Spark?

A

A DAG represents the sequence of computations performed on RDDs, with nodes as RDDs and edges as operations. It is transformed into a physical execution plan for task scheduling.
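
As a sketch of how such a graph yields an execution order, the example below stores a hypothetical word-count lineage as a mapping from each dataset to the datasets it is derived from, then sorts it topologically with the standard library. Spark's own planner is considerably more involved, for example in how it groups operations into stages.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each dataset maps to the datasets it is derived from (hypothetical names).
lineage = {
    "lines": set(),
    "words": {"lines"},   # words = lines split into words
    "pairs": {"words"},   # pairs = each word mapped to (word, 1)
    "counts": {"pairs"},  # counts = pairs summed per word
}

# A valid execution order computes every dataset after its parents.
order = list(TopologicalSorter(lineage).static_order())
print(order)  # ['lines', 'words', 'pairs', 'counts']
```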

20
Q

How does Spark’s architecture differ from MapReduce?

A

Spark uses a driver program and executors that run multiple tasks, leveraging memory for intermediate data storage, whereas MapReduce writes intermediate results to disk.

21
Q

What is YARN’s role in running Spark jobs?

A

YARN acts as the cluster manager that allocates resources for Spark jobs and monitors their execution, with Spark’s driver handling task assignments to executors.

22
Q

What is data caching in Spark?

A

Caching stores an RDD's computed contents in memory, so that later actions on the same RDD reuse the stored result instead of recomputing it from scratch.
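
The effect can be shown by counting recomputations. `ToyDataset` is a hypothetical class written for this illustration; it mimics the idea of caching, not Spark's implementation.

```python
class ToyDataset:
    """Toy dataset whose contents are recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._use_cache = False
        self.computations = 0

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached  # reuse the stored result
        self.computations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


uncached = ToyDataset(lambda: [x * x for x in range(5)])
uncached.collect()
uncached.collect()
print(uncached.computations)  # 2

cached = ToyDataset(lambda: [x * x for x in range(5)]).cache()
cached.collect()
cached.collect()
print(cached.computations)    # 1
```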