YARN & MapReduce Flashcards
What is YARN?
Stands for “Yet Another Resource Negotiator”. It is the resource management layer of Hadoop: it manages cluster resources and schedules jobs, and it lets various data processing engines run batch, interactive, and real-time workloads against data stored in HDFS.
Explain YARN architecture
It has 3 main components:
ResourceManager: Allocates cluster resources via two components: a Scheduler (allocates resources to running applications) and an ApplicationsManager (accepts job submissions)
NodeManager: Runs on each cluster node and manages the work on that node by creating and destroying containers as directed.
ApplicationMaster: Manages the life-cycle of a job by directing the NodeManager to create or destroy a container for a job.
What are the YARN Schedulers?
FIFO Scheduler: Places application in queue and runs them in the order of submission (first in, first out)
Capacity Scheduler: Divides cluster capacity among dedicated queues, e.g. separate queues for small and large jobs, so small jobs are not starved behind large ones.
Fair Scheduler: Dynamically balances resources across all running jobs so that, over time, each job receives a fair share of the cluster.
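The Capacity Scheduler above is configured in `capacity-scheduler.xml`. A minimal sketch of the separate-queues idea, assuming two queues named `small` and `large` (the queue names and percentages here are illustrative, not defaults):

```xml
<configuration>
  <!-- Define two top-level queues under the root queue. -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>small,large</value>
  </property>
  <!-- Reserve 30% of cluster capacity for the small-jobs queue... -->
  <property>
    <name>yarn.scheduler.capacity.root.small.capacity</name>
    <value>30</value>
  </property>
  <!-- ...and 70% for the large-jobs queue. -->
  <property>
    <name>yarn.scheduler.capacity.root.large.capacity</name>
    <value>70</value>
  </property>
</configuration>
```

Jobs submitted to the `small` queue then compete only for that queue's share, which is how small jobs avoid waiting behind large ones.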
What is MapReduce and why is it important?
It is a distributed execution framework within the Hadoop ecosystem. It takes away the complexity of distributed programming by exposing two processing steps: Map and Reduce. In the Mapping step, data is split between parallel processing tasks. Transformation logic can be applied to each chunk of data. Once completed, the Reduce phase takes over to handle aggregating data from the Map set.
How does MapReduce work?
Suppose you want to know how many times the words Cat and Dog appear across 1M documents.
Map
The Map phase first splits those documents into smaller blocks distributed across the nodes of the cluster. Each block is then assigned to a mapper for processing. In this case, the map function counts how many times those words appear in each block and outputs key/value pairs. For example, block #1 yields “CAT”: 5 and “DOG”: 6, while block #2 yields “CAT”: 2 and “DOG”: 6.
Shuffle
Sorts the keys so that all data belonging to one key is located on the same worker node.
Reduce
Aggregates the values for each key, yielding “CAT”: 7 and “DOG”: 12.
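The three phases above can be sketched in plain Python. This is a toy, single-process model of the distributed flow (the two strings stand in for HDFS blocks, and the shuffle is just an in-memory grouping), with counts matching the example:

```python
from collections import defaultdict

# Two "blocks" of the document collection; occurrence counts
# match the flashcard example (block #1: CAT 5 / DOG 6, block #2: CAT 2 / DOG 6).
blocks = [
    "CAT " * 5 + "DOG " * 6,
    "CAT " * 2 + "DOG " * 6,
]

def map_phase(block):
    # Map: emit a (word, 1) pair for every occurrence in the block.
    return [(word, 1) for word in block.split()]

def shuffle(mapped_outputs):
    # Shuffle: group all values belonging to the same key together.
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate (sum) the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

mapped = [map_phase(b) for b in blocks]
totals = reduce_phase(shuffle(mapped))
print(totals)  # {'CAT': 7, 'DOG': 12}
```

In a real cluster the map calls run in parallel on different nodes, and the shuffle physically moves each key's data to the node running its reducer.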
What is spilling in MapReduce?
The process of copying data from the in-memory buffer to disk once the buffer fills past a certain threshold (80% of its capacity by default)
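A toy model of spilling, assuming a record-count threshold rather than Hadoop's real byte-based buffer (the function and threshold here are illustrative, not part of the MapReduce API):

```python
import tempfile

def buffered_write(records, buffer_limit=3):
    # Accumulate records in an in-memory buffer; once the buffer
    # reaches the threshold, "spill" its contents to a file on disk
    # and start a fresh buffer.
    buffer, spill_files = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= buffer_limit:
            f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".spill")
            f.write("\n".join(buffer))
            f.close()
            spill_files.append(f.name)
            buffer = []
    # Records below the threshold remain in memory.
    return buffer, spill_files

remaining, spills = buffered_write(["a", "b", "c", "d"])
# "a", "b", "c" are spilled to one file; "d" stays in the buffer
```

In real MapReduce, the spilled files are also sorted and partitioned, then merged before reducers fetch them.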
What is the difference between blocks, input splits and records?
Blocks (Physical Division): Data in HDFS is stored as blocks.
Input Splits (Logical Division): Logical chunks of data to be processed by an individual mapper
Records (Logical Division): Each input split is comprised of records (e.g. in a text file, each line is a record)
What is the role of the RecordReader in MapReduce?
Converts the data present in a file into key/value pairs suitable for consumption by the Mapper task
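A sketch of what the default text RecordReader does: for each line in an input split, it emits the byte offset of the line's start as the key and the line text as the value. This is a simplified stand-in, not Hadoop's actual implementation:

```python
import io

def line_records(split_text):
    # Emit (byte offset, line text) key/value pairs for each line,
    # mirroring how Hadoop's default line-oriented RecordReader
    # feeds records to the Mapper.
    offset = 0
    for line in io.StringIO(split_text):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

records = list(line_records("first line\nsecond line\n"))
print(records)  # [(0, 'first line'), (11, 'second line')]
```

This is why a text Mapper's input key is a byte offset (usually ignored) rather than a line number.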
What is the role of Application Master in MapReduce?
Initializes the job and tracks its progress. Negotiates the resources needed to run the job with the ResourceManager
When using YARN, what happens when you submit a job?
The ResourceManager asks a NodeManager to hold some resources for processing, and the NodeManager guarantees a container that will be available. The ResourceManager then starts a temporary daemon called the ApplicationMaster to take care of the execution; the ApplicationMaster itself runs in one of those containers.