MapReduce Flashcards
MapReduce
A MapReduce __ is a unit of work that the client wants to be performed
Input data and the MapReduce program are deployed to ____ for execution
The job is broken down into ___
____ is a process of identifying _____
____ is a process of iterative ___ to identify desired values
A MapReduce job is a unit of work that the client wants to
be performed
Input data and the MapReduce program are deployed to
server nodes for execution
The job is broken down into tasks (map and reduce)
Mapping is a process of identifying key/value pairs
Reducing is a process of iterative sorting to identify
desired values
Key Value Mapping
___ function extracts the year and temp (celsius)
Year and temp are the ___
Mapper function extracts the year and temperature (celsius)
Year and temperature values are the output
Output
__ are output and sorted
Values are output and sorted
Reducing
___ iterates through the list and selects the maximum temp values
Reducer
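A minimal Java sketch of the cards above, assuming each input line holds a comma-separated year and temperature (the real record format is not given here, and the class names are illustrative): the mapper emits (year, temp) pairs and the reducer keeps the maximum per year.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  // Mapper: extracts the year and temperature from each line
  public static class TempMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",");   // assumed "year,temperature" layout
      context.write(new Text(parts[0]),                              // key: year
          new IntWritable(Integer.parseInt(parts[1].trim())));       // value: temperature
    }
  }

  // Reducer: iterates through the values for a year and keeps the maximum
  public static class MaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable t : temps) {
        max = Math.max(max, t.get());
      }
      context.write(year, new IntWritable(max));
    }
  }
}
```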
MapReduce Process Flow
Map extracts ____
Shuffle collects ___
Reduce locates ____
Unix command equivalent function is ____
Map extracts values
Shuffle collects values
Reduce locates specific output
Unix command equivalent functions are shown under the flow diagram
Map Tasks
Input __
MapReduce ____
Configuration ____
Map ___
One map task is assigned to each ___
Input data
MapReduce program
Configuration information
Map tasks
One map task is assigned to each split
Map Selection of Nodes
Tries to run the map tasks on a node ____ where the data resides
If full looks for a ______ on the same server rack
If busy, an ___ off-rack node is supplied
Tries to run the map tasks on a node where the data resides
If full, looks for a free map node on the same server rack
If busy, an off-rack node is supplied
Reduce Tasks
Input is the ____ from all mappers
Output is ______ on the node where reduce is running
Input is the output from all mappers
Output is transferred and merged on the node where reduce is
running
The Shuffle
You can choose the ____ for a job
When multiple, map tasks ____, allocating one partition for each ____
You can choose the number of reduce tasks for a job
When multiple, map tasks partition output, allocating one
partition for each reduce task
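A sketch of how the number of reduce tasks and the map-output partitioning can be set on a job. The YearPartitioner class here is only illustrative (Hadoop's default is HashPartitioner); the key/value types match the max-temperature sketch above.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitioningExample {

  // A custom Partitioner decides which reduce partition each map output record lands in
  public static class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      // hash the key and fold it into the range [0, numReduceTasks)
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }

  static void configure(Job job) {
    job.setNumReduceTasks(4);                       // one partition per reduce task
    job.setPartitionerClass(YearPartitioner.class);
  }
}
```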
Combiner Function
_____ is used when there are multiple spill files from mappers
A combiner is used when there are multiple spill files from mappers
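A sketch of wiring in a combiner on the job. It reuses the reducer from the earlier max-temperature sketch, which works because taking a maximum is associative and commutative; whether that is appropriate depends on the actual reduce function.

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
  static void configure(Job job) {
    // MaxTemperature.MaxReducer is the reducer sketched earlier in these notes
    job.setCombinerClass(MaxTemperature.MaxReducer.class);
    job.setReducerClass(MaxTemperature.MaxReducer.class);
  }
}
```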
YARN
-Yet Another Resource Negotiator
-Sometimes called the operating system of a ____
-With so many applications running, there was a need for something to coordinate access to the _____
-With YARN, ___ is not limited to MapReduce accessing data
Yet Another Resource Negotiator
Sometimes called the operating system of a cluster
With so many applications running, there was a need for
something to coordinate access to the system resources
With YARN, Hadoop is not limited to MapReduce accessing
data
Multiple Application Engines
Batch programs (____,___)
Interactive SQL (___,___)
Advanced Analytics (___)
Streaming(_____)
Batch programs (MapReduce, Spark)
Interactive SQL (Hive, Impala)
Advanced Analytics (Spark)
Streaming (Spark Streaming)
MapReduce Jobtracker performs job ____, ____ and ____
YARN uses ____ resources for these functions
MapReduce Jobtracker performs job scheduling, progress
monitoring, and job history
YARN uses three separate resources for these functions
CHART
MapReduce 1 (YARN equivalent)
-Jobtracker (Resource Manager, Application Master, Timeline Server)
-Tasktracker (Node Manager)
-Slot (Container)
Scalability Improvement
MapReduce can scale to _______
YARN can scale to ______
YARN provides ____
YARN manages a pool of ______
MapReduce can scale to 4K nodes and 40,000 tasks
YARN can scale to 10K nodes and 100,000 tasks
YARN provides high availability
YARN manages a pool of resources versus fixed slots
YARN Daemons
-______(RM)
Runs on ___
Global resource ____
Arbitrates system resources between ______
Pluggable scheduler to support _____
-_____(NM)
Runs on ____
Communicates with ___
Resource Manager (RM)
Runs on master node
Global resource scheduler
Arbitrates system resources between competing applications
Pluggable scheduler to support different algorithms
Node Manager (NM)
Runs on worker nodes
Communicates with the RM
YARN Daemon Model
look at graph
Applications on YARN
-_____
-Created by the RM ___
-Allocate a certain amount of resources (____,___) on a worker node
-Applications run in ______
-_________
-One per ___
- ___/___ specific
-runs in a ____
-Requests more containers to ____
Containers
Created by the RM upon request
Allocate a certain amount of resources (memory, CPU) on a
worker node
Applications run in one or more containers
Application Master (AM)
One per application
Framework/application specific
Runs in a container
Requests more containers to run application tasks
FIFO Scheduling
-Simple
-Not suitable for ______
-A large application will ____
Simple
Not suitable for shared clusters
A large application will backlog others
Capacity Scheduling
-provides ____ by queue
-____ can be aligned with organization functions
-______ allows idle resources to be shared
Provides multiple parallel jobs by queue
Queues can be aligned with organization functions
Queue elasticity allows idle resources to be shared
Fair Scheduler
-___ are dynamically balanced for resources
-As successive jobs are scheduled within a queue, that queue shares the _____ of its resources
Queues are dynamically balanced for resources
As successive jobs are scheduled within a queue, that queue shares an equal allocation of its resources
Working with YARN
Hadoop includes ____ major YARN tools for developers
-
-
-
-____ need to be able to
-_____ to run on YARN cluster
-__ and ___ jobs
Hadoop includes three major YARN tools for developers
HUE job browser
YARN Web UI
YARN command line
Developers need to be able to
Submit jobs (applications) to run on YARN cluster
Monitor and manage jobs
Working with YARN
HUE
-___ status of a job
-___ logs
-kill a ____
YARN Web UI
- ____ is the main entry point
-More detailed view than ___
-No __ or configuration
HUE
Monitor status of job
View logs
Kill a running job
YARN Web UI
RM UI is the main entry point
More detailed view than HUE
No controls or configurations
Working with the YARN Application
YARN command line
Most tools are for ___ versus developers
yarn <command>
yarn application
- use ____ to see running applications
- use ____ to kill a running application
-use _____ to view YARN logs
YARN command line
Most tools are for administrators versus developers
yarn <command>
yarn application
Use -list to see running applications
Use -kill to kill a running application
Use yarn logs -applicationId <app-id> to view YARN logs
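For completeness, a hedged sketch of doing the same listing and killing programmatically with the YarnClient API instead of the CLI commands above; it assumes yarn-site.xml is on the classpath, and the class name is illustrative.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListApplications {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());   // reads yarn-site.xml from the classpath
    yarn.start();
    try {
      // Roughly what "yarn application -list" shows
      List<ApplicationReport> apps = yarn.getApplications();
      for (ApplicationReport app : apps) {
        System.out.println(app.getApplicationId() + "  " + app.getName()
            + "  " + app.getYarnApplicationState());
      }
      // yarn.killApplication(appId) is the programmatic counterpart of "-kill"
    } finally {
      yarn.stop();
    }
  }
}
```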
YARN MapReduce
See chart
MapReduce Job Submission
____ is submitted from a client
receives ____
Ensures an output ______
computes input splits ____
Replicates job ____ file with the ___ across the cluster (10x)
MapReduce job is submitted from a client
Receives application ID
Ensures an output specification/directory
Computes input splits (for parallel processing)
Replicates the job JAR file with the application ID across the cluster (10x)
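A minimal client-side submission sketch using the standard Job API. The mapper and reducer classes are the illustrative ones from the earlier sketches, and the input/output paths are taken from command-line arguments as examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "max temperature");
    job.setJarByClass(SubmitJob.class);               // the job JAR replicated across the cluster
    job.setMapperClass(MaxTemperature.TempMapper.class);
    job.setReducerClass(MaxTemperature.MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits are computed from this
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // client receives the application ID here
  }
}
```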
YARN Resource Manager
-Coordinates allocation of ____ for the cluster
-Schedules ___ for MapReduce tasks
-Engages the application _____
-__ the ____ every second for status
Coordinates allocation of computing resources for the cluster
Schedules containers for MapReduce tasks
Engages the application master process in a node
Polls the Application Master every second for status
Resource Manager Failure
Single point of ___
____ is achieved with a standby ____
Status is captured in ______ or ____
Node manager is not stored in this _____
Single point of failure
High availability is achieved with a standby Resource Manager
Status is captured in HDFS or Zookeeper
Node manager is not stored in this recovery file
YARN Application Master
-___ the tasks running the ___
-Runs in a ___ with a ____
-Keeps track of each ____
Coordinates the tasks running the MapReduce job
Runs in a container with a MapReduce Task
Keeps track of each node's progress
Application Master Writing Splits
____ creates a map task for each input split
creates a number of ____ tasks based on setting
____ tasks to
-under 10 mappers/1 reducer/1 block runs in the AM's own JVM
-Large jobs assign more tasks in ___
Sets up ___
Application Master creates a map task for each input split
Creates a number of reduce tasks based on setting
Evaluates/assigns tasks to
- Small jobs (under 10 mappers, 1 reducer, input within 1 block) run in the AM's own JVM
- Large jobs assign more tasks in parallel
Sets up the job output
Application Master Failure
___ heartbeats are sent to the ___
if an update is not received, _____ creates a new _____
AM fail __ times (default) before the job is failed
Periodic heartbeats are sent to the Resource Manager
If an update is not received, Resource Manager creates a new
Application Master
AM may fail 2 times (default) before the job is failed
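A sketch of the retry setting behind this card. The property name mapreduce.am.max-attempts is the standard Hadoop one as I understand it (default 2), and the value shown is only an example.

```java
import org.apache.hadoop.conf.Configuration;

public class AmRetryConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.am.max-attempts", 2);   // default 2, as on the card above
    // Note: yarn.resourcemanager.am.max-attempts on the cluster also caps this value
    return conf;
  }
}
```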
YARN Node Managers
-___ and ___ compute containers on machines in the cluster
Launch and monitor compute containers on machines in the
cluster
Node Task Assignment
___ requests containers from the ____
Each task is provided ___ of memory and __ processing core
Requests for ___ are fulfilled first with ___
Requests for Reduce tasks are made after ___
Honors running reduce tasks on the ____
If the tasks are not on the ___, the execution is still honored
Application Master requests containers from the Resource
Manager
Each task is provided 1024 MB of memory and one processing
core
Requests for Map tasks are fulfilled first with data locality
Requests for Reduce tasks are made after 5% map progress
Honors running reduce tasks on the same node as the map
If the tasks are not on the same rack, the execution is still honored
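A sketch of the container-sizing and reduce slow-start settings behind this card; the property names are the standard Hadoop ones as I understand them, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;

public class TaskResourceConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);      // default 1024 MB per map task
    conf.setInt("mapreduce.reduce.memory.mb", 3072);   // default 1024 MB per reduce task
    conf.setInt("mapreduce.map.cpu.vcores", 1);        // default 1 processing core
    conf.setInt("mapreduce.reduce.cpu.vcores", 1);     // default 1 processing core
    // Reduce requests start once this fraction of maps has completed (default 0.05 = 5%)
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
    return conf;
  }
}
```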
Node Task Execution
Application Master contacts the ____
Executes the task with ____
Server nodes ____
Output committers ____
Application Master contacts the Node Manager
Executes the task with local node resources
Server nodes run independently
Output committers write output files
Node Manager/Task failure
_____ fails, node manager updates Application Master
Application Master kills tasks with no ______
AM reschedules failed tasks on a different ___
Tasks can only fail __ times (configurable) before the whole job is failed
___ (not used) tasks do not count
___ percentage of failures can also be configured
If the Java virtual machine (JVM) fails, the Node Manager updates the Application Master
Application Master kills tasks with no update over 10 min
AM reschedules failed tasks on different node
Tasks can only fail 4 times (configurable) before the whole job is failed
AM killed (not used) tasks do not count
Percentage of failures can also be configured
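A sketch of the retry and failure-tolerance settings behind this card; the property names are the standard Hadoop ones as I understand them, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;

public class TaskRetryConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.maxattempts", 4);        // a map task may fail 4 times (default)
    conf.setInt("mapreduce.reduce.maxattempts", 4);     // same default for reduce tasks
    conf.setLong("mapreduce.task.timeout", 600_000L);   // kill a task with no update for 10 min
    // Optionally tolerate a percentage of failed tasks instead of failing the whole job
    conf.setInt("mapreduce.map.failures.maxpercent", 5);
    conf.setInt("mapreduce.reduce.failures.maxpercent", 5);
    return conf;
  }
}
```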
Application master updates
overall job status _____
status on each task in the job (same….)
Reading an input ___
Map progress _____
Reduce progress ____
writing an ____
Overall job status (running, successful, completed, failed)
Status on each task in the job (same ….)
Reading an input record
Map progress (% of task completed)
Reduce progress (estimate % of input processed)
Writing an output record
Job Status updates
-Tasks report back to the ___
-Application Master reports ____
-On completion, the Application Master and ___ clean up their working state
-history of the ____ is ___
Tasks report back to the Application Master
Application Master reports status to the client
On completion, the Application Master and task containers clean up their working state
History of the task completion is archived
Shuffle
-___ to every reducer has a __
-Shuffle is the process used to __ and __ map output to reducers
Input (map) to every reducer has a key
Shuffle is the process used to sort and transfer map output to
reducers
Mapping Output
-100 MB ___ where output is written
-At ___ (80MB) the buffer begins the ___
-Continues in parallel; if the buffer fills, the map is blocked until the spill is ____
-____ are created based on the __ reducer
-In-memory sort by __ is performed within the partition
-____ are merged into a ___ and sorted output file
100 MB memory buffer where output is written
At 80% (80 MB) the buffer begins the spill to disk
Spilling continues in parallel; if the buffer fills, the map is blocked until the spill is completed
Partitions are created based on the destination ("to") reducer
In-memory sort by key is performed within the partition
Spill files are merged into a single partitioned and sorted
output file
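A sketch of the buffer and spill settings described above; the property names are what I understand to be the standard Hadoop ones, and the values shown are the defaults mentioned on these cards.

```java
import org.apache.hadoop.conf.Configuration;

public class SpillConfig {
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 100);             // 100 MB in-memory output buffer
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill to disk at 80% full
    conf.setInt("mapreduce.task.io.sort.factor", 10);          // streams merged at once ("merge factor")
    conf.setInt("mapreduce.map.combine.minspills", 3);         // combiner runs with 3+ spill files
    return conf;
  }
}
```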
Combiners
Combiner uses the output of each ____
If there are ___ or more spill files, a combiner is run
Combiner uses the output of each partition sort
If there are 3 or more spill files a combiner is run
Copy Phase
Reducer asks the Application Master for ___
Reduce tasks start copying their map outputs as soon as they ______
Several ___ can be copying in ___
Reducer asks Application Master for map output
Reduce tasks start copying their map outputs as soon as they
are ready as the “copy phase”
Several threads can be copying in parallel
Merge Phase (Sort)
-After all map outputs are copied ___
-Aligned by the ___
After all map outputs are copied, sorting begins
Aligned by the “merge factor”
look at chart
MapReduce Process
map to shuffle/sort to reduce
look at chart
Input Text Data File
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant
manner.
A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically, both the
input and the output of the job are stored in a file-system. The framework takes
care of scheduling tasks, monitoring them and re-executes the failed tasks.
Typically, the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture
Guide) are running on the same set of nodes. This configuration allows the framework
to effectively schedule tasks on the nodes where data is already present, resulting
in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture
Guide).
Minimally, applications specify the input/output locations and supply map and reduce
functions via implementations of appropriate interfaces and/or abstract-classes.
These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable etc.) and configuration
to the ResourceManager which then assumes the responsibility of distributing the
software/configuration to the workers, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java, MapReduce applications need
not be written in Java.
LOOK AT PIC
Mapper Takes Key/Value Pairs
(0,Hadoop MapReduce is a software framework for easily writing
applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands
of nodes) of commodity hardware in a reliable, fault-tolerant
manner.)
(246,A MapReduce job usually splits the input data-set into
independent chunks which are processed by the map tasks in a
completely parallel manner. The framework sorts the outputs of
the maps, which are then input to the reduce tasks. Typically,
both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks,
monitoring them and re-executes the failed tasks.)
(653, Typically, the compute nodes and the storage nodes are
the same, that is, the MapReduce framework and the Hadoop
Distributed File System (see HDFS Architecture Guide) are
running on the same set of nodes. This configuration allows the
framework to effectively schedule tasks on the nodes where data
is already present, resulting in very high aggregate bandwidth
across the cluster.)
LOOK AT PIC
Mappers
-number of mappers determined by ___
Typically we can use __ for a small text file
-try to split at the ___ staying on one node
-____ will bog down a job with overhead
-Map spills are written to __________
-Map output is not saved, only ___ if _____
Number of mappers determined by input splits
Typically, we use one mapper for a small text file
Try to split at the block size staying on one node
Small splits will bog down a job with overhead
Map spills are written to local disk, not HDFS
Map output is not saved; the map is only re-run if necessary
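A minimal word-count mapper sketch that would emit pairs like the list that follows; tokenization is simplified to whitespace splitting and the class name is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The key is the byte offset of the line (0, 246, 653, ... above);
    // the value is the line of text. Emit (word, 1) for every token.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}
```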
Mappers Provide Key Value Pairs
(Hadoop, 1)
(MapReduce, 1)
(is, 1)
(a, 1)
(software, 1)
(framework, 1)
(for, 1)
(easily, 1)
(writing, 1)
(applications, 1)
(which, 1)
(process, 1)
(vast, 1)
(amounts, 1)
(of, 1)
(data, 1)
(multi-terabyte, 1)
(data-sets, 1)
(in-parallel, 1)
(on, 1)
(large, 1)
(clusters, 1)
(thousands, 1)
(of, 1)
(nodes, 1)
(of, 1)
(commodity, 1)
(hardware, 1)
(in, 1)
(a, 1)
(reliable, 1)
(fault-tolerant, 1)
(manner, 1)
LOOK AT PIC
Key Value Pairs Input to Reducers
(key, value)
(of,[1,1,1,1])
(a,[1,1,1,1])
(writing,1)
(which,1)
(vast,1)
(thousands,1)
(software,1)
(reliable,1)
(process,1)
(on,1)
(nodes,1)
(multi-terabyte,1)
(manner,1)
(large,1)
(is,1)
(key, value)
(in-parallel,1)
(in,1)
(hardware,1)
(framework,1)
(for,1)
(fault-tolerant,1)
(easily,1)
(data-sets,1)
(data,1)
(commodity,1)
(clusters,1)
(applications,1)
(amounts,1)
(MapReduce,1)
(Hadoop,1)
LOOK AT PIC
Reducers
-Number of reducers is determined by the ___
-There is a file written for each _____
-Output of reducers is written to HDFS with ___
-If there are ____, how many files are saved
-Files are usually ___ across the network
-Files will have more than ____, all files will have ____
Number of reducers is determined by the user
There is a file written for each reducer/partition
Output of reducers is written to HDFS with 3 replications
If there are two partitions, how many files are saved?
Files are usually transferred across the network
Files will have more than one key, all files will have distinct
keys
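The matching word-count reducer sketch: it sums the 1s for each key, turning, for example, (of,[1,1,1,1]) into (of,4) as in the list that follows. The class name is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    // Output is written to HDFS, one file per reducer/partition
    context.write(word, new IntWritable(sum));
  }
}
```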
Key value pairs after reduce
(of,4)
(a,4)
(writing,1)
(which,1)
(vast,1)
(thousands,1)
(software,1)
(reliable,1)
(process,1)
(on,1)
(nodes,1)
(multi-terabyte,1)
(manner,1)
(large,1)
(is,1)
(in-parallel,1)
(in,1)
(hardware,1)
(framework,1)
(for,1)
(fault-tolerant,1)
(easily,1)
(data-sets,1)
(data,1)
(commodity,1)
(clusters,1)
(applications,1)
(amounts,1)
(MapReduce,1)
(Hadoop,1)
LOOK AT PIC
MapReduce Flow
map–map–map-map
|
shuffle/sort
|
reduce–reduce–reduce -reduce
LOOK AT PIC
Summary
MapReduce provides an efficient way to process large data sets
in parallel
MapReduce is the foundation for many big data applications
Mapping identifies key/value pairs
Reducing identifies specific output for a solution
Mapping writes to multiple nodes as close as possible
Mappers may provide duplicate files to each Reducer
YARN provides a more scalable efficient approach to
MapReduce
Resource Manager coordinates allocation of computing
resources for the cluster
Application Master coordinates the tasks running the
MapReduce job
Node Managers launch and monitor compute containers on
machines in the cluster
Shuffle is the process used to sort and transfer map output to
reducer