GFS and MapReduce Flashcards
What is GFS (Google File System)?
A distributed file system designed to handle enormous amounts of data (terabytes to petabytes) on commodity hardware while ensuring performance, scalability, reliability, and availability
What are the five key design assumptions of GFS?
1) Component failures are common, so fault tolerance is essential 2) Files are large 3) Writes are mostly sequential appends 4) Reads are mostly large and sequential 5) High sustained bandwidth matters more than low latency
What is the typical chunk size in GFS and why is it large?
64 or 128 MB; large chunks reduce client-master interaction, optimize sequential operations, and simplify space management
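The offset-to-chunk translation behind this design is simple integer arithmetic. A minimal sketch (plain Python, not GFS code; the 64 MB chunk size and the function name are illustrative):

```python
# Minimal sketch (not GFS code): translating a byte range into chunk indexes,
# assuming a 64 MB chunk size.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def byte_range_to_chunk_indexes(offset: int, length: int) -> range:
    """Return the chunk indexes touched by a read of `length` bytes at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)

# Example: a 10 MB read starting at byte offset 130 MB falls entirely in chunk 2.
print(list(byte_range_to_chunk_indexes(130 * 1024**2, 10 * 1024**2)))  # [2]
```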
What are the three main components of GFS architecture?
1) Master 2) Chunkservers 3) Client
What is the role of the Master node in GFS?
Manages all file system metadata, including the namespace, access permissions, the file-to-chunk mapping, and the locations of chunk replicas
What is the role of Chunkservers in GFS?
Store file data as fixed-size chunks on local disks; each chunk is replicated on multiple chunkservers (three replicas by default)
What are the steps in a GFS read operation?
1) Client converts the byte offset into a chunk index and sends the file name and chunk index to the master 2) Master returns the chunk handle and replica locations 3) Client caches this metadata 4) Client requests the data from the nearest replica 5) That chunkserver sends the data
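A hedged sketch of that read flow in Python; GFS has no public API, so the master/chunkserver objects and the method names (lookup, network_distance, read) are hypothetical stand-ins for the steps above:

```python
# Illustrative sketch only: object and method names are hypothetical.
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, filename, offset, length, cache):
    chunk_index = offset // CHUNK_SIZE                 # step 1: compute chunk index
    key = (filename, chunk_index)
    if key not in cache:                               # step 3: reuse cached metadata when possible
        # step 2: master returns the chunk handle and replica locations
        cache[key] = master.lookup(filename, chunk_index)
    handle, replicas = cache[key]
    nearest = min(replicas, key=lambda r: r.network_distance)   # step 4: pick nearest replica
    return nearest.read(handle, offset % CHUNK_SIZE, length)    # step 5: chunkserver sends the data
```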
How does GFS ensure data integrity?
Each chunk is divided into 64 KB blocks, each protected by a 32-bit checksum stored separately from user data; chunkservers verify these checksums before serving data
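A minimal sketch of the idea, assuming CRC32 as the 32-bit checksum (an assumption; the flashcard does not name a specific function) and illustrative helper names:

```python
# Sketch of per-block checksumming: one 32-bit checksum per 64 KB block,
# kept apart from the data and verified before a read is served.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB

def compute_checksums(chunk_data: bytes) -> list[int]:
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_before_serving(chunk_data: bytes, stored_checksums: list[int]) -> bool:
    # On a mismatch, the chunkserver would report the corruption and the
    # client would read the block from another replica.
    return compute_checksums(chunk_data) == stored_checksums
```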
What is MapReduce?
A high-level programming model and framework for parallel processing of large amounts of data on computer clusters
What are the two main functions in MapReduce?
1) MAP: processes input key-value pairs to produce intermediate key-value pairs 2) REDUCE: aggregates all values associated with the same intermediate key
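A small sketch of the two functions using the classic word-count example (plain Python generators, not a specific framework's API):

```python
# Word count as the two user-supplied MapReduce functions.
def map_fn(doc_id: str, text: str):
    # MAP: input pair (doc_id, text) -> intermediate pairs (word, 1)
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word: str, counts: list[int]):
    # REDUCE: all values for the same intermediate key are aggregated
    yield word, sum(counts)
```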
What are the two main phases of MapReduce execution?
1) Map Phase: input data is processed in parallel to produce intermediate pairs 2) Reduce Phase: intermediate pairs are grouped by key and processed to produce the final output
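A toy, sequential driver illustrating the two phases (map, group-by-key, reduce), reusing the map_fn/reduce_fn from the sketch above; a real framework runs these steps in parallel across a cluster:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: process every input pair and group intermediate pairs by key
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: aggregate the grouped values for each intermediate key
    output = []
    for ikey, ivalues in intermediate.items():
        output.extend(reduce_fn(ikey, ivalues))
    return output

docs = [("d1", "the cat sat"), ("d2", "the dog sat")]
print(run_mapreduce(docs, map_fn, reduce_fn))  # e.g. [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```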
What are the two main daemons in Hadoop?
1) JobTracker: manages jobs and distributes tasks 2) TaskTracker: executes Map and Reduce tasks
How does Hadoop optimize data locality?
Tries to execute Map tasks on nodes that locally contain the data blocks to reduce network traffic
What is a Combiner in MapReduce?
An optional function that performs preliminary, map-side aggregation of intermediate data, reducing the amount of data transferred to the reducers
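For word count, the combiner can simply reuse the reducer's logic on each mapper's local output; a hedged sketch:

```python
# Combiner sketch: same shape as the reducer, applied locally to one mapper's output.
def combine_fn(word: str, local_counts: list[int]):
    # Pre-aggregate on the map side so fewer (word, count) pairs cross the network
    yield word, sum(local_counts)
```

A mapper that emitted ('the', 1) three times then ships a single ('the', 3) to the reducer.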
What are three common MapReduce applications?
1) Word Count 2) Inverted Index 3) Join operations
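Word count is sketched above; an illustrative sketch of the inverted-index case, mapping each word to the documents that contain it (function names are hypothetical):

```python
# Inverted index as map/reduce functions (illustration only).
def index_map(doc_id: str, text: str):
    for word in set(text.lower().split()):
        yield word, doc_id                    # intermediate pair: (word, doc_id)

def index_reduce(word: str, doc_ids: list[str]):
    yield word, sorted(set(doc_ids))          # final pair: (word, [documents containing it])
```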