GFS and MapReduce Flashcards
What is GFS (Google File System)?
A distributed file system designed to handle enormous amounts of data (terabytes to petabytes) on commodity hardware while ensuring performance, scalability, reliability, and availability
What are the five key design assumptions of GFS?
1) Component failures are common, so fault tolerance is essential 2) Files are large 3) Writes are mostly sequential appends 4) Reads are mostly large and sequential 5) High sustained bandwidth matters more than low latency
What is the typical chunk size in GFS and why is it large?
64 or 128 MB; large chunks reduce client-master interaction, optimize sequential operations, and simplify space management
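The offset-to-chunk translation behind this design is simple integer arithmetic. A minimal sketch (plain Python, not GFS code; the 64 MB chunk size and the function name are illustrative):

```python
# Minimal sketch (not GFS code): translating a byte range into chunk indexes,
# assuming a 64 MB chunk size.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def byte_range_to_chunk_indexes(offset: int, length: int) -> range:
    """Return the chunk indexes touched by a read of `length` bytes at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)

# Example: a 10 MB read starting at byte offset 130 MB falls entirely in chunk 2.
print(list(byte_range_to_chunk_indexes(130 * 1024**2, 10 * 1024**2)))  # [2]
```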
What are the three main components of GFS architecture?
1) Master 2) Chunkservers 3) Client
What is the role of the Master node in GFS?
Manages all file system metadata, including the namespace, access permissions, the file-to-chunk mapping, and the locations of chunk replicas
What is the role of Chunkservers in GFS?
Store file data as fixed-size chunks on local disks; each chunk is replicated on multiple chunkservers (three replicas by default)
What are the steps in a GFS read operation?
1) Client converts the byte offset into a chunk index and sends the file name and chunk index to the master 2) Master returns the chunk handle and replica locations 3) Client caches this metadata 4) Client requests the data from the nearest replica 5) That chunkserver sends the data
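A hedged sketch of that read flow in Python; GFS has no public API, so the master/chunkserver objects and the method names (lookup, network_distance, read) are hypothetical stand-ins for the steps above:

```python
# Illustrative sketch only: object and method names are hypothetical.
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, filename, offset, length, cache):
    chunk_index = offset // CHUNK_SIZE                 # step 1: compute chunk index
    key = (filename, chunk_index)
    if key not in cache:                               # step 3: reuse cached metadata when possible
        # step 2: master returns the chunk handle and replica locations
        cache[key] = master.lookup(filename, chunk_index)
    handle, replicas = cache[key]
    nearest = min(replicas, key=lambda r: r.network_distance)   # step 4: pick nearest replica
    return nearest.read(handle, offset % CHUNK_SIZE, length)    # step 5: chunkserver sends the data
```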
How does GFS ensure data integrity?
Each chunk is divided into 64 KB blocks, each protected by a 32-bit checksum stored separately from user data; chunkservers verify these checksums before serving data
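A minimal sketch of the idea, assuming CRC32 as the 32-bit checksum (an assumption; the flashcard does not name a specific function) and illustrative helper names:

```python
# Sketch of per-block checksumming: one 32-bit checksum per 64 KB block,
# kept apart from the data and verified before a read is served.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB

def compute_checksums(chunk_data: bytes) -> list[int]:
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_before_serving(chunk_data: bytes, stored_checksums: list[int]) -> bool:
    # On a mismatch, the chunkserver would report the corruption and the
    # client would read the block from another replica.
    return compute_checksums(chunk_data) == stored_checksums
```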
What is MapReduce?
A high-level programming model and framework for parallel processing of large amounts of data on computer clusters
What are the two main functions in MapReduce?
1) MAP: processes input key-value pairs to produce intermediate key-value pairs 2) REDUCE: aggregates all values associated with the same intermediate key
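A small sketch of the two functions using the classic word-count example (plain Python generators, not a specific framework's API):

```python
# Word count as the two user-supplied MapReduce functions.
def map_fn(doc_id: str, text: str):
    # MAP: input pair (doc_id, text) -> intermediate pairs (word, 1)
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word: str, counts: list[int]):
    # REDUCE: all values for the same intermediate key are aggregated
    yield word, sum(counts)
```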
What are the two main phases of MapReduce execution?
1) Map Phase: input data is processed in parallel to produce intermediate pairs 2) Reduce Phase: intermediate pairs are grouped by key and processed to produce the final output
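A toy, sequential driver illustrating the two phases (map, group-by-key, reduce), reusing the map_fn/reduce_fn from the sketch above; a real framework runs these steps in parallel across a cluster:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: process every input pair and group intermediate pairs by key
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: aggregate the grouped values for each intermediate key
    output = []
    for ikey, ivalues in intermediate.items():
        output.extend(reduce_fn(ikey, ivalues))
    return output

docs = [("d1", "the cat sat"), ("d2", "the dog sat")]
print(run_mapreduce(docs, map_fn, reduce_fn))  # e.g. [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```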
What are the two main daemons in Hadoop?
1) JobTracker: manages jobs and distributes tasks 2) TaskTracker: executes Map and Reduce tasks
How does Hadoop optimize data locality?
Tries to execute Map tasks on nodes that locally contain the data blocks to reduce network traffic
What is a Combiner in MapReduce?
An optional function that performs preliminary, map-side aggregation of intermediate data, reducing the amount of data transferred to the reducers
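For word count, the combiner can simply reuse the reducer's logic on each mapper's local output; a hedged sketch:

```python
# Combiner sketch: same shape as the reducer, applied locally to one mapper's output.
def combine_fn(word: str, local_counts: list[int]):
    # Pre-aggregate on the map side so fewer (word, count) pairs cross the network
    yield word, sum(local_counts)
```

A mapper that emitted ('the', 1) three times then ships a single ('the', 3) to the reducer.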
What are three common MapReduce applications?
1) Word Count 2) Inverted Index 3) Join operations
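Word count is sketched above; an illustrative sketch of the inverted-index case, mapping each word to the documents that contain it (function names are hypothetical):

```python
# Inverted index as map/reduce functions (illustration only).
def index_map(doc_id: str, text: str):
    for word in set(text.lower().split()):
        yield word, doc_id                    # intermediate pair: (word, doc_id)

def index_reduce(word: str, doc_ids: list[str]):
    yield word, sorted(set(doc_ids))          # final pair: (word, [documents containing it])
```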