GFS and MapReduce Flashcards

1
Q

What is GFS (Google File System)?

A

A distributed file system designed to handle enormous amounts of data (terabytes/petabytes) on commodity hardware while ensuring performance, scalability, reliability, and availability

2
Q

What are the five key design assumptions of GFS?

A

1) Fault Tolerance 2) Large Files 3) Sequential Append-only Writes 4) Sequential Reads 5) High Bandwidth

3
Q

What is the typical chunk size in GFS and why is it large?

A

64 or 128 MB; large chunks reduce interaction with the master, optimize sequential operations, and simplify space management
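To see why a large chunk size cuts master traffic, here is a minimal sketch (plain Python, not GFS code; CHUNK_SIZE and chunk_index are illustrative names) of the offset-to-chunk arithmetic a client performs before contacting the master:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assuming 64 MB chunks

def chunk_index(byte_offset: int) -> int:
    """Index of the chunk that holds the given byte offset."""
    return byte_offset // CHUNK_SIZE

# A sequential scan of 1 GB touches only 16 chunks, so the client needs at
# most 16 (cacheable) lookups from the master.
print(chunk_index(200 * 1024 * 1024))  # -> 3
```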

4
Q

What are the three main components of GFS architecture?

A

1) Master 2) Chunkservers 3) Client

5
Q

What is the role of the Master node in GFS?

A

Manages all file system metadata, including the namespace, access permissions, file-to-chunk mapping, and chunk replica locations
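As a rough illustration of what that metadata looks like, here is a hedged sketch of the three in-memory tables a GFS-like master might keep; the class and field names are invented, not taken from GFS:

```python
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    namespace: dict = field(default_factory=dict)        # path -> attributes (e.g. permissions)
    file_to_chunks: dict = field(default_factory=dict)    # path -> ordered list of chunk handles
    chunk_locations: dict = field(default_factory=dict)   # chunk handle -> replica addresses

meta = MasterMetadata()
meta.namespace["/logs/web.log"] = {"mode": 0o640}
meta.file_to_chunks["/logs/web.log"] = ["chunk-0001", "chunk-0002"]
meta.chunk_locations["chunk-0001"] = ["cs-a:7001", "cs-b:7001", "cs-c:7001"]
```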

6
Q

What is the role of Chunkservers in GFS?

A

Store file data as chunks on local disks, with each chunk replicated three times on different chunkservers

7
Q

What are the steps in a GFS read operation?

A

1) Client translates the file offset into a chunk index and sends the request to the master 2) Master returns the chunk handle and replica locations 3) Client caches this metadata 4) Client requests the data from the nearest replica 5) Chunkserver sends the data
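The five steps can be condensed into a toy read path; everything below (Master, Chunkserver, gfs_read) is an invented stand-in for illustration, not the real GFS interface:

```python
CHUNK_SIZE = 64 * 1024 * 1024

class Master:
    def __init__(self, file_to_chunks, chunk_locations):
        self.file_to_chunks = file_to_chunks      # path -> [chunk handle, ...]
        self.chunk_locations = chunk_locations    # handle -> [replica address, ...]

    def lookup(self, path, chunk_index):
        handle = self.file_to_chunks[path][chunk_index]
        return handle, self.chunk_locations[handle]

class Chunkserver:
    def __init__(self, chunks):
        self.chunks = chunks                      # handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, servers, cache, path, offset, length):
    index = offset // CHUNK_SIZE                           # 1) offset -> chunk index
    if (path, index) not in cache:
        cache[(path, index)] = master.lookup(path, index)  # 2) master returns handle + locations
    handle, replicas = cache[(path, index)]                # 3) metadata cached for later reads
    nearest = replicas[0]                                  # 4) pick the nearest replica
    # (a real client would also split reads that cross a chunk boundary)
    return servers[nearest].read(handle, offset % CHUNK_SIZE, length)  # 5) chunkserver sends data
```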

8
Q

How does GFS ensure data integrity?

A

Each chunk is divided into 64 KB blocks, each with a 32-bit checksum stored separately from the user data; chunkservers verify checksums before serving data
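A minimal sketch of that scheme, assuming CRC-32 as the 32-bit checksum (the card does not name the exact checksum function):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks within a chunk

def block_checksums(chunk: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, stored_checksums: list[int]) -> bool:
    """A chunkserver recomputes and compares checksums before serving data."""
    return block_checksums(chunk) == stored_checksums
```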

9
Q

What is MapReduce?

A

A high-level programming model and framework for parallel processing of large amounts of data on computer clusters

10
Q

What are the two main functions in MapReduce?

A

1) MAP: processes input key-value pairs to produce intermediate key-value pairs 2) REDUCE: aggregates the values associated with the same intermediate key
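For word count, the two user-supplied functions look roughly like this (plain Python generators standing in for the framework's API):

```python
def map_fn(key, value):
    # key: document name (unused here), value: document text
    for word in value.split():
        yield word, 1              # intermediate pair: (word, 1)

def reduce_fn(key, values):
    # key: one word, values: every count emitted for that word
    yield key, sum(values)         # final pair: (word, total occurrences)
```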

11
Q

What are the two main phases of MapReduce execution?

A

1) Map Phase: parallel processing of input data 2) Reduce Phase: grouping and final processing of intermediate pairs
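A self-contained sketch of both phases for word count (sequential Python standing in for the parallel framework), including the grouping of intermediate pairs by key that happens between them:

```python
from collections import defaultdict

docs = ["the cat sat", "the dog sat"]

# Map phase: every input record is turned into intermediate (key, value) pairs.
intermediate = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/group: collect all values emitted for the same intermediate key.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: aggregate each key's values into the final output.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```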

12
Q

What are the two main daemons in Hadoop?

A

1) JobTracker: manages jobs and distributes tasks 2) TaskTracker: executes Map and Reduce tasks

13
Q

How does Hadoop optimize data locality?

A

Hadoop tries to schedule Map tasks on nodes that locally store the corresponding input data blocks, reducing network traffic
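A toy illustration of locality-aware assignment (not actual Hadoop code): prefer a free node that already stores the input block, and fall back to any free node otherwise:

```python
def pick_node(block_replicas, free_nodes):
    for node in free_nodes:
        if node in block_replicas:
            return node          # data-local: the block never crosses the network
    return free_nodes[0]         # non-local fallback: block is read remotely

print(pick_node({"node3", "node7"}, ["node1", "node3", "node9"]))  # -> node3
```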

14
Q

What is a Combiner in MapReduce?

A

An optional function that performs preliminary aggregation of Map output on the mapper's node, reducing the amount of data transferred to the reducers
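For word count, the combiner simply pre-sums the pairs emitted by a single mapper; a sketch, using a toy four-pair map output:

```python
from collections import Counter

map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]

def combine(pairs):
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

print(combine(map_output))  # [('the', 3), ('cat', 1)] -- 2 pairs shuffled instead of 4
```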

15
Q

What are three common MapReduce applications?

A

1) Word Count 2) Inverted Index 3) Join operations
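As one example beyond word count, an inverted index can be expressed with the same map/reduce pattern (illustrative Python, not a real Hadoop job): the output maps each word to the documents that contain it.

```python
def map_fn(doc_id, text):
    for word in text.split():
        yield word, doc_id                 # intermediate pair: word -> one document

def reduce_fn(word, doc_ids):
    yield word, sorted(set(doc_ids))       # final pair: word -> all documents containing it
```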

16
Q

How does GFS handle write operations?

A

Uses a primary-replica scheme in which the primary decides where the data is written, assigns the offset, and coordinates the write with the secondary replicas

17
Q

What is the purpose of the lease mechanism in GFS?

A

Ensures consistency and a well-defined order of write operations by granting mutation control to a primary replica

18
Q

How does GFS handle replica failures during writes?

A

If some replicas fail, the offset value is changed and the write process is restarted

19
Q

How to choose the number of Mappers and Reducers?

A

Make both larger than the number of cluster nodes for load balancing, and keep R (reducers) smaller than M (mappers) because each reducer produces a separate output file
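A back-of-the-envelope sketch of that rule of thumb; the multipliers are illustrative assumptions, not fixed framework settings:

```python
nodes = 20                 # worker nodes in the cluster
M = nodes * 10             # many more map tasks than nodes -> finer load balancing
R = nodes * 2              # fewer reduce tasks than map tasks -> fewer, larger output files

assert M > nodes and R > nodes and R < M
print(M, R)                # 200 40
```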