Exam Questions Flashcards
Take a GFS computing cluster with 300 chunk servers, each with 10 TB of free disk and 10 GB of RAM available, a chunk size of 100 MB, and the standard replication factor of 3. About 10 KB of metadata needs to be stored per chunk handle.
You want to store a very big file on this cluster. What is the maximum file size that is possible to store on this cluster, and why?
You can store 1 PB (= 1000 TB = 10^15 B) of unique file data on the disks (300 × 10 TB = 3 PB raw, divided by the replication factor of 3), i.e. 10^7 unique chunks of 100 MB.
However, you can only fit metadata for 10^6 unique chunks (10 GB / 10 KB) in the master's 10 GB of RAM; this limits the file size to 10^6 × 100 MB = 100 TB (about 3% of the raw 3 PB capacity).
Note that the 10 KB of metadata is needed per unique chunk, not per replica (replicas differ only in, e.g., location and version).
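The two limits can be checked with a quick back-of-the-envelope calculation (a plain Python sketch using the figures from the question; variable names are illustrative):

```python
servers = 300
disk_per_server = 10 * 10**12      # 10 TB of free disk per chunk server
ram_master = 10 * 10**9            # 10 GB of RAM on the master
chunk_size = 100 * 10**6           # 100 MB per chunk
replication = 3
meta_per_chunk = 10 * 10**3        # 10 KB of master metadata per unique chunk

# Disk limit: raw capacity divided by the replication factor.
disk_limit = servers * disk_per_server // replication   # 1 PB of unique data

# RAM limit: how many unique chunks' metadata fit on the master.
chunks_by_ram = ram_master // meta_per_chunk            # 10**6 unique chunks
ram_limit = chunks_by_ram * chunk_size                  # 100 TB

print(min(disk_limit, ram_limit))  # the RAM limit wins: 10**14 B = 100 TB
```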
GFS: What metadata is kept by the master node?
Three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas.
GFS: In what type of memory is metadata stored on the master node, and why is this advantageous?
In RAM, which makes master operations fast and periodic background scans of the full state cheap. The file and chunk namespaces and the mappings are also kept persistent in an operation log on the master's local disk and replicated on a few remote machines.
If two or more applications want to overwrite the same data chunk (starting at the same offset) concurrently, what does GFS guarantee about the resulting chunk data?
The data will be consistent (all replicas are identical) but undefined (typically it consists of mingled fragments from multiple mutations).
GFS: A client can append to existing files. True or false, and explain why.
True: clients can append to existing files concurrently, and GFS's record append makes each append atomic (the data is written at least once, at an offset chosen by GFS).
GFS: Client applications need to know the chunk index in order to read or write in a file. True or false, and explain why.
True. The client computes the chunk index from the byte offset within the file (offset divided by the fixed chunk size). It is not to be confused with the chunk handle, an identifier that is retrieved from the master server.
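A minimal sketch of that translation (plain Python; the constant and function name are illustrative, not GFS API):

```python
CHUNK_SIZE = 100 * 10**6  # 100 MB, the fixed chunk size from the exercise

def chunk_index(file_offset):
    # The client derives the chunk index purely from the byte offset.
    return file_offset // CHUNK_SIZE

# The client then sends (filename, chunk_index) to the master and gets
# back the chunk handle plus the replica locations.
print(chunk_index(250 * 10**6))  # byte 250,000,000 lies in chunk index 2
```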
GFS: The master keeps a record of all chunk locations at all times. True or false, and explain why.
Mostly true, but the chunk locations are in fact requested from the chunk servers at master startup or when a new chunk server joins the cluster, and they are kept only in RAM (not persisted in the operation log).
List the names of the frameworks for distributed storage of big data that you know from the course.
GFS, Bigtable or their Apache alternatives.
For each such framework, what aspects of the system are indeed centralized?
GFS: the store for the namespace, mapping, and metadata; the first two steps of read/write operations (the client asks the master for chunk locations); and chunk management (migration, leases, garbage collection).
Bigtable: the store for the location of the root tablet (the master handles tablet assignment, but reads do not go through it to locate tablets); clients cache tablet locations, so they rarely communicate with the master; only global tablet management is centralized (as in GFS).
For each such framework, what (if any) measures are taken when the “master” machine fails, to achieve fault tolerance?
For GFS, a shadow master is kept up to date with the master's operation log and checkpoints.
For Bigtable, the critical state is kept in Chubby, which runs with 5 replicas.
Spark: Is an RDD partition big data?
No. A partition can exceptionally be big if there are a lot of records per key. Initially, 1 partition = 1 chunk read from disk.
List two ways in which the Spark map operation is different from the MapReduce map task.
Spark map is batch-oriented (it takes one batch, a set of tuples, as input and outputs one batch);
it does not necessarily read its input from disk, nor save its output to buffers;
it takes a function f as a closure.
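The closure point can be illustrated without a Spark installation (plain Python stand-in; in real Spark, `rdd.map(f)` ships the closure `f`, including captured variables, to the executors):

```python
factor = 3  # captured by the closure; Spark would serialize it to executors

def f(record):
    return record * factor

partition = [1, 2, 3]              # one partition = one batch of records
result = list(map(f, partition))   # applied in memory, no disk I/O
print(result)                      # → [3, 6, 9]
```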
What is a task in Spark vocabulary?
A computation on one partition, executed locally inside an executor.
For a hypothetical Spark job executed on a cluster with the options --master yarn --deploy-mode cluster, estimate the number of computing cores that this job will occupy.
It depends on the number of free cores on the cluster, and on the number and size of the input files.
These options do not fix a maximum number of executors; dynamic allocation is used.
Why is a Spark join operation expensive?
It imposes a common partitioning on both RDDs, and will likely need to repartition (shuffle) at least one of the inputs over the network.
If a Spark join is really needed, what can the programmer do to make this operation less expensive?
Pre-partition both RDDs with the same partitioning function.
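A plain-Python stand-in for the idea (the hash partitioner and variable names are illustrative; in real Spark one would call `partitionBy` with the same partitioner on both pair RDDs before the join): when both inputs use the same partitioner, matching keys already sit in the same partition, so the join runs locally per partition with no shuffle.

```python
NUM_PARTITIONS = 2

def partitioner(key):
    # The same hash partitioning is applied to both inputs.
    return hash(key) % NUM_PARTITIONS

a = [("x", 1), ("y", 2), ("z", 3)]
b = [("x", 10), ("z", 30)]

pa = [[kv for kv in a if partitioner(kv[0]) == i] for i in range(NUM_PARTITIONS)]
pb = [[kv for kv in b if partitioner(kv[0]) == i] for i in range(NUM_PARTITIONS)]

# The join is now a per-partition, purely local operation.
joined = []
for i in range(NUM_PARTITIONS):
    lookup = dict(pb[i])
    joined += [(k, (v, lookup[k])) for k, v in pa[i] if k in lookup]

print(sorted(joined))  # [('x', (1, 10)), ('z', (3, 30))]
```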
Storm & Spark Streaming: What is the most important difference between the basic data types processed on the cluster machines in these frameworks?
Spark Streaming processes (micro-)batches of records, whereas Storm processes tuples one at a time.
Storm & Spark Streaming: How is intermediate data stored in these two frameworks?
Both store it locally in memory (though other configurations are possible).