Session 5.2 Flashcards

Question 1

Q

Desired Properties of a Big Data System

Answer

A

1. Robustness and fault tolerance
2. Low latency
3. Minimal maintenance
4. Ad hoc queries

Question 2

Q

Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem

Answer

A

There is no single tool that provides a complete solution

- Instead, you have to use a variety of tools and techniques to build a complete Big Data system

Question 3

Q

The main idea of the Lambda Architecture is to…

Answer

A

build Big Data systems as a series of layers

Each layer satisfies a subset of the desired properties and builds upon the functionality provided by the layers beneath it

Question 4

Q

Storing data in raw format has many advantages:

Answer

A

Data is always true (or correct): all records are always correct; no need to go back and re-write existing records; you can simply append new data
You can always go back to the data and perform queries you did not anticipate when building the system
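
A minimal sketch of the append-only idea, assuming a hypothetical in-memory log of page-view events (the `events` list, the `record` helper, and the field layout are illustrative and not part of the course material). New facts are only appended, and a query nobody planned for can still be answered from the raw records:

```python
from collections import Counter

# Hypothetical raw event log: each record is a fact that is never modified.
events = []

def record(user, page, ts):
    """Append a new fact; existing records are never rewritten."""
    events.append((user, page, ts))

record("alice", "/home", 1)
record("bob", "/home", 2)
record("alice", "/pricing", 3)

# A query that was not anticipated when the log was designed:
# how many views did each page receive?
views_per_page = Counter(page for _, page, _ in events)
print(views_per_page)
```

Because the raw records are kept, a different question (for example, views per user) only needs a new query, not a change to the stored data.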

Question 5

Q

Data stored in raw format should be…

Answer

A

1. Immutable
2. Kept forever

Question 6

Q

Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers

Answer

A

They scale by adding more machines to the cluster
Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible
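
The fault-tolerance point can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyCluster`, its round-robin placement, and the replica count of 2 are hypothetical and do not describe how any particular distributed file system places data.

```python
class ToyCluster:
    """Hypothetical toy model: every file is copied to `replicas` machines."""

    def __init__(self, machine_names, replicas=2):
        self.machines = {name: {} for name in machine_names}
        self.replicas = replicas
        self._writes = 0

    def write(self, path, data):
        live = sorted(self.machines)
        if len(live) < self.replicas:
            raise RuntimeError("not enough live machines for the requested replicas")
        # Place the copies on consecutive machines, rotating the start each write.
        start = self._writes % len(live)
        for i in range(self.replicas):
            self.machines[live[(start + i) % len(live)]][path] = data
        self._writes += 1

    def fail(self, name):
        """Simulate losing a machine together with everything stored on it."""
        del self.machines[name]

    def read(self, path):
        # Any surviving copy is good enough to serve the read.
        for files in self.machines.values():
            if path in files:
                return files[path]
        raise FileNotFoundError(path)

cluster = ToyCluster(["m1", "m2", "m3"], replicas=2)
cluster.write("/logs/a.txt", b"first file")
cluster.write("/logs/b.txt", b"second file")
cluster.fail("m2")  # one machine goes down
print(cluster.read("/logs/a.txt"), cluster.read("/logs/b.txt"))
```

Both files remain readable after `m2` is lost because each one has a second copy on another machine.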

Question 7

Q

The operations you can do with a distributed filesystem are often…

Answer

A

more limited than you can do with a regular filesystem

For instance, you may not be able to write to the middle of a file or even modify a file at all after creation
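
As a rough illustration of such a restricted interface, the sketch below models a file that can be appended to and read but not modified in place. The `AppendOnlyFile` class and its method names are hypothetical, chosen only for this example.

```python
class AppendOnlyFile:
    """Hypothetical model of a file that supports appends and reads only."""

    def __init__(self):
        self._chunks = []

    def append(self, data: bytes) -> None:
        self._chunks.append(bytes(data))

    def read(self) -> bytes:
        return b"".join(self._chunks)

    def write_at(self, offset: int, data: bytes) -> None:
        # Writing into the middle of the file is deliberately unsupported here.
        raise NotImplementedError("in-place writes are not supported in this model")

f = AppendOnlyFile()
f.append(b"line 1\n")
f.append(b"line 2\n")
try:
    f.write_at(0, b"LINE")
except NotImplementedError as err:
    print("rejected:", err)
print(f.read())
```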

Question 8

Q

Why did Google need a good distributed file system?

Answer

A

Redundant storage of massive amounts of data on cheap and unreliable computers

Question 9

Q

Why did Google not use an existing file system?

Answer

A

Google’s problems were different from anyone else’s
- Different workload and design priorities
Google File System is designed for Google apps and workloads
Google apps are designed for Google File System

Question 10

Q

Google File System - Assumptions

Answer

A

High component failure rates
“Modest” number of HUGE files
Files are write-once, mostly appended to
Large streaming reads

Question 11

Q

Bigtable is…

Answer

A

a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Question 12

Q

The Split-Apply-Combine Approach

Split

Answer

A

Break up a big problem into manageable pieces

Question 13

Q

The Split-Apply-Combine Approach

Apply

Answer

A

Operate on each piece independently

Question 14

Q

The Split-Apply-Combine Approach

Combine

Answer

A

Put all the pieces back together
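
The three steps of split-apply-combine can be sketched in a few lines. The `readings` data and the choice of a per-city mean are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical data: (city, temperature) readings.
readings = [("Oslo", 3.0), ("Rome", 18.0), ("Oslo", 5.0),
            ("Rome", 20.0), ("Cairo", 30.0)]

# Split: break the dataset into one piece per city.
pieces = defaultdict(list)
for city, temp in readings:
    pieces[city].append(temp)

# Apply: operate on each piece independently.
def mean(values):
    return sum(values) / len(values)

results = [(city, mean(temps)) for city, temps in pieces.items()]

# Combine: put the per-piece results back together.
mean_by_city = dict(results)
print(mean_by_city)
```

Since each piece is processed without looking at the others, the apply step is the part that could be handed to separate workers.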

Question 15

Q

The MapReduce model consists of two main stages

Map

Answer

A

input data is split into discrete chunks to be processed

Question 16

Q

The MapReduce model consists of two main stages

Reduce

Answer

A

output of the map phase is aggregated to produce the desired result
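
The two stages can be sketched as a single-process word count. This is a toy illustration, not a real MapReduce framework: the function names are hypothetical, and the intermediate `shuffle` step (grouping map output by key before reducing) is added here as an assumption about how the stages connect.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values seen for one key."""
    return key, sum(values)

# The input is already split into discrete chunks.
chunks = ["big data big systems", "data systems scale", "big clusters"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Each call to `map_phase` depends only on its own chunk, which is what allows the map work to be spread over many machines in a real deployment.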

Question 17

Q

MapReduce

The simple nature of the programming model lends itself to…

Answer

A

efficient and large-scale implementations across thousands of cheap nodes (computers).

Question 18

Q

Key benefits of MapReduce

Answer

A

Simplicity
Scalability
Speed
Recovery
Minimal data motion

Question 19

Q

Key benefits of MapReduce

Simplicity

Answer

A

Developers can write applications in their language of choice, such as Java, C++ or Python

Question 20

Q

Key benefits of MapReduce

Scalability

Answer

A

MapReduce can process very large amounts of data, stored in HDFS on one cluster

Question 21

Q

Key benefits of MapReduce

Speed

Answer

A

Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes

Question 22

Q

Limitations of MapReduce

Answer

A

1. MapReduce is designed specifically for batch processing
2. Low-level framework (hard to use)

Question 23

Q

New tools have been developed to simplify the use of MapReduce

Answer

A

- Apache Hive (similar to SQL)
- Apache Pig (scripting language)

(23 cards)