Session 5.2 Flashcards

1
Q

Desired Properties of a Big Data System

A

1 Robustness and fault tolerance
2 Low latency
3 Minimal maintenance
4 Ad hoc queries

2
Q

Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem

A
  • There is no single tool that provides a complete solution
  • Instead, you have to use a variety of tools and techniques to build a complete Big Data system

3
Q

The main idea of the Lambda Architecture is to…

A

build Big Data systems as a series of layers

Each layer satisfies a subset of the desired properties and builds upon the functionality provided by the layers beneath it

4
Q

Storing data in raw format has many advantages:

A
  1. Data is always true (or correct): all records are always correct; no need to go back and re-write existing records; you can simply append new data
  2. You can always go back to the data and perform queries you did not anticipate when building the system
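A minimal sketch of both advantages, assuming a hypothetical in-memory event log (`raw_events` and `record` are illustrative names, not part of any real system): records are only appended, and queries that were never planned can still be run over the untouched raw data.

```python
from collections import Counter

# Hypothetical raw event log: records are only ever appended, never edited.
raw_events = []

def record(user, action):
    """Append one immutable fact (a tuple) to the raw log."""
    raw_events.append((user, action))

record("alice", "login")
record("bob", "login")
record("alice", "purchase")

# A query nobody planned for when the log was designed:
# how many events did each user generate?
events_per_user = Counter(user for user, _ in raw_events)

# Another unplanned query over the same untouched data:
# which users have made a purchase?
purchasers = {user for user, action in raw_events if action == "purchase"}
```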
5
Q

Data stored in raw format should be…

A
  1. Immutable
  2. Kept forever

6
Q

Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers

A
  • They scale by adding more machines to the cluster
  • Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible
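One way such fault tolerance could be achieved is by keeping copies of each block of a file on more than one machine. The toy sketch below assumes a simple round-robin placement with two replicas; the function names and the placement policy are illustrative assumptions, not how any particular distributed file system works.

```python
def place_blocks(blocks, machines, replicas=2):
    """Assign each block to `replicas` distinct machines, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {
            machines[(i + offset) % len(machines)] for offset in range(replicas)
        }
    return placement

def readable_blocks(placement, failed):
    """Return the blocks that still have a copy on at least one live machine."""
    return {block for block, hosts in placement.items() if hosts - failed}

machines = ["m1", "m2", "m3"]
placement = place_blocks(["b0", "b1", "b2", "b3"], machines)

# Losing any single machine leaves every block readable.
survives_single_failure = all(
    readable_blocks(placement, {m}) == set(placement) for m in machines
)
```

With two replicas, losing two machines at once can still make some blocks unreadable; a higher replica count trades storage for more tolerance.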
7
Q

The operations you can do with a distributed filesystem are often…

A

more limited than what you can do with a regular filesystem

For instance, you may not be able to write to the middle of a file or even modify a file at all after creation

8
Q

Why did Google need a good distributed file system?

A

Redundant storage of massive amounts of data on cheap and unreliable computers

9
Q

Why did Google not use an existing file system?

A
  • Google’s problems were different from anyone else’s
  • Different workload and design priorities
  • Google File System is designed for Google apps and workloads
  • Google apps are designed for Google File System
10
Q

Google File System - Assumptions

A
  1. High component failure rates
  2. “Modest” number of HUGE files
  3. Files are write-once, mostly appended to
  4. Large streaming reads
11
Q

Bigtable is…

A

a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

12
Q

The Split-Apply-Combine Approach

Split

A

Break up a big problem into manageable pieces

13
Q

The Split-Apply-Combine Approach

Apply

A

Operate on each piece independently

14
Q

The Split-Apply-Combine Approach

Combine

A

Put all the pieces back together
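The three steps (split, apply, combine) can be sketched on a toy problem, summing a list of numbers. The function names and the piece size are illustrative assumptions.

```python
def split(values, size):
    """Split: break the list into pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def apply_to_piece(piece):
    """Apply: operate on one piece independently of the others."""
    return sum(piece)

def combine(partials):
    """Combine: put the partial results back together."""
    return sum(partials)

data = list(range(1, 11))                         # 1..10
pieces = split(data, 4)                           # three pieces
partials = [apply_to_piece(p) for p in pieces]    # one partial sum per piece
total = combine(partials)
```

Because each piece is processed independently, the apply step could run on different machines without changing the result.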

15
Q

The MapReduce model consists of two main stages

Map

A

input data is split into discrete chunks to be processed

16
Q

The MapReduce model consists of two main stages

Reduce

A

output of the map phase is aggregated to produce the desired result
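The two stages can be illustrated with a single-process word count. This is only a sketch of the idea, not the Hadoop API: the function names are hypothetical, and the intermediate grouping-by-key step (`shuffle`) is an assumption added here because the reduce stage needs all values for a key together.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key into the final result."""
    return key, sum(values)

# Input already split into discrete chunks.
chunks = ["big data is big", "data is data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster each chunk would be mapped on a different node; here everything runs in one process to keep the example self-contained.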

17
Q

MapReduce

The simple nature of the programming model lends itself to…

A

efficient and large-scale implementations across thousands of cheap nodes (computers).

  • efficient
  • large-scale implementations
18
Q

Key benefits of MapReduce

A
  1. Simplicity
  2. Scalability
  3. Speed
  4. Recovery
  5. Minimal data motion
19
Q

Key benefits of MapReduce

Simplicity

A

Developers can write applications in their language of choice, such as Java, C++ or Python

20
Q

Key benefits of MapReduce

Scalability

A

MapReduce can process very large amounts of data, stored in HDFS on one cluster

21
Q

Key benefits of MapReduce

Speed

A

Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes

22
Q

Limitations of MapReduce

A
  1. MapReduce is designed specifically for batch processing
  2. Low-level framework (hard to use)

23
Q

New tools have been developed to simplify the use of MapReduce

A
  • Apache Hive (similar to SQL)
  • Apache Pig (scripting language)