MapReduce And ACID vs BASE Flashcards

1
Q

MapReduce

A

A programming model, introduced by Google, that splits the processing of a large data set into tasks that run in parallel on separate machines

Like a distributed GROUP BY

2
Q

Phases

A

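Map phase: each node processes its own share of the data and emits key-value pairs. Runs on all nodes in parallel.
Shuffle phase: groups the pairs by key, moving data between nodes so that all values for the same key (category) end up together.
Reduce phase: aggregates the values for each key, e.g. totals the number of things in each category.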
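
The three phases can be sketched on a single machine as a word count. This is only an illustration, not how Hadoop is implemented: the `nodes` list and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for this example, and each inner list stands in for the data held on one node.

```python
from collections import defaultdict

# Hypothetical input: each inner list stands in for the records on one node.
nodes = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry"],
]

def map_phase(records):
    """Map: emit a (key, 1) pair for every record on one node."""
    return [(record, 1) for record in records]

def shuffle_phase(mapped_per_node):
    """Shuffle: gather all values that share a key, across every node."""
    grouped = defaultdict(list)
    for pairs in mapped_per_node:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: total the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# On a real cluster the map calls would run in parallel, one per node.
mapped = [map_phase(records) for records in nodes]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'apple': 3, 'banana': 2, 'cherry': 3}
```

The result is the same as a SQL `GROUP BY` with `COUNT(*)` over all the records.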

3
Q

ACID properties (of database transactions)

A

Atomicity
Consistency
Isolation
Durability

4
Q

Atomicity

A

All or nothing. In a transaction, all the operations must succeed or fail as a group
-> ensures data integrity

Achieved through COMMIT and ROLLBACK statements
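
A minimal sketch of COMMIT and ROLLBACK using Python's built-in `sqlite3` module. The `accounts` table, the names in it, and the `transfer` helper are hypothetical, chosen only to show a two-step transaction where the second step can fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates happen, or neither does."""
    try:
        # Step 1: credit the receiver (sqlite3 opens a transaction here).
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        # Step 2: debit the sender; the CHECK constraint rejects an overdraft.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        conn.commit()    # COMMIT: make both changes permanent
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # ROLLBACK: undo step 1 as well
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

failed = transfer(conn, "alice", "bob", 500)   # overdraft, rolled back
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_succeeded = balances(conn)
print(failed, after_failed)        # False {'alice': 100, 'bob': 50}
print(succeeded, after_succeeded)  # True {'alice': 70, 'bob': 80}
```

After the failed transfer, bob's credit from step 1 is gone too, which is the "all or nothing" behaviour.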

5
Q

Consistency

A

Data must be in a valid state (all integrity rules satisfied) both before and after a transaction

Achieved through forward recovery and backward recovery

6
Q

Isolation

A

Concurrent transactions do not interfere with each other; the result is the same as if they had run one at a time

Achieved through locking
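
As a rough analogy for locking (not how a DBMS implements it), the sketch below uses Python threads in place of concurrent transactions. The `balance` variable and the `deposit` function are invented for this example; the lock makes each read-modify-write step run without interference from the other threads.

```python
import threading

balance = 0
lock = threading.Lock()

def deposit(times):
    """Add 1 to the shared balance, holding the lock for each update."""
    global balance
    for _ in range(times):
        with lock:  # only one thread at a time may update the balance
            balance += 1

# Four "transactions" running at the same time.
threads = [threading.Thread(target=deposit, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # 40000
```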

7
Q

Durability

A

Once a transaction is committed, its changes are permanent, even if the system fails afterwards

Achieved through forward recovery and backward recovery

8
Q

BASE properties (a good alternative to ACID only if query results can tolerate some inconsistencies)

A

Basic Availability
Soft state
Eventual Consistency

9
Q

Basic availability

A

The system stays available and keeps responding to requests even when part of it fails (e.g. failure of a node)

10
Q

Soft state

A

The state of the system is in flux and may change over time, even without new input, as updates spread between nodes
(Big data systems accept that the data may not be exactly correct at every moment)

11
Q

Eventual consistency

A

Data may not be consistent across nodes in the short run, but it becomes consistent once updates have propagated to every node (assuming no new updates arrive)
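
A toy simulation of this behaviour, assuming three replicas of the same data. The `Replica` class and the `write`, `read`, and `propagate` helpers are hypothetical and do not correspond to any real database API.

```python
class Replica:
    """One copy of the data, held on one node."""
    def __init__(self):
        self.data = {}

replicas = [Replica(), Replica(), Replica()]
pending = []  # updates accepted by replica 0 but not yet sent to the others

def write(key, value):
    """Accept the write on replica 0 and queue it for the other replicas."""
    replicas[0].data[key] = value
    pending.append((key, value))

def read(replica_index, key):
    """Read from one replica; the answer may be stale."""
    return replicas[replica_index].data.get(key)

def propagate():
    """Deliver every queued update to the remaining replicas."""
    while pending:
        key, value = pending.pop(0)
        for replica in replicas[1:]:
            replica.data[key] = value

write("likes", 42)
stale = read(2, "likes")  # replica 2 has not seen the update yet
propagate()
fresh = read(2, "likes")  # all replicas now agree
print(stale, fresh)       # None 42
```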

12
Q

Hadoop ecosystem

A

Tools that make Hadoop easy to use for people without Java programming skills.

13
Q

Elements of the Hadoop ecosystem

A

Hive, Sqoop, Pig, Flume, HBase, Impala

14
Q

Hive

A

Data warehouse (DW) system that works with HDFS; it is not a relational database.
HiveQL is a declarative (what) SQL-like query language. Hive translates queries into MapReduce jobs.
Works best on large sets of data; doesn’t return small sets of data quickly.

15
Q

Pig

A

Uses the Pig Latin scripting language. Procedural (how).
Compiles Pig Latin into MapReduce jobs.
Good for data transformation

16
Q

Flume

A

(The web clickstream one)
Harvests large sets of data from server log files. Can be configured to import data on a regular schedule and can move data into HDFS.

17
Q

Sqoop

A

SQL to Hadoop
Transfers data back and forth between a relational DBMS and Hadoop

18
Q

HBase

A

A NoSQL database that works directly with HDFS
Does not rely on MapReduce
Suitable for fast processing of small data sets
Very good at quickly processing sparse data sets
Used for Facebook's messaging system

19
Q

Impala

A

Supports SQL queries that pull data directly from HDFS
Works well for processing large data sets into small result sets