MapReduce And ACID vs BASE Flashcards
MapReduce
A programming model, introduced by Google, that splits a large data-processing job into tasks that run in parallel on separate machines
Like a distributed GROUP BY
Phases
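Map phase: each node reads its own block of data and emits key-value pairs. Runs on all nodes in parallel
Shuffle phase: moves the pairs between nodes so that all values with the same key end up grouped together
Reduce phase: aggregates the values for each key (for example, totals the number of things in each category)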
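The three phases can be sketched as a single-process word count in plain Python. This is only an illustration, not Hadoop code: the `nodes` list stands in for data blocks held on separate machines, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for this example.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in one node's lines."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle_phase(mapped_per_node):
    """Group the values by key across the map output of all nodes."""
    groups = defaultdict(list)
    for pairs in mapped_per_node:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Aggregate the grouped values for each key (here: sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input: each inner list is the data stored on one node.
nodes = [["big data big"], ["data lake"]]

mapped = [map_phase(lines) for lines in nodes]  # would run in parallel on a cluster
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 2, 'data': 2, 'lake': 1}
```

The result matches what `SELECT word, COUNT(*) ... GROUP BY word` would return, which is why MapReduce is described as a distributed GROUP BY.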
ACID properties (of database transactions)
Atomicity
Consistency
Isolation
Durability
Atomicity
All or nothing. In a transaction all the operations must succeed or fail as a group
-> ensures data integrity
Achieved through COMMIT and ROLLBACK statements
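A minimal sketch of all-or-nothing behavior, using Python's built-in `sqlite3` module with an in-memory database. The `accounts` table, the `transfer` helper, and the balances are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.commit()      # make both updates permanent together
        return True
    except sqlite3.IntegrityError:
        conn.rollback()    # undo the credit that already ran
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

failed = transfer(conn, "alice", "bob", 500)   # violates CHECK, so it is rolled back
after_failure = balances(conn)
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failure, succeeded, after_success)
```

In the failed transfer the credit to `bob` runs first and is then undone by the rollback, so neither balance changes.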
Consistency
Data must be consistent before and after a transaction
Achieved through integrity constraints, with forward recovery and backward recovery restoring a consistent state after a failure
Isolation
Concurrent transactions do not interfere with each other; each runs as if it were the only one
Achieved through Locking
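As a loose analogy for locking, the sketch below uses a `threading.Lock` so that concurrent updates to a shared value do not interleave. A real database uses its own lock manager on rows, pages, or tables; the `deposit` function and the amounts here are hypothetical.

```python
import threading

balance = 0
balance_lock = threading.Lock()

def deposit(times):
    """Add 1 to the shared balance `times` times, one locked update at a time."""
    global balance
    for _ in range(times):
        with balance_lock:          # only one thread may hold the lock
            current = balance       # read
            balance = current + 1   # write, with no other thread in between

threads = [threading.Thread(target=deposit, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # 4000
```

Without the lock, two threads could read the same `current` value and one of the updates would be lost.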
Durability
Once a transaction is committed, its changes are permanent, even if the system fails afterwards
Achieved through: Forward recovery, backward recovery
BASE properties (a good alternative to ACID only if query results can tolerate some inconsistencies)
Basic Availability
Soft state
Eventual Consistency
Basic availability
The system keeps responding even during a partial failure (such as the failure of a node), although some data may be temporarily unavailable or out of date
Soft state
State of the system is in flux and may change over time
(Big data systems often accept that a read may be slightly out of date in exchange for availability)
Eventual consistency
May not be consistent in the short run, but all copies of the data will eventually become consistent once updates have propagated to every node
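Eventual consistency can be illustrated with a toy simulation of two replicas. This is a sketch under simplifying assumptions, not how any particular database works: the `Replica` class, the `sync` function, and the version-number rule (highest version wins) are all invented for this example.

```python
class Replica:
    """One copy of the data; stores key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Keep the write only if it is newer than what this replica has.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def sync(replicas):
    """Propagate every replica's entries to all the others."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)

a, b = Replica(), Replica()
a.write("cart", ["book"], version=1)   # the update reaches replica a only

stale = b.read("cart")   # replica b has not seen the update yet
sync([a, b])             # updates propagate in the background
fresh = b.read("cart")   # now both replicas agree

print(stale, fresh)  # None ['book']
```

Between the write and the sync, a reader of replica `b` sees old data; this window is the inconsistency that BASE systems tolerate.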
Hadoop ecosystem
Tools that make Hadoop easy to use for people without Java programming skills.
Elements of the Hadoop ecosystem
Hive, Sqoop, Pig, Flume, HBase, Impala
Hive
Data warehouse (DW) system built on top of HDFS; it is not a relational database.
HiveQL is a declarative (what) SQL-like query language. Hive compiles queries into MapReduce jobs.
Works best on large sets of data; it is slow at returning small result sets because every query carries the overhead of starting a job.
Pig
Pig Latin scripting language. Procedural (how).
Compiles Pig Latin scripts into MapReduce jobs.
Good for data transformation