Distributed Databases and MapReduce Flashcards
1
Q
Data Partitioning
A
- Data is partitioned or fragmented across multiple machines
2
Q
Data Replication
A
- Copies of the same data are made available on multiple machines
3
Q
Horizontal Fragmentation
A
- Divides up the rows of a collection of records
4
Q
Vertical Fragmentation
A
- Divides up the columns of a collection of records
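A minimal sketch of horizontal and vertical fragmentation, assuming records are plain dictionaries. The sample data, the field names, and both helper functions are hypothetical. Keeping the primary key in every vertical fragment is an assumption made here so the rows can be rejoined.

```python
# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Ann", "city": "Boston"},
    {"id": 2, "name": "Bob", "city": "Denver"},
    {"id": 3, "name": "Cy", "city": "Boston"},
]

def horizontal_fragments(rows, key):
    """Group whole rows into fragments by the value of one field."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments

def vertical_fragments(rows, column_groups, pk="id"):
    """Split the columns into fragments; each fragment keeps the primary key."""
    return [
        [{c: row[c] for c in [pk, *cols]} for row in rows]
        for cols in column_groups
    ]

by_city = horizontal_fragments(records, "city")              # rows divided up
by_cols = vertical_fragments(records, [["name"], ["city"]])  # columns divided up
```

Each fragment in `by_city` or `by_cols` could then be stored on a different machine.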
5
Q
Advantages of a Distributed Database
A
- Improved performance
- High availability
- Modular growth
- Integration of data from multiple existing systems
6
Q
Challenges of a Distributed Database
A
- Distributing the data
- Efficient query execution
- Maintaining integrity constraints (primary keys, foreign keys, etc.)
- Keeping replicated data consistent
- Managing distributed transactions
7
Q
Distributed Transaction
A
- A transaction that involves data stored at multiple sites
- One site serves as the coordinator
8
Q
Synchronous Replication
A
- Transactions are guaranteed to see the most up-to-date value of an item
9
Q
Asynchronous Replication
A
- Transactions are not guaranteed to see the most up-to-date value of an item
10
Q
Primary-Site Replication
A
- One replica is designated the primary replica
- The primary replica receives all writes and propagates them to the secondary replicas
11
Q
Peer-to-Peer Replication
A
- More than one replica can be updated
12
Q
Synchronous Replication: Read-Any, Write-All
A
- When reading an item, access any of the replicas
- When writing an item, must update all of the replicas
13
Q
Synchronous Replication: Voting
A
- n = number of copies, w = copies written, r = copies read
- Need r > n - w (equivalently, r + w > n), so every read includes at least one copy from the most recent write
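The voting condition can be checked by brute force. The sketch below is illustrative only, and both function names are hypothetical. It compares the inequality r > n - w against a direct test that every possible read set shares at least one replica with every possible write set.

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """The voting condition from the card: r > n - w."""
    return r > n - w

def always_intersect(n, w, r):
    """Brute force: does every set of r replicas share a member with every set of w?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )

# Read-any, write-all is the special case w = n, r = 1.
read_any_write_all_ok = quorums_overlap(5, 5, 1)
```

For small n, the two functions agree for every choice of w and r between 1 and n.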
14
Q
Global Locks
A
- Shared and exclusive locks for a logical item
- No two transactions can hold a global exclusive lock for the same item
- Any number of transactions can hold a global shared lock for an item
15
Q
Centralized Locking
A
- One site manages the lock requests for all items in the distributed database
- The lock site can become a bottleneck
16
Q
Primary-Copy Locking
A
- One copy of an item is designated the primary copy
- The site holding the primary copy handles all lock requests for that item
17
Q
Fully Distributed Locking
A
- A transaction acquires a global lock for an item by locking a sufficient number of the item’s copies
- n = total copies, x = number locked for global exclusive lock, s = number locked for global shared lock
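- Need x > n / 2, so that two transactions cannot both hold a global exclusive lock
- Need s > n - x, so that a global shared lock and a global exclusive lock cannot be held at the same time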
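A small sketch of the two threshold checks for fully distributed locking. The function name is hypothetical, and the example values of n, x, and s are chosen only for illustration.

```python
def valid_lock_thresholds(n, x, s):
    """Check both conditions from the card for n copies of an item."""
    # Two sets of more than n / 2 copies must overlap,
    # so two exclusive locks cannot coexist.
    exclusive_excludes_exclusive = x > n / 2
    # A shared set and an exclusive set must overlap,
    # so a shared lock and an exclusive lock cannot coexist.
    shared_excludes_exclusive = s > n - x
    return exclusive_excludes_exclusive and shared_excludes_exclusive

majority = valid_lock_thresholds(5, 3, 3)         # lock a majority for both kinds
write_all = valid_lock_thresholds(5, 5, 1)        # lock all copies to write, any one to read
too_few_exclusive = valid_lock_thresholds(5, 2, 4)
too_few_shared = valid_lock_thresholds(4, 3, 1)
```

Raising x allows s to be lowered, which makes shared locks cheaper at the cost of more expensive exclusive locks.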
18
Q
Distributed Deadlock Handling
A
- Difficult to detect deadlock, so roll back a transaction if it waits too long (timeout)
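A rough sketch of the timeout approach, using a local `threading.Lock` as a stand-in for a lock held at a remote site (an assumption; a real distributed lock manager would be involved). The function name and the timeout value are hypothetical.

```python
import threading

def acquire_or_abort(lock, timeout_seconds):
    """Try to acquire the lock; False means the caller should roll back."""
    return lock.acquire(timeout=timeout_seconds)

item_lock = threading.Lock()
item_lock.acquire()  # another transaction already holds the lock

# The waiting transaction gives up after the timeout instead of waiting forever.
got_lock_while_held = acquire_or_abort(item_lock, 0.05)

item_lock.release()  # the holder finishes
got_lock_after_release = acquire_or_abort(item_lock, 0.05)
```

A timeout can also roll back a transaction that was merely slow rather than deadlocked, which is the price of avoiding distributed deadlock detection.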
19
Q
MapReduce
A
- Splits the collection of records into subcollections that are processed in parallel
20
Q
Benefits of MapReduce
A
- Parallel processing
- Less data transfer across machines
- Fault tolerance
21
Q
Mapper
A
- Applies a map function to each record to create (key, value) pairs
22
Q
Reducer
A
- Applies a reduce function to each (key, list of values) pair
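A single-process sketch of the mapper and reducer roles, using word counting as the example. `run_job`, `map_fn`, and `reduce_fn` are hypothetical names. A real framework would run the map and reduce calls in parallel across machines; here they run in one loop.

```python
from collections import defaultdict

def map_fn(record):
    """Mapper: emit a (key, value) pair for each word in the record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """Reducer: combine the list of values collected for one key."""
    return (key, sum(values))

def run_job(records, map_fn, reduce_fn):
    """Apply the mapper, group values by key, then apply the reducer per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

counts = run_job(["a b a", "b c"], map_fn, reduce_fn)
```

Here `counts` holds one (word, count) pair per distinct word, sorted by word.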
23
Q
Chaining MapReduce Jobs
A
- In the second job, map every reduced result of the first job to the same arbitrary constant key, so that a single reducer task processes all of them
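A sketch of chaining two jobs, assuming a simple single-process `run_job` helper (hypothetical, as are all function names and the constant key "all"). The first job counts words. The second maps every (word, count) result to one constant key, so a single reducer call sees them all and can pick the most frequent word.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Apply the mapper, group values by key, then apply the reducer per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# Job 1: count the occurrences of each word.
def count_map(line):
    for word in line.split():
        yield (word, 1)

def count_reduce(word, ones):
    return (word, sum(ones))

# Job 2: send every result of job 1 to the same constant key.
def const_map(pair):
    yield ("all", pair)

def max_reduce(key, pairs):
    # A single reducer call receives every (word, count) pair.
    return max(pairs, key=lambda pair: pair[1])

first = run_job(["a b a", "b a c"], count_map, count_reduce)
second = run_job(first, const_map, max_reduce)
```

Because every record shares one key, the second job has no parallelism in its reduce phase, which is acceptable when the first job has already shrunk the data.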