Distributing Databases Flashcards
What are the requirements for building a distributed relational database system?
- Local autonomy: sites should operate INDEPENDENTLY, and not interfere with one another. No site should rely on another
- No centralization: no single site-failure should break the system
No one central site crucial for everything! - Continuous Operation: an aspiration - the system is AVAILABLE most of the time, and reliable - when it falls over, it should recover gracefully.
- Location Independence: user need not know where the data is located. Data can be moved between sites without changing functionality.
- Partition Independence: user does not need to know how data’s partitioned, data can be moved from one partition to another. Optimization can still benefit from partition knowledge (where names M-Z are stored), but not necessary.
- Replication Independence: usually distribution duplicates data, which is what we were trying to avoid!
- Execute queries close to the data: moving data/processing it is expensive. Generate smaller results before moving it (MapReduce takes advantage of this reduction, by moving processing close to the data).
- Distribution among different DBMSs should be possible: an aspiration
What does “distribution making no difference to the user” mean?
- Hardware/OS independence
- Network independence
What is another word for partitioning a database?
Fragmentation
What is “vertical partitioning”?
Breaking a table into columns (like in normalization) before distributing it, to reduce duplication.
Why use vertical partitioning?
When some cols used more than others, some are either read/written to a lot more
What is “horizontal partitioning”?
Can partition rows based on hash/keys.
What do partitioning methods depend on?
Foreign Keys, how to JOIN the data later. Also the col definitions and “catalogue”: a relation is not just the data!
Catalogue management has to be distributed as well. Any catalogue changes have to be distributed to maintain the integrity of the database.
How to ensure distributed DBs have recovery control?
Transactions can be hardened with a 2-phase protocol: every node is locked for the duration of the operations, and only confirms them when everyone is happy. A node is assigned as the “coordinator” for the duration of the transaction. It is hard to ensure ACID properties with distributed databases.
What is Brewer’s conjecture?
Three goals (CAP) in tension with a distributed database.
1. Consistency: all parts of the DB should converge to a consistent state (a weaker version of ACID consistency).
2. Availability: every request should result in a response AT SOME POINT eventually, even if it’s an error.
3. Partition Tolerance: if a network flaw breaks the network into two subnets, the database should keep going, then run and recover (consistency) eventually.
Brewer’s Conjecture states these 3 cannot be fully satisfied all at once. You have to prioritize some over the other.
Which to prioritize depends on the particular use case/context.
What is a key-value database?
Has 2 columns in a table. A ‘key’ column (the INDEX) and a ‘value’ column.
Does a key-value database have JOINs or Foreign Keys (integrity constraints)?
NO
What really are key-value pairs in terms of data structures?
Associative arrays (dict)
Is there explicit connection between 2 entries in key-value DB?
No
Is a key-value database good for sophisticated retrieval?
No
How are things retrieved in a key-value database?
Just ‘walk over’ the whole database.