Distributing Databases Flashcards

1
Q

What are the requirements for building a distributed relational database system?

A
  1. Local autonomy: sites should operate INDEPENDENTLY, and not interfere with one another. No site should rely on another
  2. No centralization: no single site-failure should break the system
    No one central site crucial for everything!
  3. Continuous Operation: an aspiration - the system is AVAILABLE most of the time, and reliable - when it falls over, it should recover gracefully.
  4. Location Independence: user need not know where the data is located. Data can be moved between sites without changing functionality.
  5. Partition Independence: user does not need to know how data’s partitioned, data can be moved from one partition to another. Optimization can still benefit from partition knowledge (where names M-Z are stored), but not necessary.
  6. Replication Independence: usually distribution duplicates data, which is what we were trying to avoid!
  7. Execute queries close to the data: moving data/processing it is expensive. Generate smaller results before moving it (MapReduce takes advantage of this reduction, by moving processing close to the data).
  8. Distribution among different DBMSs should be possible: an aspiration
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What does “distribution making no difference to the user” mean?

A
  • Hardware/OS independence
  • Network independence
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is another word for partitioning a database?

A

Fragmentation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is “vertical partitioning”?

A

Breaking a table into columns (like in normalization) before distributing it, to reduce duplication.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Why use vertical partitioning?

A

When some cols used more than others, some are either read/written to a lot more

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is “horizontal partitioning”?

A

Can partition rows based on hash/keys.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What do partitioning methods depend on?

A

Foreign Keys, how to JOIN the data later. Also the col definitions and “catalogue”: a relation is not just the data!
Catalogue management has to be distributed as well. Any catalogue changes have to be distributed to maintain the integrity of the database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How to ensure distributed DBs have recovery control?

A

Transactions can be hardened with a 2-phase protocol: every node is locked for the duration of the operations, and only confirms them when everyone is happy. A node is assigned as the “coordinator” for the duration of the transaction. It is hard to ensure ACID properties with distributed databases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is Brewer’s conjecture?

A

Three goals (CAP) in tension with a distributed database.
1. Consistency: all parts of the DB should converge to a consistent state (a weaker version of ACID consistency).
2. Availability: every request should result in a response AT SOME POINT eventually, even if it’s an error.
3. Partition Tolerance: if a network flaw breaks the network into two subnets, the database should keep going, then run and recover (consistency) eventually.
Brewer’s Conjecture states these 3 cannot be fully satisfied all at once. You have to prioritize some over the other.
Which to prioritize depends on the particular use case/context.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is a key-value database?

A

Has 2 columns in a table. A ‘key’ column (the INDEX) and a ‘value’ column.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Does a key-value database have JOINs or Foreign Keys (integrity constraints)?

A

NO

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What really are key-value pairs in terms of data structures?

A

Associative arrays (dict)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Is there explicit connection between 2 entries in key-value DB?

A

No

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Is a key-value database good for sophisticated retrieval?

A

No

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How are things retrieved in a key-value database?

A

Just ‘walk over’ the whole database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Why use a key-value database?

A
  • Already vertically fragmented
  • Very easy to distribute
  • Very easy to parallelize
  • Each row is already completely separated from another row!
  • Partitioning is ALWAYS horizontal as already vertically fragmented
17
Q

How should processing happen in a key-value database?

A

Raw processing should happen very close to where the data is stored.

18
Q

What are the stages of processing data from a key-value database?

A
19
Q

MapReduce - how is information first processed?

A

MAP: Gather information from the base table - to get something intermediate. Analogous to any JOINs and WHERE statements –> producing an intermediate table.

20
Q

MapReduce - what is the second stage of processing?

A

REDUCE: Aggregate, sort, do something with the intermediate results. E.g. GROUP BY, SELECT, COUNT/AVG/SUM etc.

21
Q

How/where is the MAP phase of Map-Reduce executed?

A
  1. Close to the data (direct access)
  2. Loops over all the data (or uses an index to iterate over parts of it) and transforms it (e.g. representing a word as 1) into a form useful to the reducer
  3. Outputs a new key-value set
22
Q

How/where is the REDUCE phase of Map-Reduce executed?

A
  1. Doesn’t have to happen close to the data
  2. Carried out by reducer-workers
  3. Summarize data based on keys, using equivalents of SQL GROUPBY functions –> aggregates and summarizes based on the KEY of the new dataset (e.g. counting all the different texts’ word-counts that have been mapped to 1 to output the count of words for each text)
  4. Output is collated & sent to users
23
Q

How can map-reduce be parallelized?

A

We know the reduce step is going to be based on a key. Reducer-workers are assigned based on the key (e.g. hash function to deduce which key to give to which worker).

24
Q

List 9 advantages of MapReduce and key-value databases.

A
  1. Easily parallelizable
  2. Fault tolerance –> can easily distribute processes to other nodes, easy to recover from failure of processing nodes (although there’s often a coordinate node, and some risk of centralization)
  3. Very scalable
  4. Flexible: can handle different kinds of data structures
  5. Cost-effective, don’t need expensive systems
  6. Very easy to distribute/distribute processing: it is designed for distributed data
  7. Easy to model
  8. Allows you to control the data that moves from one place to another: can decide where to do the processing.
  9. Privacy: map can run in a private space, but data sent to the reducers (i.e. anonymized) can be regarded as safe
25
Q

Summarize the stages of the map-reduce algorithm.

A
  1. Map: take input data and convert it into intermediate key-value pairs. Process chunks in parallel (scalable).
  2. Shuffle and sort: shuffle resulting key-value pairs and sort by key across parallel nodes. All values for one key are moved to a single reducer.
  3. Reduce: take the key-value pairs sorted by key, and aggregate or merge to form a smaller set of results.
26
Q

What is a key-value database?

A

A data structure with two columns - the key is a string, and the value is often a string or more complex object. Like an assoc array.

27
Q

Why does MapReduce work well with distributed architectures?

A
  • A key-value database can trivially be partitioned horizontally
  • Since the Reducer phase only needs the values for a single key at a time, the task can be parallellised easily
28
Q

What model is a compromise between relational database (hard to distribute) and key-value database (where harder to search/retrieve values)?

A

A DOCUMENT BASED DB.

29
Q

What is a document based db?

A
  • Structures are separate like rows in key-value DB
  • But more complicated
  • Less strictly controlled than rel system/tables –> column defs not prescribed
  • Rich, nested and deep data structures: less JOINs
  • Allows for repetition
  • LESS STRICT - NESTED -REPEATED
  • Less constrained, we don’t need to know columns we’ll need in advance
  • Order-sensitive (unlike SQL)
  • Documents seldom interlink: why they work as parallelizable structures
  • Documents are less interconnected than tables
  • Any connections between docs are NOT central to fast retrieval
  • Less interlinking
  • Less well-defined
  • Less guaranteeable
  • CANNOT LINK BETWEEN DOCUMENTS
  • No cross-referencing
  • Cannot embed variables in the code (unlike JavaScript)
  • XML: also a document database
30
Q

What are the JSON use cases?

A
  1. Data persistence (serialisation)
  2. Data exchange/transfer
  3. Good for file storage + writing data out
  4. Easy to serialize and share
  5. Very easy distributed processing
31
Q

What is MongoDB?

A

A document-based database model. It is open-source and has copyleft elements. It operates on JSON objects.

32
Q

How is MongoDB distributed?

A

By “sharding”: taking horizontal partitions.