Distributed databases and filesystems Flashcards

1
Q

Why do we need to replicate data?

A

Replicating data allows:
The system to keep working, even when some nodes are down.
Data to be kept (geographically) close to the clients.
Throughput to increase, since more machines can serve read-only requests.

2
Q

What are the most common replication architectures?

A

Replication involves two roles:
Master (nodes that accept writes from clients)
Slaves (nodes that provide read-only access)

This gives the following three architectures:
Master-slave: One master accepts writes and distributes them to the slaves.
Allows scaling reads, but writes are unavailable while the master is down.
Master-master: Multiple masters accept writes, keep themselves in sync, and update the slaves.
Allows scaling writes and stays available if one master crashes, but concurrent writes can lead to conflicts.
Leaderless: All nodes are peers in the replication network.
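
The master-slave architecture above can be illustrated with a minimal in-memory sketch. The `Master` and `Replica` classes are hypothetical names made up for this example, and replication is done synchronously only to keep the code short; real systems often replicate asynchronously.

```python
class Replica:
    """A read-only follower that applies writes forwarded by the master."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Master:
    """The single node that accepts client writes."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        # Assumption: synchronous forwarding, so replicas are never stale here.
        for replica in self.replicas:
            replica.apply(key, value)


replicas = [Replica(), Replica()]
master = Master(replicas)
master.write("user:1", "Ada")

# Reads can be served by any replica, which is what lets reads scale.
reads = [replica.read("user:1") for replica in replicas]
```

If the `Master` object disappears, nothing in this sketch can accept a write, which mirrors the availability weakness noted above.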

3
Q

Why do we need to partition datasets?

A

Because it allows the system to scale: queries can run in parallel on parts of the dataset,
and reads and writes can be spread across multiple machines.
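
As a rough illustration, hash partitioning is one common way to split a dataset. The sketch below assumes a key/value dataset and hypothetical helper names (`partition_for`, `partition_dataset`); it computes a sum per partition and then combines the partial results, the way a parallel query would.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_dataset(records, num_partitions):
    """Split (key, value) records into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [(f"user:{i}", i) for i in range(100)]
partitions = partition_dataset(records, 4)

# Each partition could be processed on a different machine; here we just loop.
partial_sums = [sum(value for _, value in part) for part in partitions]
total = sum(partial_sums)
```

Because every key always hashes to the same partition, a read or write for a given key only has to touch one machine.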

4
Q

What are the most common transaction isolation levels?

A

There are two common levels, and three weaker levels:
Serializability: Transactions behave as if they were executed one after another. This can be achieved by actually executing them on a single core, by 2 Phase Locking (2PL), or by MVCC (see below) with Copy-on-Write data structures.
Multi-Version Concurrency Control (MVCC): “works like git”. Each transaction sees a snapshot of the data as of the moment it started. When that transaction commits, any object changed by it is written as a new version. If another transaction updated the same object in the meantime, the DB will report a conflict.

The three weaker levels are:
Repeatable read, which does not protect against Phantom reads (PR)
Read committed, which does not protect against non-repeatable reads (NRR) and PR
Read uncommitted, which does not protect against dirty reads, NRR, or PR
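
The MVCC behaviour described above can be sketched with a toy versioned store. This is an illustration under simplifying assumptions (a single logical clock, conflicts detected only at commit time, hypothetical class names), not how any particular database implements it.

```python
class ConflictError(Exception):
    """Raised when another transaction committed a newer version first."""


class VersionedStore:
    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical commit timestamp

    def begin(self):
        return Transaction(self, self.clock)


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Return the newest version committed at or before our start.
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Conflict if any key we wrote got a newer version after we started.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                raise ConflictError(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))


store = VersionedStore()
setup = store.begin()
setup.write("x", 1)
setup.commit()

t1 = store.begin()
t2 = store.begin()
t1.write("x", 2)
t1.commit()

snapshot_value = t2.read("x")  # t2 still sees the version from when it started
t2.write("x", 3)
try:
    t2.commit()
    conflict = False
except ConflictError:
    conflict = True
```

Here `t2` keeps reading its original snapshot even after `t1` commits, and its own commit is rejected because `x` changed underneath it, much like a git push that is rejected when the branch has moved on.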

5
Q

What are the dangers of weaker isolation?

A

There are three possible dangers with weaker isolation:
Dirty reads: A transaction reads data written by a concurrent, uncommitted transaction.
Non-repeatable reads: A transaction re-reads data it read earlier and finds that another committed transaction has modified it.
Phantom reads: The result of a query changes when it is run again, because other transactions were committed in the meantime.
Serializability and MVCC protect against all of these.
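
A toy example may make the first two anomalies concrete. `ToyDB` below is a hypothetical single-writer store invented for this sketch; it only separates committed data from pending (uncommitted) writes, which is enough to show a dirty read and a non-repeatable read.

```python
class ToyDB:
    def __init__(self):
        self.committed = {"balance": 100}
        self.pending = {}  # uncommitted writes of one in-flight writer

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()

    def read_uncommitted(self, key):
        # Sees pending writes from the other transaction: dirty reads possible.
        return self.pending.get(key, self.committed[key])

    def read_committed(self, key):
        # Sees only committed data, but always the latest committed value.
        return self.committed[key]


db = ToyDB()

# Dirty read: the reader observes a value that is later rolled back.
db.write("balance", 0)
dirty = db.read_uncommitted("balance")
db.rollback()
actual = db.read_committed("balance")

# Non-repeatable read: two reads in the "same" reader transaction disagree.
first = db.read_committed("balance")
db.write("balance", 50)
db.commit()
second = db.read_committed("balance")
```

In this sketch `dirty` is a value that never officially existed, while `first` and `second` differ even though the reader did nothing in between.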

6
Q

What does ACID mean?

A

Atomicity: Transactions either fully succeed or fully fail, i.e., no partial writes.
Consistency: Any transaction will bring the database from one valid state to another.
Isolation: Concurrent transactions do not influence one another.
Durability: Once a transaction has been committed, it will persist indefinitely, regardless of faults/crashes.
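
Atomicity and consistency can be observed with Python's built-in `sqlite3` module. The sketch below assumes a hypothetical `accounts` table with a `CHECK` constraint that forbids negative balances; the `transfer` helper is made up for this example. A transfer that would overdraw an account fails as a whole, so the credit that was already applied is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; return False if the transaction failed."""
    try:
        # The connection context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 150)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, `balances` still holds the original amounts: the credit to `bob` did not survive on its own.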

7
Q

Describe how HDFS will store a file.

A

The client asks the NameNode to write a file.
The NameNode provides the addresses of the DataNodes that will store the data.
Then the client writes the data directly to the DataNodes.
Internally, the DataNodes replicate the data x times, where x is the configured replication factor (3 by default).
Once the data is replicated, the DataNode sends an acknowledgment to the client.
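
The write path above can be mimicked with a small in-memory simulation. This is not the HDFS client API: the `NameNode`, `DataNode`, and `write_file` names are hypothetical, block placement is a naive rotation (real HDFS placement is rack-aware), and the sketch assumes the replication factor does not exceed the number of DataNodes.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data, pipeline):
        """Store the block, forward it downstream, return True once all acked."""
        self.blocks[block_id] = data
        if pipeline:
            next_node, rest = pipeline[0], pipeline[1:]
            return next_node.write_block(block_id, data, rest)
        return True  # last node in the pipeline starts the acknowledgment


class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_locations = {}  # block_id -> list of DataNode names

    def allocate_block(self, path, index):
        """Pick the DataNodes for one block (assumption: simple rotation)."""
        block_id = f"{path}#blk{index}"
        n = len(self.datanodes)
        chosen = [self.datanodes[(index + i) % n] for i in range(self.replication)]
        self.block_locations[block_id] = [dn.name for dn in chosen]
        return block_id, chosen


def write_file(namenode, path, data, block_size):
    """Client side: ask the NameNode where to write, then write to DataNodes."""
    for index, offset in enumerate(range(0, len(data), block_size)):
        block_id, pipeline = namenode.allocate_block(path, index)
        chunk = data[offset:offset + block_size]
        acked = pipeline[0].write_block(block_id, chunk, pipeline[1:])
        if not acked:
            raise IOError(f"block {block_id} was not acknowledged")


datanodes = [DataNode(f"dn{i}") for i in range(4)]
namenode = NameNode(datanodes, replication=3)
payload = b"hello hdfs!"
write_file(namenode, "/demo.txt", payload, block_size=4)
```

Note that the NameNode in this sketch only hands out locations and records metadata; the file contents travel from the client to the first DataNode and then from DataNode to DataNode.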
