Week 9 L1 Flashcards
Problems with a traditional database?
Single point of failure if machine, storage, or network breaks
Must scale up vertically to a bigger machine.
Expensive, inflexible, and one-way (scaling up only).
What do we mean by a database on a cluster?
Hundreds of connected commodity machines
Advantages of a database on a cluster
Data replicated across machines to provide resilience.
No SPOF: replicas on other nodes remain available.
Can scale out horizontally by adding more machines (cheaper, flexible, can scale in or out, e.g. by renting cloud services).
Why replicate data?
Resilience: databases and networks fail, but the business must continue as normal.
Performance: improves to some extent, by giving access to a local replica or by balancing the workload across replicas.
What is synchronous replication?
All replicas updated on every write.
Reads are guaranteed to be up to date so safe to read from any node.
A read must wait until all machines have been updated, so it can be too slow for some applications.
Only used if reads MUST be up to date.
Works best when writes are infrequent, e.g. online banking.
What is asynchronous replication?
Writes propagate as soon as possible, but reads do not wait.
This means reads can be out of date.
The replicas reach eventual consistency.
Works well if reads can be a little out of date, e.g. social media posts.
Asynchronous replication methods include primary site and peer to peer.
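The difference between synchronous and asynchronous replication can be sketched with a toy in-memory model. This is only an illustration: the class ReplicaSet and its methods are made-up names, each replica is a plain dictionary, and "propagation" is a manual call rather than a real network transfer.

```python
from collections import deque


class ReplicaSet:
    """Toy model of replicated storage; every name here is hypothetical."""

    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # writes not yet applied to the other replicas

    def write_sync(self, key, value):
        # Synchronous: every replica is updated before the write returns.
        for replica in self.replicas:
            replica[key] = value

    def write_async(self, key, value):
        # Asynchronous: one replica is updated now, the rest catch up later.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def propagate(self):
        # Stand-in for background propagation to the remaining replicas.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value

    def read(self, key, replica):
        return self.replicas[replica].get(key)


rs = ReplicaSet(3)
rs.write_sync("balance", 100)
sync_reads = [rs.read("balance", i) for i in range(3)]  # all replicas agree

rs.write_async("balance", 80)
stale_read = rs.read("balance", 2)  # replica 2 has not caught up yet
rs.propagate()
fresh_read = rs.read("balance", 2)  # consistent once propagation has run
```

After the asynchronous write, a read from replica 2 still sees the old value until propagate() runs, which is the "out of date but eventually consistent" behaviour described above.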
What is the primary site method?
Used by MongoDB.
One replica is the primary node and the other nodes are secondary nodes.
All writes go to the primary node and are then propagated to the secondaries.
Secondaries can be read but not written.
Not a SPOF: if the primary fails, the other nodes select a new primary.
Reading from the primary gives strict consistency, whereas reading from a secondary may return stale data.
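A minimal sketch of the primary site idea, assuming a toy in-memory cluster. The class names and the "first live node wins" election rule are simplifications invented for this example; they are not how MongoDB or any real system chooses a new primary.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class PrimarySiteCluster:
    """Toy primary/secondary cluster; all names are hypothetical."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.primary = self.nodes[0]

    def write(self, key, value):
        # All writes go to the primary only.
        self.primary.data[key] = value

    def replicate(self):
        # Propagate the primary's state to every live secondary.
        for node in self.nodes:
            if node.alive and node is not self.primary:
                node.data.update(self.primary.data)

    def read(self, key, from_primary=True):
        if from_primary:
            return self.primary.data.get(key)
        secondary = next(
            n for n in self.nodes if n.alive and n is not self.primary
        )
        return secondary.data.get(key)

    def fail_primary(self):
        # Simplified election (an assumption): the first live node takes over.
        self.primary.alive = False
        self.primary = next(n for n in self.nodes if n.alive)


cluster = PrimarySiteCluster(["a", "b", "c"])
cluster.write("x", 1)
strict = cluster.read("x")                     # read from the primary
stale = cluster.read("x", from_primary=False)  # secondary not yet updated
cluster.replicate()
cluster.fail_primary()
new_primary = cluster.primary.name
after_failover = cluster.read("x")
```

The secondary read before replicate() returns nothing, while the data survives the loss of the original primary because it had been propagated first.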
Why is eventual consistency useful?
Reads are spread across multiple secondary nodes, which increases performance.
Offline analytics can read historical data from a secondary node to avoid overloading the primary.
What is the peer to peer method? When is it used?
All nodes are allowed to accept reads and writes (no primary node).
This reduces latency in systems with a high write rate (no primary node bottleneck).
Can cause inconsistency problems, e.g. where two peers receive conflicting updates.
Used for high-velocity, write-once applications or where each piece of data has one owner.
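The conflict problem can be illustrated with two peers modelled as dictionaries. This is a hedged sketch: find_conflicts is a made-up helper, and it only detects conflicts, it does not resolve them.

```python
def find_conflicts(peer_a, peer_b):
    """Return the keys that both peers hold with different values."""
    shared_keys = peer_a.keys() & peer_b.keys()
    return {k for k in shared_keys if peer_a[k] != peer_b[k]}


peer_a = {}
peer_b = {}

peer_a["seat_12A"] = "alice"  # write accepted by peer A
peer_b["seat_12A"] = "bob"    # conflicting write accepted by peer B
peer_a["log_001"] = "login"   # write-once record, only ever written at A

conflicts = find_conflicts(peer_a, peer_b)
```

Only the key written at both peers is reported. The write-once record cannot conflict, which is why peer to peer suits write-once data or data with a single owner.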
What is sharding?
Partitioning a database into subsets of data so the data is spread across the nodes in a cluster.
For example, the data might be split by location.
Advantages of sharding
Partitions the database into subsets of data,
allowing the data to be spread across the nodes in a cluster.
What is a sharding key?
This determines the distribution of records among the shards.
Based on one or more chosen fields
Shard keys can be chosen manually
How to choose a sharding key?
The sharding key must appear in every document.
The key should be splittable with high granularity.
The key should be uniformly distributed across records.
The key should relate to the application's queries for fast performance.
Use a compound key if no single key is suitable.
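Why granularity and uniform distribution matter can be seen by counting how many records land on each shard for two candidate keys. This is a sketch under assumptions: placing records by a hash of the key is chosen here only for illustration, and shard_for, the field names, and the sample data are all invented.

```python
import hashlib
from collections import Counter


def shard_for(key_value, n_shards):
    """Map a key value to a shard number via a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(str(key_value).encode()).hexdigest()
    return int(digest, 16) % n_shards


# Made-up records: user_id has many distinct values, country has only two.
records = [
    {"user_id": i, "country": "UK" if i % 10 else "FR"}
    for i in range(1000)
]

by_user = Counter(shard_for(r["user_id"], 4) for r in records)
by_country = Counter(shard_for(r["country"], 4) for r in records)
```

Keying on user_id spreads the records over all four shards, while keying on country can never use more than two shards and puts at least 900 of the 1000 records on one of them.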
What is query isolation?
For a query whose key values determine a single shard, reads and writes are faster.
For queries that don't include the shard key, all shards must be polled, so these queries take longer to complete.
Knowledge of the application's significant queries is important for choosing the shard key.
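Query isolation can be sketched as a routing decision. The function target_shards, the field customer_id, and the modulo placement rule are assumptions made for this example only.

```python
N_SHARDS = 4


def target_shards(query, shard_key="customer_id", n_shards=N_SHARDS):
    """Return the shard numbers a router would have to contact (toy model)."""
    if shard_key in query:
        # The key value pins the query to exactly one shard.
        return [query[shard_key] % n_shards]
    # No shard key in the query: every shard must be polled.
    return list(range(n_shards))


targeted = target_shards({"customer_id": 42})  # one shard
broadcast = target_shards({"city": "Leeds"})   # all shards
```

The first query touches a single shard, the second touches all four, which is why the shard key should match the queries the application runs most often.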
Range-based sharding keys
Divides data into contiguous ranges determined by shard key value. Documents with close shard key values are likely to be on the same chunk or shard.
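A minimal sketch of range-based placement, assuming integer shard key values and two made-up boundaries that split the key space into three chunks.

```python
from bisect import bisect_right

# Hypothetical chunk boundaries: chunk 0 holds keys below 100,
# chunk 1 holds 100 to 199, chunk 2 holds 200 and above.
BOUNDARIES = [100, 200]


def chunk_for(key_value):
    """Return the index of the contiguous range that contains key_value."""
    return bisect_right(BOUNDARIES, key_value)


low = chunk_for(99)
near_a = chunk_for(100)
near_b = chunk_for(101)
high = chunk_for(250)
```

Keys 100 and 101 fall in the same chunk because their values are close, while 99 sits just across a boundary in the previous chunk.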