Database Flashcards
What is ACID
ACID is a set of properties of relational database transactions.
- Atomicity - Each transaction is all or nothing
- Consistency - Any transaction will bring the database from one valid state to another
- Isolation - Executing transactions concurrently has the same results as if the transactions were executed serially
- Durability - Once a transaction has been committed, it will remain so
Techniques to scale a relational database
master-slave replication, master-master replication, federation, sharding, denormalization, SQL tuning.
DB scaling - Master-slave replication
The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion.
Pro:
1. Avalibility - If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.
Con:
- Additional logic is needed to promote a slave to a master.
- single central master serializing writes, low through put
- replication cons
Disadvantage(s): replication
(DL-RS-WL-LC)
- Potential data loss
- slow down read - too much writes/ replicated will
- Replication lag - too many slave will
- increase hardware and system complexity
- loose consistency (no ACID)
- complexity
DB scaling - Master- Master replication
Pros:
1. One is down can continue serve both read and write
Cons:
- Additional load balancer to route the writes/read
- replication cons
DB scaling - Federation
Federation (or functional partitioning) splits up databases by function. For example, instead of a single, monolithic database, you could have three databases: forums, users, and products.
Trade-off: see partitioning trade off
DB scaling - Sharding
Sharding distributes data across different databases such that each database can only manage a subset of
Common ways to shard a table of users is either through the user’s last name initial or the user’s geographic location.
DB scaling - Denormalization
Pros:
1. improve read performance - avoid expensive joins
cons: (R-DUP-SYN)
1. Mainly suitable for heavy read/light writes
2. Data is duplicated.
3. Synchronization of duplication information
4. A demoralized database under heavy write load might perform worse than its normalized counterpart.
Database scaling - Partitioning /sharding/fedration
Pros:(OS-IS-WRP-WRL-CH)
- Overall-scaling - when more users, just add more shard(consistent hashing to avoid re-balancing)
- independent scaling - useful when some functionality/shard/partition have much more data than others
- write/read in parallel, improve throughput
- less / read/ write traffic and replication for each partition
- improve cache hits - as same query always go to same db server
Cons: (DLRC)
- potential data loss
- load balancer/router determines read/write
- difficult joining
- One partition can become super big - result in complex re-balancing
- more hard ware, more complexity, more complex application logic
NoSQL
NoSQL is a collection of data items represented in a key-value store, document store, wide column store, or a graph database. Data is denormalized, and joins are generally done in the application code. Most NoSQL stores lack true ACID transactions and favor eventual consistency.
BASE is often used to describe the properties of NoSQL databases. In comparison with the CAP Theorem, BASE chooses availability over consistency.
Basically available - the system guarantees availability.
Soft state - the state of the system may change over time, even without input.
Eventual consistency - the system will become consistent over a period of time, given that the system doesn’t receive input during that period.
In addition to choosing between SQL or NoSQL, it is helpful to understand which type of NoSQL database best fits your use case(s). We’ll review key-value stores, document stores, wide column stores, and graph databases in the next section.
NoSQL - Key-value store
A key-value store generally allows for O(1) reads and writes and is often backed by memory or SSD. Data stores can maintain keys in lexicographic order, allowing efficient retrieval of key ranges. Key-value stores can allow for storing of metadata with a value.
Key-value stores provide high performance and are often used for simple data models or for rapidly-changing data, such as an in-memory cache layer. Since they offer only a limited set of operations, complexity is shifted to the application layer if additional operations are needed.
A key-value store is the basis for more complex systems such as a document store, and in some cases, a graph database.
NoSQL - Document store
A document store is centered around documents (XML, JSON, binary, etc), where a document stores all information for a given object. Document stores provide APIs or a query language to query based on the internal structure of the document itself. Note, many key-value stores include features for working with a value’s metadata, blurring the lines between these two storage types.
Based on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Although documents can be organized or grouped together, documents may have fields that are completely different from each other.
Some document stores like MongoDB and CouchDB also provide a SQL-like language to perform complex queries. DynamoDB supports both key-values and documents.
Document stores provide high flexibility and are often used for working with occasionally changing data.
NoSQL - Wide column store
The names and format of the columns can vary from row to row in the same table
Keyspace - is kind of like a schema in the relational model. The keyspace contains all the column families.
Column families - (kind of like tables in the relational model), which contain rows
Row - rowKey and columns.
Column - name/value/timestamp
Super column families - further group column families
Google introduced Bigtable as the first wide column store, which influenced the open-source HBase often-used in the Hadoop ecosystem, and Cassandra from Facebook. Stores such as BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.
Wide column stores offer high availability and high scalability. They are often used for very large data sets.
NoSQL - Graph database
In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.
Graphs databases offer high performance for data models with complex relationships, such as a social network. They are relatively new and are not yet widely-used; it might be more difficult to find development tools and resources. Many graphs can only be accessed with REST APIs.
SQL or NoSQL
Reasons for SQL:
Structured data Strict schema Relational data Need for complex joins Transactions Clear patterns for scaling More established: developers, community, code, tools, Lookups by index are very fast
Reasons for NoSQL: Semi-structured data Dynamic or flexible schema Non-relational data No need for complex joins Store many TB (or PB) of data Very data intensive workload Very high throughput for IOPS Sample data well-suited for NoSQL:
Rapid ingest of clickstream and log data Leaderboard or scoring data Temporary data, such as a shopping cart Frequently accessed ('hot') tables Metadata/lookup tables