Database Flashcards

1
Q

What is ACID

A

ACID is a set of properties of relational database transactions.

  • Atomicity - Each transaction is all or nothing
  • Consistency - Any transaction will bring the database from one valid state to another
  • Isolation - Executing transactions concurrently has the same results as if the transactions were executed serially
  • Durability - Once a transaction has been committed, it will remain so
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Techniques to scale a relational database

A
master-slave replication, 
master-master replication,
federation,
sharding, 
denormalization, 
SQL tuning.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

DB scaling - Master-slave replication

A

The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion.

Pro:
1. Avalibility - If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.

Con:

  1. Additional logic is needed to promote a slave to a master.
  2. single central master serializing writes, low through put
  3. replication cons
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Disadvantage(s): replication

A

(DL-RS-WL-LC)

  1. Potential data loss
  2. slow down read - too much writes/ replicated will
  3. Replication lag - too many slave will
  4. increase hardware and system complexity
  5. loose consistency (no ACID)
  6. complexity
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

DB scaling - Master- Master replication

A

Pros:
1. One is down can continue serve both read and write

Cons:

  1. Additional load balancer to route the writes/read
  2. replication cons
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

DB scaling - Federation

A

Federation (or functional partitioning) splits up databases by function. For example, instead of a single, monolithic database, you could have three databases: forums, users, and products.
Trade-off: see partitioning trade off

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

DB scaling - Sharding

A

Sharding distributes data across different databases such that each database can only manage a subset of

Common ways to shard a table of users is either through the user’s last name initial or the user’s geographic location.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

DB scaling - Denormalization

A

Pros:
1. improve read performance - avoid expensive joins

cons: (R-DUP-SYN)
1. Mainly suitable for heavy read/light writes
2. Data is duplicated.
3. Synchronization of duplication information
4. A demoralized database under heavy write load might perform worse than its normalized counterpart.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Database scaling - Partitioning /sharding/fedration

A

Pros:(OS-IS-WRP-WRL-CH)

  1. Overall-scaling - when more users, just add more shard(consistent hashing to avoid re-balancing)
  2. independent scaling - useful when some functionality/shard/partition have much more data than others
  3. write/read in parallel, improve throughput
  4. less / read/ write traffic and replication for each partition
  5. improve cache hits - as same query always go to same db server

Cons: (DLRC)

  1. potential data loss
  2. load balancer/router determines read/write
  3. difficult joining
  4. One partition can become super big - result in complex re-balancing
  5. more hard ware, more complexity, more complex application logic
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

NoSQL

A

NoSQL is a collection of data items represented in a key-value store, document store, wide column store, or a graph database. Data is denormalized, and joins are generally done in the application code. Most NoSQL stores lack true ACID transactions and favor eventual consistency.

BASE is often used to describe the properties of NoSQL databases. In comparison with the CAP Theorem, BASE chooses availability over consistency.

Basically available - the system guarantees availability.
Soft state - the state of the system may change over time, even without input.
Eventual consistency - the system will become consistent over a period of time, given that the system doesn’t receive input during that period.
In addition to choosing between SQL or NoSQL, it is helpful to understand which type of NoSQL database best fits your use case(s). We’ll review key-value stores, document stores, wide column stores, and graph databases in the next section.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

NoSQL - Key-value store

A

A key-value store generally allows for O(1) reads and writes and is often backed by memory or SSD. Data stores can maintain keys in lexicographic order, allowing efficient retrieval of key ranges. Key-value stores can allow for storing of metadata with a value.

Key-value stores provide high performance and are often used for simple data models or for rapidly-changing data, such as an in-memory cache layer. Since they offer only a limited set of operations, complexity is shifted to the application layer if additional operations are needed.

A key-value store is the basis for more complex systems such as a document store, and in some cases, a graph database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

NoSQL - Document store

A

A document store is centered around documents (XML, JSON, binary, etc), where a document stores all information for a given object. Document stores provide APIs or a query language to query based on the internal structure of the document itself. Note, many key-value stores include features for working with a value’s metadata, blurring the lines between these two storage types.

Based on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Although documents can be organized or grouped together, documents may have fields that are completely different from each other.

Some document stores like MongoDB and CouchDB also provide a SQL-like language to perform complex queries. DynamoDB supports both key-values and documents.

Document stores provide high flexibility and are often used for working with occasionally changing data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

NoSQL - Wide column store

A

The names and format of the columns can vary from row to row in the same table

Keyspace - is kind of like a schema in the relational model. The keyspace contains all the column families.
Column families - (kind of like tables in the relational model), which contain rows
Row - rowKey and columns.
Column - name/value/timestamp
Super column families - further group column families

Google introduced Bigtable as the first wide column store, which influenced the open-source HBase often-used in the Hadoop ecosystem, and Cassandra from Facebook. Stores such as BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.

Wide column stores offer high availability and high scalability. They are often used for very large data sets.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

NoSQL - Graph database

A

In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.

Graphs databases offer high performance for data models with complex relationships, such as a social network. They are relatively new and are not yet widely-used; it might be more difficult to find development tools and resources. Many graphs can only be accessed with REST APIs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

SQL or NoSQL

A

Reasons for SQL:

Structured data
Strict schema
Relational data
Need for complex joins
Transactions
Clear patterns for scaling
More established: developers, community, code, tools, 
Lookups by index are very fast
Reasons for NoSQL:
Semi-structured data
Dynamic or flexible schema
Non-relational data
No need for complex joins
Store many TB (or PB) of data
Very data intensive workload
Very high throughput for IOPS
Sample data well-suited for NoSQL:
Rapid ingest of clickstream and log data
Leaderboard or scoring data
Temporary data, such as a shopping cart
Frequently accessed ('hot') tables
Metadata/lookup tables
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

DB scaling - SQL tuning

A

It’s important to benchmark and profile to simulate and uncover bottlenecks.

Benchmark - Simulate high-load situations with tools such as ab.
Profile - Enable tools such as the slow query log to help track performance issues.
Benchmarking and profiling might point you to the following optimizations.

Tighten up the schema:
1. MySQL dumps to disk in contiguous blocks for fast access.
2. Use CHAR instead of VARCHAR for fixed-length fields.
3. CHAR effectively allows for fast, random access, whereas with VARCHAR, you must find the end of a string before moving onto the next one.
Use TEXT for large blocks of text such as blog posts. TEXT also allows for boolean searches. Using a TEXT field results in storing a pointer on disk that is used to locate the text block.
Use INT for larger numbers up to 2^32 or 4 billion.
Use DECIMAL for currency to avoid floating point representation errors.
Avoid storing large BLOBS, store the location of where to get the object instead.
VARCHAR(255) is the largest number of characters that can be counted in an 8 bit number, often maximizing the use of a byte in some RDBMS.
Set the NOT NULL constraint where applicable to improve search performance.
Use good indices
Columns that you are querying (SELECT, GROUP BY, ORDER BY, JOIN) could be faster with indices.
Indices are usually represented as self-balancing B-tree that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
Placing an index can keep the data in memory, requiring more space.
Writes could also be slower since the index also needs to be updated.
When loading large amounts of data, it might be faster to disable indices, load the data, then rebuild the indices.
Avoid expensive joins
Denormalize where performance demands it.
Partition tables
Break up a table by putting hot spots in a separate table to help keep it in memory.
Tune the query cache
In some cases, the query cache could lead to performance issues.