Databases Flashcards

1
Q

What are the advantages of SQL over NoSQL

A

ACID vs CAP. Supports many to many relationships very well through normalized relational data.

If ACID is a much, sql is the way to go (like financial transactions).

Good for transactions though NoSQL supports that too, financial data may want sql for acid.

Supports dynamic querying though ORMs should resolve this issue anyways.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

When should you use NoSQL overs equal. What are all he advantages

A

Essentially Key-value store:

Ability to store unstructured data, have flexible data models

Having a one to many relationship for data, embedding references instead of having relational ones, more human readable and faster machine readable than ‘normalization’.

Can capture highly complex parent child hierarchical structures
Good for horizontal scalability (have automatic solutions too).

Many have built in scalability, sharing and availability requirements for modern web apps.

Good for development because they are unstructured

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are advantages and disadvantages of indexes

A

Advantage, reads are fast
Disadvantage, writes slow down

They are normally stored as a B-tree or hash depending on what you’re querying in SQL.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Why do database indexes use a B-tree sometimes instead of a hash table?

A

A B-tree is a generalized binary tree with ‘m’ children.

This allows for Log(n) range selection for sequential data. and log(n) for column comparisons for =,>,>=, between, etc

A hash table would have to do O(n) operations to get a range or comparisons

Hashtables are more useful for primary keys, distinct/unique/ or foreign keys and for ‘=’ or ‘not = ‘ .

Hashtables are also harder to re-index when scaling, since you have pre-defined buckets. This can be inefficient storage.

more reading https://dev.mysql.com/doc/refman/8.0/en/index-btree-hash.html

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How do multi column indexes work in SQL

A

not sure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What does ACID Stand for

A

Atomicity - the ability to transactionally commit changes to the database and to rollback changes if there are errors. ‘ All or nothing’.

Consistency - Transactions find the database and leave database in valid state. Transactions are violating any constraints other they will be rolled back. Keep same datatypes, or foreign key/primary key relationships.

Isolation - no other transactions can affect the current transaction in question. Prevents “mid-air Collins”

Durability - System failures or restarts do not affect committed transactions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is InnoDB?

A

Its is a storage engine. MySQL in particular uses this.

Supports ACID transactions, caches, B-tree indexes, encryption foreign key support and has row granularity for locking.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Describe the need for index locking, and index concurrency.

A

Indexes dramatically increase read speed but slow down write speed, due to adding keys to the index when writing and updating. Only add indexes if necessary.

Index locking is needed for ACID. If a write or update is done with data in an index, the index needs to maintain its integrity so as to not provide bad data to other transactions.

https: //en.wikipedia.org/wiki/Index_locking
https: //en.wikipedia.org/wiki/Database_index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is Denormalization

A

Denormalization is a concept which allows for moving data from requiring a bunch of sql ‘joins’ to just reads and doing joins in code or storing differently in a nosql database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is Database ‘sharding’?

When do we need to shard?

A

Splitting database into smaller pieces. Also called horizontal partitioning.

You should shard when

1) The write workload on a single server exceeds capacity.
2) Working set no longer fits in ram
3) Single server no longer handle the size of a dataset

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Sharding

advantages and disadvantages

A

Advantages

1) It makes database searches smaller and faster and easier to maintain.
2) Often reduces transaction costs.
3) Easier to fit on smaller machines rather than one big machine

Disadvantages:
1)sharding itself can be a complex or expensive operation.

2) Data integrity/staleness - Have to make sure data integrity exists during reads and writes, and referential integrity (FK/PK) maintain integrity. Often need more cleanup to delete stale references
3) Joins across shards are not efficient. Should restrict “galactic” reporting. Do merging in applications wherever possible. Denormalization can solves this but then data inconsistency becomes an issue
4) Shard Key must be chosen well. Otherwise hot spots can occur.
5) schema changes can be expensive .
6) Rebalancing can be expensive or require downtime if we change our partitioning scheme or find hotspots. May be solved with directory based partitioning.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the difference between an exclusive lock and shard lock

A

An exclusive lock (also called a write lock) prevents any other transaction from reading or writing a record. Good for atomicity.

A Shard lock ensures no writes are made to a row that is locked while other rows are reading it.

By their nature any number of transactions can hold a shared lock, while one transaction at a time can hold an exclusive lock.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What should you consider when picking a shard key

A

Shard keys should have high cardinality (measure of elements with in a set) and avoid hotspots with low frequency (lots of keys tending to fall on a particular shard)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is the CAP Theorem

A

CAP Theorem - Impossible for a distributed system to have more than two out of the three following guarantees:

Consistency - all nodes see same data t the same time. Need to update all nodes before further reads

Availability - every request to the system gets a response. Can do this by replicating servers (scaling horizontally).

Partition Tolerance - a system is resilient to failure in any of its parts. Data is sufficiently replicated across nodes and networks.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are different partitioning techniques

A

Horizontal Paritioning: (range based partitioning - data sharding) - placing different rows into different tables.

Vertical Partitioning - moving different tables to different DBs by function.

Directory Based Partitioning - “loose coupling” placing a lookup (separate) service in front of databases, which knows the partition scheme. Solves elastic scaling problem without using consistent hashing.

Criteria

Key-Hash Based partitioning: Keys are ran through. hash function which then determine which database to put the key in. Disadvantage is rebalancing is difficult when servers are added and removed. This problem can be solved with consistent hashing.

List Partitioning: Each list value is assigned a server (like location based partitioning)

Round-robin partitioning: simple strategy that ensures uniform data distribution.

composite partitioning: a combination of any above. like List + hash partitioning. Consistent hashing could be considered a composite of hash and list partitioning

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Why is consistent hashing needed?

How does it work?

A

Consistent hashing solves the problem of needing elastic horizontal scalability. It avoids have to re-hash entries in db or server shards.

Lets say we have n servers and k keys.

1) create a hash space where keys will map to
2) if we imagine the hash space as circle - we hash a key to get an integer, then find the next integer that maps to an actual server
3) Adding a Server - the advantage is if we add a new server then we only need to re-map k/n keys on average (the space between servers in the hash space, rather than nearly all the keys). The more servers, the less you need to re-hash.

4) removing a server - in a similar fashion if a server goes down only keys in the space of the server before it on the circle will be affected and need to be remapped. (K/N KEYS).
5) final piece is adding virtual nodes which also map to real servers which essentially increases n and solves the problem of a server going down creating hotspotsOR un-uniform keys. high n means keys being more evenly mapped and more resistant to keys having to be re-mapped

Note: in real systems numbers of replicas is very large (>100)

17
Q

Sparse Index vs Dense Index

A

A sparse index points to a block of data to look a value up where as a dense index has a pointer for every key to its row.

18
Q

Reverse Index

A

Reverse indexes store values backwards, sometimes useful for numbers where keys increase in value or searching a string that ends with a certain value ‘%somethign’.

19
Q

Bit Map Index

A

A special index that stores its data as bit arrays and answers most queries by performing bitwise logical operations. Bit Map indexes are useful for highly repeated index values.

20
Q

What are types of no-sql data stores

A

Key-Value stores: Redis (also a caching mechanism), Voldemort and dynamo. Just has a key which is linked to a value.

Document Databases - MongoDB, CouchDB. Data is stored in documents and documents are grouped into collections.

Wide-Column Databases - Cassandra and HBase. We have column families instead of ‘tables’ . We don’t need to know all columns upfront. Best suited for large datasets

Graphdatabses: Infinitegraph, Neo4J. Use to store data who’s relationships are well represented in a. graph. Data