Database Selection & Data Modelling Flashcards

1
Q

What are two design choices made by SQL databases that impact their scalability?

A
  1. Strong Consistency over Availability [across tables, across servers]
  2. No Data Duplication [chose normalization and hence JOINS].
    Note: Data Duplication can scale but JOINs can’t.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Which database will offer faster writes, Cassandra or MongoDB?

A

Write in MongoDB needs to update applicable indexes whereas its not required for Cassandra.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are the challenges of sharding data while fetching data?

A
  1. Aggregating data (e.g. sorting) with “Scatter & Gather” queries is not efficient.
  2. JOIN across shards
  3. Schema Migration across shards

Note: Mongo does local sorting on each shard followed by merge sort in the primary shard.
Cassandra doesn’t allow queries where the partition is not filtered (not always??): TBD

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

You need to serve both real-time queries and analytical queries. How will you approach it?

A

In MongoDB, analytical queries can be served from replicas with indexes tuned for analytical queries. No need to use the real-time indexes on the replica.
In Cassandra, we need to duplicate data with a different Primary key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Is the process of creating a sharding key in Mongo equivalent to creating a Partition or Primary key in Cassandra?

A

Partition Key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are the rules to create a good partition or sharding key?

A

High uniqueness so that data is evenly distributed.
Tactic: Make it a compound key with a monotonically increasing key (e.g. counter or ts) at the end of the compound key. This additional key not part of the actual data is called a surrogate key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the two database-neutral steps while doing data modeling?

A

Conceptual data model (know your data, entities, and relationships)
Application Workflow & Access Patterns (know your queries, frequency & freshness)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

List four different bugs which can come due to the parallelization of transactions (w/o isolation i.e. w/o serializing them).

A
  1. Lost Update: T1 & T2 reads at same time. T2 updates modification done by T1.
  2. Dirty Read: T2 reading data that is modified by another ongoing/uncommitted transaction.
  3. Non Repeatable Read: Two reads in the same transaction but different values are returned.
  4. Phantom Read Anomaly: Two reads in the same transaction but new inserts were made by another b/w two reads.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are different isolation levels defined by the SQL standard?

A

READ UNCOMMITTED (All anomalies, No Lock)
READ COMMITTED
REPEATABLE READ* (Lock on Row)
SERIALIZABLE (No anomaly, Lock on table)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Which database will offer faster reads, Cassandra or MongoDB?

A

Data in Cassandra, at any given time, can be spread across multiple SSL tables in disk and memtables in memory. For reading, we need to read and consolidate tables from multiple tables whereas we don’t have such a requirement for Mongo. Hence, MongoDB is supposed to be faster for read operations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly