Database Selection & Data Modelling Flashcards

Question 1

Q

What are two design choices made by SQL databases that impact their scalability?

Answer

A

Strong consistency over availability [across tables, across servers].
No data duplication [they chose normalization, and hence JOINs].
Note: Duplicated (denormalized) data can scale across servers, but JOINs can’t.

Question 2

Q

Which database will offer faster writes, Cassandra or MongoDB?

Answer

A

Cassandra, generally. A write in MongoDB must also update every applicable index, whereas this is not required in Cassandra, where a write is appended to the commit log and a memtable.

Question 3

Q

What challenges does sharding introduce when fetching data?

Answer

A

Aggregating data (e.g. sorting) needs “Scatter & Gather” queries, which are inefficient because every shard has to be queried.
JOINs across shards
Schema migration across shards

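Note: Mongo sorts locally on each shard and then merge-sorts the per-shard results on a single node (the primary shard in older versions, mongos in newer ones).
Cassandra, by default, rejects queries that do not restrict the partition key. This is not absolute: such queries can run with ALLOW FILTERING or through a secondary index, at the cost of scanning many partitions.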
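
The "sort locally, then merge" pattern from the note above can be sketched in a few lines. This is a minimal illustration in plain Python, not MongoDB's implementation; the shard contents and the function name scatter_gather_sorted are hypothetical.

```python
import heapq

def scatter_gather_sorted(shards, key=None):
    """Sort each shard locally, then k-way merge the sorted runs."""
    # "Scatter": every shard sorts its own data independently.
    local_runs = [sorted(shard, key=key) for shard in shards]
    # "Gather": a single node merges the already-sorted runs.
    return list(heapq.merge(*local_runs, key=key))

# Hypothetical data: three shards holding order totals.
shards = [[42, 7, 19], [3, 88], [15, 1, 60]]
result = scatter_gather_sorted(shards)
print(result)
```

Even when the caller wants only a handful of rows, every shard has to be contacted and has to sort its data, which is why this kind of query gets more expensive as the number of shards grows.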

Question 4

Q

You need to serve both real-time queries and analytical queries. How will you approach it?

Answer

A

In MongoDB, analytical queries can be served from replicas whose indexes are tuned for analytical queries. Those replicas do not need to carry the indexes used for real-time queries.
In Cassandra, we need to duplicate the data into another table with a different primary key.
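
The Cassandra approach (the same rows stored twice under different primary keys) can be sketched with two in-memory maps. This is only an illustration of the idea, not Cassandra code; the table names, field names, and insert_order helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical "tables": the same rows, keyed for two different queries.
orders_by_customer = defaultdict(list)  # real-time: "orders of customer X"
orders_by_day = defaultdict(list)       # analytical: "all orders on day D"

def insert_order(order):
    """Write the same row to both tables (deliberate duplication)."""
    orders_by_customer[order["customer_id"]].append(order)
    orders_by_day[order["day"]].append(order)

insert_order({"order_id": 1, "customer_id": "c1", "day": "2024-01-01", "total": 30})
insert_order({"order_id": 2, "customer_id": "c2", "day": "2024-01-01", "total": 50})
insert_order({"order_id": 3, "customer_id": "c1", "day": "2024-01-02", "total": 20})

# Each query is answered from the table whose key matches it.
customer_orders = [o["order_id"] for o in orders_by_customer["c1"]]
daily_revenue = sum(o["total"] for o in orders_by_day["2024-01-01"])
```

The cost of this design is on the write path: every insert is written twice, and the application has to keep both copies in step.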

Question 5

Q

Is the process of creating a sharding key in Mongo equivalent to creating a Partition or Primary key in Cassandra?

Answer

A

Partition key. Both decide which node (shard or partition) a given record is stored on.

Question 6

Q

What are the rules to create a good partition or sharding key?

Answer

A

High cardinality (many distinct values), so that data is evenly distributed.
Tactic: Make it a compound key and put a monotonically increasing value (e.g. a counter or timestamp) at the end. Such an additional key, which is not part of the actual data, is called a surrogate key.
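
The effect of appending a surrogate value to a low-cardinality key can be seen in a small simulation. This is a sketch under assumptions: the four-partition cluster, the md5-based partition_for function, and the country data are all hypothetical and do not reflect how any particular database hashes keys.

```python
import hashlib
from collections import Counter
from itertools import count

NUM_PARTITIONS = 4  # hypothetical cluster size

def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (md5 only for determinism)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

surrogate = count()  # monotonically increasing counter, not part of the data

countries = ["IN", "US"] * 500  # low-cardinality natural key, 1000 rows

# Natural key alone: only two distinct values, so at most two partitions are used.
natural_only = Counter(partition_for(c) for c in countries)

# Compound key (country + surrogate): every row gets a distinct key.
compound = Counter(partition_for(f"{c}:{next(surrogate)}") for c in countries)

print("natural key :", dict(natural_only))
print("compound key:", dict(compound))
```

The trade-off is on the read side: once rows for one country are spread over many partitions, a query for that country has to visit several of them.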

Question 7

Q

What are the two database-neutral steps in data modeling?

Answer

A

Conceptual data model (know your data, entities, and relationships)
Application Workflow & Access Patterns (know your queries, frequency & freshness)

Question 8

Q

List four different anomalies that can arise when transactions run in parallel without isolation (i.e. without being serialized).

Answer

A

Lost Update: T1 and T2 read the same value at the same time. T2's write then overwrites the modification made by T1.
Dirty Read: T2 reads data that has been modified by another ongoing (uncommitted) transaction.
Non-Repeatable Read: Two reads of the same row in one transaction return different values because another transaction changed the row in between.
Phantom Read: Two reads with the same condition in one transaction return different sets of rows because another transaction inserted new rows between the two reads.
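
A lost update is easy to reproduce by writing out the interleaving by hand. The sketch below is a deterministic simulation with a plain dictionary standing in for a database row; the balance values are made up for illustration.

```python
# Shared "row": an account balance.
db = {"balance": 100}

# Both transactions read before either one writes.
t1_read = db["balance"]
t2_read = db["balance"]

# T1 deposits 50 and writes back.
db["balance"] = t1_read + 50
# T2 withdraws 30 based on its stale read and overwrites T1's write.
db["balance"] = t2_read - 30

lost_update_balance = db["balance"]

# The same two operations executed one after the other, for comparison.
serial = {"balance": 100}
serial["balance"] = serial["balance"] + 50
serial["balance"] = serial["balance"] - 30
serial_balance = serial["balance"]

print(lost_update_balance, serial_balance)
```

With the interleaved schedule T1's deposit disappears, while the serial schedule keeps both changes.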

Question 9

Q

What are different isolation levels defined by the SQL standard?

Answer

A

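READ UNCOMMITTED (all anomalies possible; no locks)
READ COMMITTED (no dirty reads)
REPEATABLE READ* (no dirty or non-repeatable reads; typically locks on rows)
SERIALIZABLE (no anomalies; in lock-based implementations, locks on ranges or the whole table)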
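
The relationship between the four levels and the three read anomalies named in the SQL standard can be captured as a small lookup table. The structure below is only a study aid; the names ALLOWED and levels_preventing are hypothetical, and real databases may behave more strictly than the standard requires at a given level.

```python
# Anomalies that the SQL standard permits at each isolation level,
# listed from weakest to strongest.
ALLOWED = {
    "READ UNCOMMITTED": {"dirty read", "non-repeatable read", "phantom read"},
    "READ COMMITTED": {"non-repeatable read", "phantom read"},
    "REPEATABLE READ": {"phantom read"},
    "SERIALIZABLE": set(),
}

def levels_preventing(anomaly):
    """Return the isolation levels (weakest first) that rule out the anomaly."""
    return [level for level, allowed in ALLOWED.items() if anomaly not in allowed]

print(levels_preventing("dirty read"))
print(levels_preventing("phantom read"))
```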

Question 10

Q

Which database will offer faster reads, Cassandra or MongoDB?

Answer

A

MongoDB, generally. Data in Cassandra, at any given time, can be spread across multiple SSTables on disk and memtables in memory. A read has to fetch and consolidate data from several of these, whereas Mongo has no such requirement. Hence, MongoDB is generally expected to be faster for read operations.
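
The consolidation step on the Cassandra read path can be sketched as follows. This is a simplified model, not Cassandra's code: each store is a dictionary mapping a key to a (timestamp, value) pair, the lsm_read function and the sample data are hypothetical, and real optimizations such as Bloom filters and caches are left out.

```python
def lsm_read(key, memtable, sstables):
    """Collect every stored version of `key` and return the newest value."""
    candidates = []
    for store in [memtable, *sstables]:
        if key in store:
            candidates.append(store[key])  # (timestamp, value)
    if not candidates:
        return None
    # The version with the highest write timestamp wins.
    newest = max(candidates, key=lambda version: version[0])
    return newest[1]

# Hypothetical state: one memtable and two SSTables with overlapping keys.
memtable = {"user:1": (30, "Carol")}
sstables = [
    {"user:1": (10, "Alice"), "user:2": (12, "Bob")},
    {"user:1": (20, "Alicia")},
]

print(lsm_read("user:1", memtable, sstables))
print(lsm_read("user:2", memtable, sstables))
```

A single logical read may touch several stores here, while a store that keeps one copy of each record answers from a single lookup.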

(10 cards)