SD General Flashcards

1
Q

What are the common use cases where event ordering is really important?

A

Payment transactions.
Video conferencing.
Vehicle control systems.
Online gaming.
Sensor metrics publishing and change data capture.

2
Q

Why don't we want to store the database on the application machines?

A

Bottleneck; independent scaling.
Typically the application servers become a bottleneck to the system before the actual data access does, so we want to scale the two tiers independently.

3
Q

What is a write-ahead log and why do we need it?

A

Recovery from crashes - data durability. RAM loses its data after a crash.

This is an append-only file to which every DB modification must be written before it can be committed to the DB. When the database comes back up after a crash, this log is used to restore the DB back to a consistent state.

The write-ahead log lives entirely on disk, so it's durable; we can simply replay the records in the WAL to repopulate the data.

Atomicity - revert partial writes on errors.
Durability and consistency - replay the log and rebuild the state.
———————
In database systems that use Write-Ahead Logging (WAL) and implement row-level locking, the behavior regarding when writes are recorded in the WAL relative to locking can vary depending on the specific database’s implementation and configuration. However, there are common patterns and practices that can be generalized to understand how most systems likely behave.

  1. Acquiring the Lock:
    • Before any write operation (update, delete, or insert) can be applied to a row in a database, the transaction that intends to make the change must first acquire the appropriate lock on that row. This lock ensures that no other concurrent transactions can modify the same row until the lock is released.
  2. Writing to the WAL:
    • Once the lock is successfully acquired and the row is effectively under the control of the current transaction, the changes are first recorded in the WAL. This step is crucial for ensuring data durability and atomicity; the database logs the intention to make changes before any actual modifications are applied to the data files.
    • The WAL entry typically includes information necessary to redo the operation (in case of a crash after the WAL record is written but before the changes are committed) and to undo the operation (in case the transaction needs to be rolled back).
  3. Applying Changes to the Data:
    • After the WAL is updated, and the changes are securely logged, the transaction then applies these changes to the actual data stored in the database. This step is performed while the row lock is still held, preventing other transactions from accessing the row until the current transaction is complete.
  4. Committing the Transaction:
    • The transaction can be committed only after the WAL records are safely written to disk. Once the commit is successful, the locks can be released, allowing other transactions to access or modify the row.
  • Order of Operations: The crucial aspect is that the WAL is written to after acquiring the necessary locks but before the changes are committed and the locks are released. This ensures that the database can always recover to a consistent state, as the WAL contains all the information required to redo or undo transactions up to the last known good state.
  • Handling Deadlocks and Delays: If a lock cannot be acquired because another transaction holds a conflicting lock, the transaction will typically wait until the lock is available. During this waiting period, nothing is yet written to the WAL. If a deadlock is detected (where two or more transactions are waiting on each other indefinitely), one of the transactions will be chosen as a victim and rolled back to resolve the deadlock.
  • Performance Implications: The need to acquire locks before writing to the WAL can impact performance, particularly in high-contention environments where many transactions are competing for the same rows. Database systems often implement sophisticated concurrency control mechanisms to minimize locking delays and optimize transaction throughput.

In summary, in systems using WAL and row-level locking, the write to the WAL occurs after the lock on the row is successfully acquired but before the transaction is committed. This sequence ensures that all changes logged in the WAL are guaranteed to be applied atomically and durably, maintaining the integrity and consistency of the database even in the face of system failures or crashes.
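
A minimal sketch of the WAL idea in Python (hypothetical names; an in-memory dict stands in for the data files), showing append-before-apply and crash recovery by replaying the log:

import json

class TinyKV:
    """Toy key-value store: every write is appended to a WAL before it is applied."""
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}               # in-memory state, lost on crash

    def put(self, key, value):
        record = json.dumps({"op": "put", "key": key, "value": value})
        with open(self.wal_path, "a") as wal:
            wal.write(record + "\n")
            wal.flush()              # a real DB would fsync before acknowledging commit
        self.data[key] = value       # apply only after the log record is durable

    def recover(self):
        """Rebuild in-memory state after a crash by replaying the WAL."""
        self.data = {}
        try:
            with open(self.wal_path) as wal:
                for line in wal:
                    rec = json.loads(line)
                    if rec["op"] == "put":
                        self.data[rec["key"]] = rec["value"]
        except FileNotFoundError:
            pass                     # no log yet means an empty database

# usage sketch: db = TinyKV("wal.log"); db.put("a", 1); after a restart: db.recover()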

4
Q

What are the read and write time complexities of hash indexes? Why are hash indexes only stored in memory? Why are range queries expensive for hash indexes?

A

Redis for key-value retrieval.
Riak for key-document retrieval.
PostgreSQL, MySQL for exact match queries.
Cassandra uses it to map data to nodes using the hash value of partition keys.

Basically a hash map.
O(1) writes and reads, because RAM supports O(1) random access by direct addressing.
Optimized for point queries, where we need to find rows with specific keys.

Subject to the requirement that the entire index fits into memory. RAM is expensive, limited in size, and not durable.

Hashed keys are randomly and evenly distributed and inherently lack order, so we need to scan the entire hash table to find all keys that fall into a range.

Hash indexes are not good on hard disks because the random distribution of data blocks on disk destroys spatial locality: related data is not stored close together, so the disk arm needs to jump around.

Rehashing and redistribution lead to disk I/O overhead.
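
A minimal sketch (hypothetical names, with a bytearray standing in for the on-disk segment) of a hash index over an append-only log: O(1) point lookups via an in-memory dict of byte offsets, while a range query has to scan every key:

class HashIndexedLog:
    """Append-only log segment with an in-memory hash index: key -> byte offset."""
    def __init__(self):
        self.log = bytearray()     # stand-in for the on-disk segment file
        self.index = {}            # hash index kept entirely in RAM

    def put(self, key, value):
        offset = len(self.log)
        entry = f"{key},{value}\n".encode()
        self.log += entry                         # append to the segment
        self.index[key] = (offset, len(entry))    # O(1) index update

    def get(self, key):
        offset, size = self.index[key]            # O(1) lookup, then one "seek"
        line = self.log[offset:offset + size].decode().rstrip("\n")
        return line.split(",", 1)[1]

    def range(self, lo, hi):
        # No key ordering in a hash index: must examine every indexed key.
        return {k: self.get(k) for k in self.index if lo <= k <= hi}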

5
Q

What is the branching factor in a B-tree and how does it affect the time complexity of the B-tree?

A

MySQL, PostgreSQL, Oracle, MongoDB.

The number of children a node can have. Typically it is several hundred, depending on the space needed to store page pointers and range boundaries. Most databases fit into a B-tree of three or four levels. A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
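
A quick back-of-the-envelope check of that capacity claim (a sketch; real B-trees also spend page space on keys and headers):

branching_factor = 500
levels = 4
page_size = 4 * 1024                       # 4 KB pages
reachable_pages = branching_factor ** levels   # page references reachable through a 4-level tree
capacity_bytes = reachable_pages * page_size
print(capacity_bytes)                      # 256_000_000_000_000 bytes, i.e. about 256 TB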

The higher the branching factor, the smaller the height of the tree, which improves read and write efficiency.

However, a large node can increase memory overhead, because keys that act as separators may be kept in memory to help guide lookups to the correct leaf node.

DDIA 81

6
Q

Why do we still need write ahead logs for B-tree indexes?

A

DDIA 82.
Some operations require several pages to be overwritten.
If the node crashes when a page split is only partially done, you end up with a corrupted index.

7
Q

How do we delete a record from log-structured file segments on disk (SSTables or simple unsorted segments)?

A

Append a special deletion record sometimes called a tombstone.

8
Q

LSM tree + SST
Advantages over hash indexes + log-structured files

A

The paper that proposed this idea came out in 1996.

Apache Cassandra - minimize disk I/O and maximize write throughput.
RocksDB, LevelDB, Google Bigtable

The LSM tree's write buffer is kept in memory, just like hash indexes. It can be either a red-black tree or an AVL tree, and it is also called the memtable.

When the memtable gets full (a few MB), do an in-order traversal and write it to disk as a sorted string table (SSTable). Every time we flush the memtable to an SSTable, we can discard the corresponding WAL.
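
A minimal sketch (hypothetical names, using a plain dict in place of the balanced tree) of flushing a memtable to an SSTable file: keys come out in sorted order, so later reads can binary-search or use a sparse index:

def flush_memtable(memtable, sstable_path):
    """Write the in-memory table to disk as a sorted string table (SSTable)."""
    with open(sstable_path, "w") as sstable:
        for key in sorted(memtable):               # in-order traversal of the memtable
            sstable.write(f"{key}\t{memtable[key]}\n")
    memtable.clear()                               # the memtable (and its WAL) can now be discarded

# usage sketch:
# memtable = {"banana": "2", "apple": "1", "cherry": "3"}
# flush_memtable(memtable, "segment-0001.sst")     # file contains apple, banana, cherry in order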

In Lucene, an indexing engine, the mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed.

DDIA 75

9
Q

LSM tree and SSTable optimization

A

Sparse index - keeps the byte offsets of some of the keys in the log segment, so a lookup only needs to scan the small range between two indexed keys.

Bloom filters - approximate which keys do not appear in the database. They never give a false negative: if the filter says a key does not exist, it definitely does not exist.
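
A minimal Bloom filter sketch (hypothetical parameters), illustrating why it can say "definitely absent" but only "maybe present":

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means the key was definitely never added (no false negatives);
        # True means "maybe" (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))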

10
Q

Why is LSM tree + SST a middle ground between hash indexes and B-trees?

A

Not bounded by memory like hash indexes, and it can serve range queries that hash indexes are bad at.
Better write performance than a B-tree, while read performance is not as good as a B-tree.
It requires extra CPU bandwidth for compaction.

11
Q

What is an address index? Why do we need it sometimes?

A

An index that maps keys to disk addresses, like the sparse index we discussed for SSTables.

It's a type of non-clustered index: rather than storing the entire row (and duplicating data when we have indexes on different columns), it stores a pointer to the row's location, minimizing the data we store.

12
Q

What is multi-dimensional index and why do we need it?

A

Like a combo index: a composite index over multiple columns.

Example: geospatial dataset, latitude and longitude.

For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s Generalized Search Tree indexing facility.

An interesting idea is that multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce website you could use a three-dimensional index on the dimensions (red, green, blue) to search for products in a certain range of colors, or in a database of weather observations you could have a two-dimensional index on (date, temperature) in order to efficiently search for all the observations during the year 2013 where the temperature was between 25 and 30℃. With a one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex.

DDIA 87

13
Q

What is atomicity

A

All-or-nothing principle. Either all writes in a transaction succeed or none of them do.

Example - exchange of money

DDIA 223

14
Q

What is database consistency

A

If the transaction fails, it fails gracefully and no invariants are broken - foreign key constraints, unique constraints, and check constraints.
Check constraints - specific rules beyond basic data type constraints; business rules encoded within the database schema.
For example, an employee's hiring date is always later than their date of birth.
In accounting, credits and debits across all accounts should be balanced.

Atomicity focuses on transactions being executed as indivisible units;
consistency focuses mainly on maintaining the integrity of the database in line with its invariants, rules, and constraints.

The application relies on the database's atomicity and isolation properties in order to achieve consistency, but consistency is not up to the database alone.

DDIA 225

15
Q

What is isolation

A

No race conditions.

Concurrently executing transactions are isolated from each other.

DDIA 225

16
Q

What is Durability

A

No data loss.
Once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or database crashes.

DDIA 226

17
Q

What are dirty writes? Give an example.

A

Overwriting uncommitted data.

Two transactions concurrently try to update the same object in a database: the earlier write is part of a transaction that has not yet committed, and the later write overwrites the uncommitted value.

DDIA 235

18
Q

What is dirty read?

A

A transaction has not yet committed or aborted, and another transaction sees its uncommitted data.

DDIA 234

—————————

Dirty reads are problematic because they can lead to inconsistencies, unreliability, and potential data corruption in a database. When a transaction reads data that has been modified but not yet committed by another transaction, it risks basing its operations on temporary, potentially invalid data. This can result in several issues:

  1. Inconsistencies:
    • If the transaction that made the uncommitted change is rolled back, the dirty read transaction would have read and possibly acted on data that never becomes permanent.
  2. Cascading Rollbacks:
    • If a dirty read occurs and the initial transaction is rolled back, subsequent transactions that read the uncommitted data might also need to be rolled back, causing a cascade of rollbacks.
  3. Unreliability:
    • Applications relying on database transactions may become unreliable if they frequently operate on data that is subject to change. This unpredictability can lead to unexpected behavior and errors.
  4. Potential Data Corruption:
    • In some cases, decisions made based on dirty reads can lead to corrupt data states that are difficult to detect and correct.

Scenario: Bank Account Transfers

  • Transaction T1: User A transfers $500 from their checking account to their savings account.
    • Step 1: Deduct $500 from checking account (uncommitted).
    • Step 2: Add $500 to savings account (uncommitted).
  • Transaction T2: User A checks their total balance across both accounts.
    • Reads checking account balance.
    • Reads savings account balance.

Problem:
- Dirty Read: Transaction T2 reads the balance after the deduction but before the addition.
- Checking account balance: $500 (after deduction).
- Savings account balance: $1000 (before addition).
- Total balance seen by the user: $500 + $1000 = $1500.
- T1 is rolled back due to an error.
- Actual balance should remain $1000 + $1000 = $2000.
- User sees an incorrect total balance of $1500, leading to confusion and possible incorrect financial decisions.

Scenario: E-commerce Inventory System

  • Transaction T1: An item is sold, reducing inventory.
    • Step 1: Reduce item inventory by 1 (uncommitted).
    • Step 2: Confirm sale and commit changes.
  • Transaction T2: A customer checks the availability of the item.
    • Reads the current inventory level.

Problem:
- Dirty Read: Transaction T2 reads the inventory level after the reduction but before the transaction is committed.
- Inventory before sale: 10 units.
- Inventory after uncommitted reduction: 9 units.
- T1 fails and rolls back, keeping the inventory at 10 units.
- T2 reports 9 units available.
- Customer places an order based on incorrect inventory information, potentially leading to overselling and customer dissatisfaction.

Scenario: Real-Time Data Analytics

  • Transaction T1: Data from multiple sensors is aggregated and temporarily stored for real-time analytics.
    • Step 1: Insert sensor data (uncommitted).
    • Step 2: Calculate aggregate values (uncommitted).
  • Transaction T2: A report is generated based on the aggregated data.
    • Reads aggregated data.

Problem:
- Dirty Read: Transaction T2 reads the aggregated data that includes uncommitted sensor data.
- Sensor data before aggregation: 50 readings.
- Aggregated result after uncommitted inserts: 60 readings.
- T1 fails, and the uncommitted sensor data is rolled back.
- Report generated shows incorrect aggregated data, leading to inaccurate analytics and potentially flawed business decisions.

To avoid dirty reads, databases can employ higher isolation levels that prevent transactions from reading uncommitted data. Common isolation levels that prevent dirty reads include:

  1. Read Committed:
    • Transactions can only read data that has been committed by other transactions. This isolation level ensures that dirty reads do not occur.
  2. Repeatable Read:
    • Ensures that if a transaction reads a row, it will see the same data if it reads that row again within the same transaction, preventing non-repeatable reads as well as dirty reads.
  3. Serializable:
    • Provides the strictest level of isolation, ensuring that transactions are completely isolated from each other and preventing all types of read anomalies, including dirty reads, non-repeatable reads, and phantom reads.

Dirty reads can lead to various issues, including data inconsistencies, unreliable applications, cascading rollbacks, and potential data corruption. By using appropriate isolation levels, databases can prevent dirty reads and ensure data integrity and consistency.

19
Q

What is the downside of using the same row-level locks to prevent dirty reads? What is the common approach used in practice?

A

A long-running write transaction can force read-only transactions to wait until that long-running transaction has completed.

The commonly used approach is that the database remembers both the old committed value and the new value written by the transaction that holds the lock. Read requests are given the old value, so writers do not block readers.

DDIA 237

20
Q

How do we deal with deadlocks on row-level locks?

A

The database automatically detects deadlocks and aborts one of the transactions so that the others can make progress. The aborted transaction needs to be retried by the application.

DDIA 258

21
Q

What does it mean that a database implements read committed isolation

A

A DB that prevents both dirty writes and dirty reads.

DDIA 236

22
Q

What is read skew? Give an example

A

Also called non-repeatable read.

We are reading only committed data, but a write executes in the middle of a large read, so some invariant no longer appears to hold across the values we read.

Example: a long-running analytical read or backup.

DDIA 238

23
Q

How do we resolve a read skew problem

A

Snapshot isolation - DDIA 238

Assign a monotonically increasing number to each transaction. Keep the overwritten data, tagged with the transaction number that wrote it.

The transaction sees all the data that was committed at the start of the transaction; writes by subsequent transactions shouldn't affect what it sees.

A key principle is writers never block readers and vice versa.

24
Q

What are lost updates?

A

Example: incrementing a counter - a read-modify-write cycle.
Prevention: atomic write operations (or explicit locking / compare-and-set).
DDIA 243
———————

A lost update occurs when two transactions read the same data and then both attempt to update it. The changes made by the first transaction are overwritten by the second transaction, resulting in a loss of the first update. This anomaly typically happens in a scenario where two transactions are trying to write to the same data item without proper synchronization.

  1. Transaction 1 (T1) reads a value of 10 from a row (e.g., account balance).
  2. Transaction 2 (T2) reads the same value of 10 from the same row.
  3. T1 updates the value to 15 and commits.
  4. T2 updates the value to 20 and commits.
  5. The final value is 20, and the update made by T1 is lost.

Lost updates are not inherently prevented by read committed isolation. They can be prevented with atomic write operations, explicit locking (SELECT ... FOR UPDATE), compare-and-set, or automatic lost-update detection in repeatable read / snapshot isolation implementations.
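
A minimal sketch of the anomaly and of one mitigation, a compare-and-set retry loop (hypothetical in-memory "database"; a real DB would expose this as UPDATE ... SET x = x + 1, SELECT ... FOR UPDATE, or native CAS support):

db = {"counter": 10}

def lost_update_demo():
    # Both "transactions" read the same value, then write blindly.
    t1_value = db["counter"]          # T1 reads 10
    t2_value = db["counter"]          # T2 reads 10
    db["counter"] = t1_value + 5      # T1 writes 15
    db["counter"] = t2_value + 10     # T2 writes 20 -- T1's update is lost

def compare_and_set(key, expected, new):
    """Atomic in this single-threaded sketch; a real DB enforces atomicity internally."""
    if db[key] == expected:
        db[key] = new
        return True
    return False

def safe_increment(key, delta):
    while True:                       # retry until our read is still current at write time
        current = db[key]
        if compare_and_set(key, current, current + delta):
            return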

25
Q

What is write skew and how do we prevent it?

A

A database invariant is broken by concurrent transactions updating different rows.

Solution - hold locks on all rows whose values the invariant depends on.

DDIA 247

26
Q

What are phantom writes and how do we prevent them?

A

Two writes that create new rows which conflict with each other.

A write in one transaction changes the results of a search query in another transaction. There is no object to which we can attach the locks.

Prevention: materializing conflicts (or index-range locks).

DDIA 251

27
Q

What is serializability in database

A

When transactions are committed, the result is the same as if they had run serially, even though in reality they may have run concurrently.

Serializability in databases ensures that regardless of how transactions are interleaved in their execution, their effect on the database state is as if they had been executed in some serial order, thus preserving the consistency and correctness of the data. This level of isolation is critical for applications requiring strong consistency guarantees, such as financial services, where transaction integrity is paramount.

DDIA 225

28
Q

What is actual serial execution and why do we use it

A

Feasible since 2007.

Run everything on one core with one thread, sequentially. Since around 2007 this has been feasible because RAM became cheap enough that we can sometimes fit the entire dataset in memory.

OLTP transactions are usually short and only make a small number of reads and writes. Long-running OLAP queries can run on a consistent snapshot (using snapshot isolation) outside of the serial execution loop.
Sometimes single-threaded execution runs faster than a system that supports concurrency because it gets rid of locking overhead. However, the throughput is limited to a single CPU core and is therefore not scalable.

VoltDB/H-store

DDIA 253

29
Q

What are stored procedures and why do we use them? What are their shortcomings?

A

Part of SQL standard since 1999.

To minimize human interaction and network round-trip overhead within a transaction.

Systems with single-threaded serial transaction processing don't allow interactive multi-statement transactions (multiple round trips between application and database).

The application must submit the entire transaction code to the database ahead of time, as a stored procedure. The code is hard to version control and deploy.

DDIA 254

30
Q

What is two-phase locking?

A

MySQL. In use since the 1970s.

Why do we need it? To handle concurrent reads and writes to the same piece of data (even in a single-node system) so that we don't have dirty reads, dirty writes, or phantom reads/writes.

Read locks are shared but write locks are exclusive. A writer basically blocks all other writers and readers.

Locks are ALWAYS held until the transaction commits.

It was the only widely used algorithm for serializability for around 30 years, but its big downside is performance.

———————

Two-phase locking (2PL) is a concurrency control method used to ensure serializability, the highest level of isolation in database systems.

Serializability ensures that the execution of transactions is equivalent to some serial order, thereby preventing all types of read and write anomalies, including dirty reads, non-repeatable reads, and phantom reads.

2PL consists of two phases: the growing phase (lock acquisition) and the shrinking phase (lock release).

Phantom reads occur when a transaction retrieves a set of records that satisfy a condition, but during the course of the transaction, another transaction inserts or deletes records that would also satisfy that condition, causing the first transaction to see a different set of records upon re-execution.

Here’s how two-phase locking prevents phantom reads:

  1. Locks on Read Operation: When a transaction reads a set of records that satisfy a condition, two-phase locking ensures that the transaction acquires shared locks (read locks) on all records that satisfy the condition. These locks prevent other transactions from inserting or deleting records that would affect the set of records being read by the current transaction.
  2. Locks on Write Operation: When a transaction inserts or deletes records, it acquires exclusive locks (write locks) on those records. These locks prevent other transactions from reading or writing to the affected records until the transaction holding the locks completes and releases them.

By ensuring that transactions acquire appropriate locks (shared or exclusive) on all records they access or modify during the growing phase and release those locks only during the shrinking phase, two-phase locking prevents other transactions from inserting or deleting records that would cause phantom reads.

In essence, two-phase locking guarantees that the set of records a transaction reads remains unchanged throughout its execution, thereby preventing the occurrence of phantom reads.

In the context of preventing phantom reads, two-phase locking (2PL) does involve the use of predicate locks, though the term “predicate lock” itself is more commonly associated with other concurrency control mechanisms like predicate-based locking.

Here’s how two-phase locking relates to predicate locks in preventing phantom reads:

  1. Predicate Locks in Two-Phase Locking: In two-phase locking, transactions acquire locks on individual data items (records) either in a shared (read) mode or exclusive (write) mode. When a transaction performs a read operation with a condition (predicate), such as a WHERE clause in SQL, it acquires shared locks on all records that satisfy the condition. These shared locks effectively act as predicate locks because they protect the set of records that satisfy the condition from being modified (inserted or deleted) by other transactions.
  2. Preventing Phantom Reads: Phantom reads occur when a transaction re-executes a query and finds a different set of records due to insertions or deletions by other transactions. By acquiring shared locks on all records that match a predicate during the read operation, two-phase locking ensures that the set of records remains consistent for the duration of the transaction. This prevents other transactions from inserting or deleting records that would affect the results of the current transaction’s query, thereby preventing phantom reads.

In summary, while two-phase locking is not typically described using the term “predicate locks” in the same way as other concurrency control mechanisms, the concept of acquiring shared locks on records that satisfy a predicate condition effectively serves the purpose of preventing phantom reads by ensuring consistent query results across transaction executions.
———————

DDIA 257

31
Q

What are predicate locks?

A

Rather than a lock on a specific object, it is a lock on a pattern - on all objects matching certain query conditions.

Pretty slow, because checking for conflicts means evaluating the predicate against other transactions' reads and writes.

If two-phase locking includes predicate locks, the database prevents all forms of write skew and other race conditions, so its isolation becomes serializable.

DDIA 259

32
Q

What are index-range locks?

A

They simplify predicate locks by locking a greater set of objects (for example, an entire index range) to improve performance.

DDIA 259

33
Q

What is database snapshot

A

A consistent state of a database at some point of time.

34
Q

What is serializable snapshot isolation and when should we use it?

A

Introduced in 2008. Used by PostgreSQL, Oracle, and Microsoft SQL Server.

An optimistic concurrency control technique.

Transactions continue anyway, in the hope that everything will turn out all right. When a transaction wants to commit, the database checks whether anything bad happened; only transactions that executed serializably are allowed to commit.

Use it only when most transactions do not get in each other's way. Don't use it if there are too many conflicting transactions.

35
Q

Column oriented storage
Definition
Use case

A

Parquet, released in 2013, is a free and open-source column-oriented storage format that supports a document data model.

The idea is pretty simple - don't store all the values from one row together; store all the values from each column together instead. A query then only needs to read and parse the columns used in that query, which can save a lot of work and minimize disk I/O.

We can apply column compression to reduce the amount of data we store; we can even keep and cache that data in memory to speed up queries.

Downside - every write needs to go to many different places on disk.

We can write rows into an in-memory LSM tree and flush them to disk in bulk as column-oriented files.

36
Q

Column storage - bitmap compression + run-length encoding

A

Often, the number of distinct values in a column is small compared to the number of rows (for example, a retailer may have billions of sales transactions, but only 100,000 distinct products). We can now take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.

If n is very small (for example, a country column may have approximately 200 distinct values), those bitmaps can be stored with one bit per row. But if n is bigger, there will be a lot of zeros in most of the bitmaps (we say that they are sparse). In that case, the bitmaps can additionally be run-length encoded, which can make the encoding of a column remarkably compact.

Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data warehouse. For example:

WHERE product_sk IN (30, 68, 69):

Load the three bitmaps for product_sk = 30, product_sk = 68, and product_sk = 69, and calculate the bitwise OR of the three bitmaps, which can be done very efficiently.
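
A small sketch of the idea (the column values are made-up example data): one bitmap per distinct value, a bitwise OR for the IN query, and a simple run-length encoding of a sparse bitmap:

product_sk = [69, 69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]

def bitmaps(column):
    """One bitmap (list of 0/1) per distinct value, one bit per row."""
    return {v: [1 if x == v else 0 for x in column] for v in set(column)}

def rows_in(column, wanted):
    """WHERE product_sk IN (...) via a bitwise OR of the per-value bitmaps."""
    maps = bitmaps(column)
    return [i for i in range(len(column))
            if any(maps[v][i] for v in wanted if v in maps)]

def run_length_encode(bitmap):
    """Encode as (bit, run length) pairs - compact when the bitmap is sparse."""
    runs, current, count = [], bitmap[0], 0
    for bit in bitmap:
        if bit == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = bit, 1
    runs.append((current, count))
    return runs

# rows_in(product_sk, {30, 68, 69}) -> row numbers matching the IN clause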

37
Q

Predicate push down

A

A database optimization technique that improves query performance by filtering data as early as possible, before it is transferred or loaded.

The entire dataset doesn't need to be scanned and materialized during query execution.

Example

Consider a database query intended to find all transactions over $1,000 in a financial database:

SELECT * FROM transactions WHERE amount > 1000;

With predicate pushdown, the database engine would apply the filter amount > 1000 directly when reading the data from disk, rather than fetching all transactions into memory and then filtering them.

Conclusion

Predicate pushdown is a critical optimization in the management and processing of data across various platforms, from traditional RDBMS to modern big data systems. By ensuring that unnecessary data is filtered out early in the data processing pipeline, systems can achieve higher performance and efficiency.

38
Q

What are the downsides of column oriented storage?

A

Rows must be kept in the same sorted order across all the column files.
Every row write has to touch many different column files on disk.
Mitigation: buffer writes in an in-memory LSM tree and flush to disk as bulk, column-oriented writes.

39
Q

What is forward and backward compatibility

A

Forward: old code can read records that were written by new code.

Backward - new code can read records written by old code. If you were to add a field and make it required, that check would fail if new code read data written by old code, because the old code will not have written the new field that you added. Therefore, to maintain backward compatibility, every field you add after the initial deployment of the schema must be optional or have a default value (automatic backfill or manual backfill)

Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it (if necessary by simply keeping the old code to read the old data). Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.

DDIA 121

40
Q

What are Apache Thrift and Protocol buffers

A

Apache Thrift and Protocol Buffers (protobuf) are binary encoding libraries that are based on the same principle. Protocol Buffers was originally developed at Google, Thrift was originally developed at Facebook, and both were made open source in 2007–08.

Both Thrift and Protocol Buffers require a schema for any data that is encoded. To encode the data in Thrift, you would describe the schema in the Thrift interface definition language (IDL) like this:

struct Person {

1: required string userName,
2: optional i64 favoriteNumber,
3: optional list<string> interests

}

Each field has a type annotation (to indicate whether it is a string, integer, list, etc.)
and, where required, a length indication (length of a string, number of items in a list).
The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded as ASCII (or rather, UTF-8).

DDIA 117

41
Q

What are the disadvantages of Thrift or Protocol Buffers?

A

They require knowledge of the schema at compile time.

42
Q

What is Apache Avro?

A

Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers and Thrift. It was started in 2009 as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop’s use cases.

It is friendlier to dynamically generated schemas - for example, you can generate an Avro schema on the fly from a database table's column names.

Binary data can only be decoded correctly if the code reading the data is using the exact same schema as the code that wrote the data.

To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field.

The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible. When data is decoded (read), the Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema.

For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a different order, because the schema resolution matches up the fields by field name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does not contain a field of that name, it is filled in with a default value declared in the reader’s schema.

DDIA 122

43
Q

Why do we need replication

A

Replication means keeping a copy of the same data on multiple machines that are connected via a network. As discussed in the introduction to Part II, there are several reasons why you might want to replicate data:

• To keep data geographically close to your users (and thus reduce latency)

• To allow the system to continue working even if some of its parts have failed (and thus increase availability)

• To scale out the number of machines that can serve read queries (and thus increase read throughput)

Page 151

44
Q

What is synchronous and asynchronous replication

A

The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure that the data is still available on the follower.

The disadvantage is that if the synchronous follower doesn’t respond (because it has crashed, or there is a network fault, or for any other reason), the write cannot be processed. The leader must block all writes and wait until the synchronous replica is available again.

For that reason, it is impractical for all followers to be synchronous: any one node outage would cause the whole system to grind to a halt. In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous.

If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower. This configuration is sometimes also called semi-synchronous.

Often, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the client.

However, a fully asynchronous configuration has the advantage that the leader can continue processing writes, even if all of its followers have fallen behind. Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed.

DDIA 153

45
Q

How do we implement replication?

A

Logical (row-based) log replication

An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals.

This kind of replication log is called a logical log, to distinguish it from the storage engine’s (physical) data representation.

A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row:

• For an inserted row, the log contains the new values of all columns.

• For a deleted row, the log contains enough information to uniquely identify the row that was deleted. Typically this would be the primary key, but if there is no primary key on the table, the old values of all columns need to be logged.

• For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns (or at least the new values of all columns that changed).
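
A sketch of what hypothetical logical (row-based) log records might look like; the field names are illustrative, not any specific database's format:

logical_log = [
    {"lsn": 101, "op": "insert", "table": "users",
     "row": {"id": 42, "name": "Ada", "email": "ada@example.com"}},
    {"lsn": 102, "op": "update", "table": "users",
     "key": {"id": 42},                         # enough to uniquely identify the row
     "changes": {"email": "ada@newmail.com"}},  # new values of the changed columns
    {"lsn": 103, "op": "delete", "table": "users",
     "key": {"id": 42}},
]

# A follower (or an external consumer doing change data capture) applies these
# records in sequence-number order, without needing to know the leader's
# storage engine internals.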

You know it. Find photos.

DDIA 159

What’s the issue with WAL as replication log?

The main disadvantage is that the log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks. This makes replication closely coupled to the storage engine. If the database changes its storage format from one version to another, it is typically not possible to run different versions of the database software on the leader and the followers.

That may seem like a minor implementation detail, but it can have a big operational impact. If the replication protocol allows the follower to use a newer software version than the leader, you can perform a zero-downtime upgrade of the database software by first upgrading the followers and then performing a failover to make one of the upgraded nodes the new leader. If the replication protocol does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require downtime.

46
Q

How do we implement read-after-write consistency

A

How can we implement read-after-write consistency in a system with leader-based replication? There are various possible techniques. To mention a few:

• When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower. This requires that you have some way of knowing whether something might have been modified, without actually querying it. For example, user profile information on a social network is normally only editable by the owner of the profile, not by anybody else. Thus, a simple rule is: always read the user’s own profile from the leader, and any other users’ profiles from a follower.

• If most things in the application are potentially editable by the user, that approach won’t be effective, as most things would have to be read from the leader (negating the benefit of read scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader.

• The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up. The timestamp could be a logical timestamp (something that indicates ordering of writes, such as the log sequence number) or the actual system clock
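
A minimal sketch of the last approach (client-remembered write timestamp); the Replica class and its replicated_up_to watermark are hypothetical stand-ins for real replicas:

import time

class Replica:
    def __init__(self):
        self.data = {}
        self.replicated_up_to = 0.0          # timestamp of the last replicated write

    def write(self, key, value, ts):
        self.data[key] = value
        self.replicated_up_to = ts

    def read(self, key):
        return self.data.get(key)

class Client:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.last_write_ts = 0.0

    def write(self, key, value):
        self.last_write_ts = time.time()
        self.leader.write(key, value, ts=self.last_write_ts)

    def read(self, key):
        # Prefer a follower that has replicated past our own last write;
        # otherwise fall back to the leader to guarantee read-after-write.
        for replica in self.followers:
            if replica.replicated_up_to >= self.last_write_ts:
                return replica.read(key)
        return self.leader.read(key)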

47
Q

What are monotonic reads and what anomaly do they prevent?

A

In the context of single-leader, asynchronous replication: a client may see things move backward in time when consecutive reads are served by replicas with different replication lag.

Monotonic reads is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go backward— i.e., they will not read older data after having previously read newer data.

One way of achieving monotonic reads is to make sure that each user always makes their reads from the same replica (different users can read from different replicas). For example, the replica can be chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the user’s queries will need to be rerouted to another replica.
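
A tiny sketch of replica pinning by user ID, using a stable hash (Python's built-in hash() is randomized per process, so it is not suitable here):

import hashlib

def replica_for_user(user_id, replicas):
    """Always route a given user's reads to the same replica."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

# replica_for_user("user-123", ["replica-a", "replica-b", "replica-c"]) is stable
# across calls, so a user never reads from a less caught-up replica after having
# read from a more caught-up one (unless that replica fails and reads are rerouted).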

DDIA 165

48
Q

Give an example of violation of causality.

A

Example from DDIA 165: an observer sees the answer to a question before seeing the question itself, because the answer (which causally depends on the question) was replicated to the observer's replica faster than the question was.

49
Q

What is consistent prefix reads? How is it implemented?

A

All causally dependent writes go to the same partition.

Preventing this kind of anomaly requires another type of guarantee: consistent prefix reads. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

This is a particular problem in partitioned (sharded) databases. If the database always applies writes in the same order, reads always see a consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different partitions operate independently, so there is no global ordering of writes: when a user reads from the database, they may see some parts of the database in an older state and some in a newer state.

One solution is to make sure that any writes that are causally related to each other are written to the same partition—but in some applications that cannot be done efficiently.

DDIA 166

50
Q

What are the challenges with leader failure?

A
  1. How to detect that the leader is down - timeout settings
  2. Lost writes - the leader goes down before new writes are propagated to any of the followers
  3. Split brain
51
Q

What are the topologies for multi-leader replication?

A

Circular, star, and all-to-all topologies. DDIA 175

52
Q

What are the problems of leaderless replication via a gossip protocol?

A
  1. It is expensive to ensure causality using Lamport clocks, or to detect concurrent writes using version vectors and resolve the conflicts.
  2. Potential infinite loops of broadcast messages - for a specific write, each node must maintain a set of already-visited nodes

DDIA 177

53
Q

Last write wins to resolve write conflicts

A

DDIA 186

What timestamps can we use?
Physical clocks rely on quartz crystal vibration and are subject to clock skew.
NTP (Network Time Protocol), backed by GPS clocks, periodically adjusts the computer clock.

54
Q

What are concurrent writes in a multi-leader system?

A

The writes don't know about each other - neither happens before the other.

55
Q

How do we detect concurrent writes in a multi-leader system?

A

Version vector - for each replica, a counter of the writes it has seen from each leader node. Two writes are concurrent if neither version vector dominates the other (neither is greater than or equal in every component).
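
A small sketch of comparing version vectors (maps from node ID to a per-node write counter) to classify two writes:

def dominates(a, b):
    """True if version vector a has seen everything b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def classify(a, b):
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a happened after b"
    if dominates(b, a):
        return "b happened after a"
    return "concurrent"   # neither dominates: siblings to be merged or resolved

# classify({"leader1": 2, "leader2": 1}, {"leader1": 1, "leader2": 2}) -> "concurrent"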

56
Q

What are the other solutions besides LWW to mitigate the concurrent-writes problem?

A

One option is to store siblings and let the user decide later. As merging siblings in application code is complex and error-prone, there are some efforts to design data structures that can perform this merging automatically. For example, Riak's datatype support uses a family of data structures called CRDTs that can automatically merge siblings in sensible ways, including preserving deletions.

CRDT - conflict-free replicated data types, formally introduced in 2011 and now used by Riak, Redis, and AntidoteDB.

57
Q

What is an operation-based CRDT?

A

When we say that “updates are propagated as operations” in the context of operation-based CRDTs (CmRDTs), we’re referring to the method by which changes to data are communicated across different nodes in a distributed system.

Here's a detailed breakdown of what this entails.

Process of propagating updates as operations:

1.	Operation Encapsulation: Each change made to the data is encapsulated as a discrete operation. This operation includes all the information necessary to replicate the change on other nodes. For instance, if you’re incrementing a counter, the operation might specify that the counter should be incremented by one.

2.	Transmission: Instead of sending the entire state of the data structure (which could be large), only the operation itself is sent across the network to other nodes. This is typically more efficient, especially when the data structure is large but the changes are small.

3.	Commutative Operations: Operations are designed to be commutative. This means that the order in which operations are applied does not affect the final outcome. For example, if two nodes simultaneously increment the same counter, the operations “increment by 1” can be applied in any order but still yield the same result.

Example

Consider a collaborative text editing application where multiple users can insert or delete characters in a document:

•	User A inserts the letter “a” at position 1.
•	User B deletes the character at position 2.

Each of these actions (insertion and deletion) is an operation. These operations are sent to all other participating nodes:

•	The insertion operation from User A might be encoded as insert('a', 1).
•	The deletion operation from User B might be encoded as delete(2).

When another node receives these operations, it applies them to its local version of the document. Because the operations are commutative and designed to be independent of specific states, they can be applied in any order without resulting in conflicts or inconsistencies.
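
A minimal operation-based counter sketch (hypothetical names): the operations are commutative increments, so any delivery order converges, assuming each operation is delivered exactly once to every replica:

class OpBasedCounter:
    """Operation-based (CmRDT) counter: replicas exchange 'inc' operations."""
    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        op = ("inc", amount)       # a local update produces an operation...
        self.apply(op)
        return op                  # ...which would then be broadcast to other replicas

    def apply(self, op):
        kind, amount = op
        if kind == "inc":
            self.value += amount

# Two replicas applying the same set of operations in different orders converge:
r1, r2 = OpBasedCounter(), OpBasedCounter()
ops = [r1.increment(1), r1.increment(2)]
for op in reversed(ops):           # delivered out of order at replica 2
    r2.apply(op)
assert r1.value == r2.value == 3   # commutativity => same final state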

Benefits

This approach minimizes the amount of data transferred between nodes, reduces latency in response times, and ensures that even if messages arrive out of order or are delayed, the final state of the data structure remains consistent across all nodes. This makes operation-based CRDTs highly effective for real-time, distributed applications where low latency and high reliability are crucial.

However, whenever we have causal relationships amongst our operations, an operation-based CRDT could let us down unless operations are delivered in causal order.

58
Q

State-based CRDT

A

Updates are managed by sharing the entire state of the data structure rather than individual operations.

Each node independently updates its local state, and periodically, the entire state is merged with states from other nodes using a merge operation that is associative, commutative, and idempotent. This ensures that all nodes eventually converge to the same state, regardless of the order in which states are merged.

A G-Counter is a simple CRDT used to implement a distributed counter that can only increase.

  • Local Update: Each node maintains its own counter and increments its local count independently without coordination with other nodes.
  • Merging: When nodes communicate, they share their entire counter state. The merge operation for a G-Counter takes the maximum of each node’s count for each node in the system, effectively synchronizing the counts across all nodes to the highest value observed.
  • Result: The result is a counter that always increases and never decreases, providing a reliable count even in the presence of network partitions and asynchronous updates.

State-based CRDTs like the G-Counter are particularly useful in distributed systems where it’s important to ensure that all nodes can operate independently and still maintain consistency across the system without complex conflict resolution mechanisms.
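
A minimal state-based G-Counter sketch: each node increments only its own slot, and merge takes the element-wise maximum, which is associative, commutative, and idempotent:

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                      # node_id -> count observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Element-wise max: applying it in any order, any number of times, converges."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# usage sketch: a = GCounter("a"); b = GCounter("b"); a.increment(3); b.increment(2)
# a.merge(b); b.merge(a)  -> both now report value() == 5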

State-based CRDTs handle causal relationships by ensuring that the merge operation can combine states from different nodes in a way that reflects the causal order of updates, even though individual nodes might apply updates independently and asynchronously. For operations and data structures whose updates are not as naturally mergeable as a G-Counter's, CRDTs must incorporate additional mechanisms to track and reconcile the order of operations. One common approach is to use version vectors or other metadata that capture the causal relationships between updates.

59
Q

What databases use leaderless replication?

A

Cassandra, Riak

60
Q

What is the context of read repair?

A

Read repair is a technique used in distributed data storage systems to ensure data consistency across different replicas of the same data. This process is particularly important in systems that employ replication to provide high availability and fault tolerance. Read repair helps in keeping the replicas synchronized by actively correcting discrepancies during read operations.

Read repair operates under eventual consistency models, where data is replicated across multiple nodes to ensure availability and resilience against node failures. Here’s how read repair typically functions:

  1. Data Retrieval: When a read request is issued to a distributed database, it may retrieve data from one or more replicas, depending on the consistency requirements specified for the read operation.
  2. Version Comparison: The system compares the versions of the data retrieved from different replicas. This can be done by checking timestamps, version numbers, or other metadata associated with the data to determine which replica has the most recent version.
  3. Discrepancy Detection: If discrepancies are found (i.e., some replicas have older versions of the data), the system identifies the most up-to-date version of the data.
  4. Correction: The system then propagates the correct version back to those replicas that had outdated or incorrect versions, thus “repairing” the data on those nodes.
  5. Update of Replicas: The outdated replicas update their data to reflect the most recent version, ensuring consistency across the system.

Read repair can be triggered in a few different ways, depending on the system’s design and configuration:

  • Synchronous Read Repair: This happens as part of the read process. The system checks and repairs any inconsistencies before returning the data to the requester. This ensures strong consistency but can increase the latency of read operations because the system must wait until all necessary repairs are made.
  • Asynchronous Read Repair: In this approach, the system logs the found discrepancies during the read operation but does not immediately correct them. Instead, the repairs are done in the background, allowing the read operation to complete with lower latency. This method favors availability and performance over immediate consistency.

Read repair is widely used in various distributed data stores, including:

  • Cassandra: Uses read repair as part of its normal operations to ensure that all replicas for a given piece of data remain consistent over time. Cassandra allows configuring the probability of read repair happening on a per-read basis.
  • DynamoDB: Amazon’s DynamoDB uses a form of read repair in its background processes to maintain data consistency across its distributed storage.
  • Riak: Also implements read repair to handle discrepancies found during read operations, ensuring that data returned to the user is the most current and that all replicas are eventually consistent.

Read repair is a crucial mechanism in distributed databases to maintain data consistency across replicas. By correcting discrepancies during read operations, read repair helps ensure that all replicas converge towards a consistent state, thereby fulfilling the promises of reliability and availability in distributed systems. This technique is particularly valuable in environments where data is continuously updated and accessed across multiple nodes.

61
Q

What is the context of anti-entropy and how is it implemented

A

Anti-entropy is a concept used in distributed systems to ensure data consistency across replicas. It involves mechanisms that detect and correct discrepancies in data at different nodes, effectively reducing “entropy” which in this context refers to disorder or inconsistencies in the system. Anti-entropy processes help in synchronizing data across a distributed network by continuously or periodically comparing data sets and resolving differences.

In distributed data storage systems, replicas of the same data may become inconsistent due to network failures, temporary partitioning, or delays in update propagation. Anti-entropy is a proactive approach to handle these potential inconsistencies and ensure that all replicas converge to the same state, even in the face of failures.

  1. Merkle Trees:
    • Concept: A Merkle tree is a binary tree in which every leaf node is labelled with the hash of a data block, and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. Merkle trees are used to efficiently compare data sets across different nodes.
    • Usage: When two nodes need to synchronize their data, they can exchange the root hashes of their Merkle trees. If the hashes differ, they can compare smaller segments of their trees (i.e., branches) progressively, until they localize the blocks that differ. This process significantly reduces the amount of data that needs to be transferred for synchronization.
    • Example: Cassandra uses Merkle trees during its anti-entropy repair process to identify and synchronize differences between replicas.
  2. Read Repair and Write Repair:
    • Read Repair: As previously explained, read repair involves detecting and correcting inconsistencies during read operations. This is a form of anti-entropy because it actively seeks to eliminate divergences by updating replicas with the most recent data version when a discrepancy is detected during a read.
    • Write Repair: Similar to read repair, but corrections are made during write operations. When new data is written or existing data is updated, the system ensures that all replicas are immediately brought up to date, hence maintaining consistency.
  3. Gossip Protocols:
    • Concept: Gossip protocols involve nodes periodically sharing state information about themselves and other nodes they know about in a manner akin to how gossip spreads in social networks.
    • Usage: Each node in a distributed system periodically exchanges state information with a randomly chosen other node. This state information typically includes data about the node’s view of the system and can include version vectors or other metadata that track data state.
    • Purpose: The continuous, random mixing of state information helps in quickly propagating updates across all nodes and correcting any discrepancies, thereby maintaining system consistency over time.
  • Data Consistency: Ensures all nodes in a distributed system eventually hold the same data, despite any failures or inconsistencies.
  • Fault Tolerance: Increases the resilience of the system to network failures, data corruption, or other anomalies.
  • Scalability: Allows the system to scale effectively, as the mechanisms can handle the synchronization of data across a large number of nodes without central coordination.

Anti-entropy mechanisms are essential in distributed systems to manage data consistency proactively. By continuously addressing and correcting data discrepancies, these mechanisms help ensure that the system operates reliably and efficiently, even as it scales and faces various operational challenges.
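
A minimal Python sketch of the Merkle-tree comparison idea (hypothetical key ranges and hashes, not Cassandra's actual repair code): each replica hashes its key ranges into leaves, the two roots are compared first, and only ranges whose hashes differ need to be synchronized. A real exchange would walk down the tree branch by branch; here we jump straight to the leaf comparison once the roots differ.

import hashlib

def leaf_hash(items):
    # Hash one key range (a dict of key/value pairs) into a single digest.
    m = hashlib.sha256()
    for k, v in sorted(items.items()):
        m.update(f"{k}={v};".encode())
    return m.hexdigest()

def root_hash(leaf_hashes):
    # Build parent levels bottom-up until a single root hash remains.
    level = list(leaf_hashes) or [""]
    while len(level) > 1:
        level = [hashlib.sha256("".join(pair).encode()).hexdigest()
                 for pair in zip(level[0::2], level[1::2] + [""])]
    return level[0]

def ranges_to_sync(leaves_a, leaves_b):
    # leaves_*: dict of key_range -> leaf hash; both replicas are assumed to
    # use the same key-range layout.
    if root_hash(leaves_a.values()) == root_hash(leaves_b.values()):
        return []                       # roots match: replicas already agree
    return [rng for rng in leaves_a if leaves_a[rng] != leaves_b.get(rng)]

a = {"k01-k09": leaf_hash({"k01": "v1"}), "k10-k19": leaf_hash({"k12": "old"})}
b = {"k01-k09": leaf_hash({"k01": "v1"}), "k10-k19": leaf_hash({"k12": "new"})}
print(ranges_to_sync(a, b))             # -> ['k10-k19']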

62
Q

Quorum writes and reads

A

You know it
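
For reference, a small sketch of the standard quorum condition, assuming n replicas per key, w required write acknowledgements, and r replicas contacted per read:

def is_strict_quorum(n, w, r):
    # w + r > n guarantees that the read set and the write set overlap in at
    # least one replica, so a read sees the latest acknowledged write.
    return w + r > n

# Typical Dynamo-style setting: n=3, w=2, r=2 gives a guaranteed overlap.
assert is_strict_quorum(3, 2, 2)
assert not is_strict_quorum(3, 1, 1)    # w=1, r=1 can miss the latest write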

63
Q

What are the edge cases where quorum consistency may fail

A

If a sloppy quorum is used, the w writes may end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r nodes and the w nodes.

Concurrent writes can still conflict: a quorum alone does not give the nodes consensus on write ordering, so conflicts have to be resolved (e.g., last write wins or merging).

If a write succeeds on some replicas but fails on others (ending up below w overall), it is not rolled back on the replicas where it succeeded, so subsequent reads may or may not return that value.

64
Q

What is strong consistency?

A

Everyone reading at the same time would agree on an identical and most up-to-date value for a given key.

65
Q

What are sloppy quorums and hinted handoff

A

In a large cluster (with significantly more than n nodes) it’s likely that the client can connect to some database nodes during the network interruption, just not to the nodes that it needs to assemble a quorum for a particular value.

In that case, database designers face a trade-off:

• Is it better to return errors to all requests for which we cannot reach a quorum of w or r nodes?
• Or should we accept writes anyway, and write them to some nodes that are reachable but aren’t among the n nodes on which the value usually lives?

The latter is known as a sloppy quorum: writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n “home” nodes for a value.

By analogy, if you lock yourself out of your house, you may knock on the neighbor’s door and ask whether you may stay on their couch temporarily. Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes. This is called hinted handoff. (Once you find the keys to your house again, your neighbor politely asks you to get off their couch and go home.)

Thus, a sloppy quorum actually isn’t a quorum at all in the traditional sense. It’s only an assurance of durability, namely that the data is stored on w nodes somewhere. There is no guarantee that a read of r nodes will see it until the hinted handoff has completed.

Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default.
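
A toy Python sketch of the hinted handoff idea described above (node names, data structures, and the store() helper are all made up for illustration): during the interruption a reachable substitute node keeps the write together with a hint naming the home node, and replays it once the home node is reachable again.

from collections import defaultdict

replica_data = defaultdict(dict)    # node -> {key: value}
hints = defaultdict(list)           # substitute node -> [(home_node, key, value)]

def store(node, key, value):
    replica_data[node][key] = value

def sloppy_write(key, value, home_nodes, reachable, w):
    acks, missed = 0, []
    for node in home_nodes:
        if node in reachable:
            store(node, key, value)                 # normal write to a home node
            acks += 1
        else:
            missed.append(node)
    for sub in [n for n in reachable if n not in home_nodes]:
        if acks >= w or not missed:
            break
        home = missed.pop(0)
        store(sub, key, value)                      # temporary copy on a substitute
        hints[sub].append((home, key, value))       # hint: this belongs to `home`
        acks += 1
    return acks >= w                                # sloppy quorum reached?

def handoff(substitute, now_reachable):
    # After the network interruption is fixed, push hinted writes home.
    remaining = []
    for home, key, value in hints[substitute]:
        if home in now_reachable:
            store(home, key, value)
        else:
            remaining.append((home, key, value))
    hints[substitute] = remaining

home = ["n1", "n2", "n3"]
sloppy_write("k", "v", home, reachable=["n1", "n4", "n5"], w=2)
handoff("n4", now_reachable=["n1", "n2", "n3"])     # "v" finally lands on n2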

66
Q

Why do we need database partitioning

A
  1. Data volume
  2. Scalability - a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

Replication is having multiple copies of the same data on different nodes. For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions, also known as sharding.

The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster, and the query load can be distributed across many processors.

Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL, and more recently rediscovered by NoSQL databases and Hadoop-based data warehouses.

Some systems are designed for transactional workloads, and others for analytics; this difference affects how the system is tuned, but the fundamentals of partitioning apply to both kinds of workloads.

DDIA 199

67
Q

Range-based partitioning; pros and cons

A

Pro: data locality for range scans, since keys are kept in sorted order within a partition.

Con: risk of hot spots when the key distribution or access pattern is skewed (e.g., timestamp keys funnel all current writes into one partition), and partition boundaries must follow the data distribution (few last names start with X or Z, while A to C may need to be split across several partitions).

If you know the boundaries between the ranges, you can easily determine which partition contains a given key. If you also know which partition is assigned to which node, then you can make your request directly to the appropriate node (or, in the case of the encyclopedia, pick the correct book off the shelf).

However, the downside of key range partitioning is that certain access patterns can lead to hot spots. If the key is a timestamp, then the partitions correspond to ranges of time—e.g., one partition per day. Unfortunately, because we write data from the sensors to the database as the measurements happen, all the writes end up going to the same partition (the one for today), so that partition can be overloaded with writes while others sit idle.

To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor name so that the partitioning is first by sensor name and then by time. Assuming you have many sensors active at the same time, the write load will end up more evenly spread across the partitions. Now, when you want to fetch the values of multiple sensors within a time range, you need to perform a separate range query for each sensor name.

DDIA 202
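
An illustrative sketch of the compound (sensor_name, timestamp) key described above; the partition boundaries are hypothetical. Writes from different sensors spread across partitions, while each sensor's readings stay together for per-sensor range scans.

import bisect

# Upper-exclusive boundaries over (sensor_name, timestamp) keys -> 4 partitions.
boundaries = [("sensor_09", 0), ("sensor_19", 0), ("sensor_29", 0)]

def partition_for(sensor_name, timestamp):
    # Compound keys compare lexicographically, so all readings of one sensor
    # land in the same partition, sorted by time within it.
    return bisect.bisect_right(boundaries, (sensor_name, timestamp))

print(partition_for("sensor_03", 1_700_000_000))    # -> 0
print(partition_for("sensor_25", 1_700_000_000))    # -> 2

# Fetching several sensors for a time window means one range query per sensor.
partitions_to_query = {partition_for(s, 1_700_000_000) for s in ("sensor_03", "sensor_25")}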

68
Q

Hash range based partitioning

A

Each partition is assigned a range of hash values.
Even distribution of keys, fewer hot spots.
No data locality for range queries.

Keys that were once adjacent are now scattered across all the partitions, so their sort order is lost. In MongoDB, if you have enabled hash-based sharding mode, any range query has to be sent to all partitions.

Range queries on the primary key are not supported by Riak, Couchbase, or Voldemort.

Cassandra achieves a compromise between the two partitioning strategies. A table in Cassandra can be declared with a compound primary key consisting of several columns.

Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables. A query therefore cannot search for a range of values within the first column of a compound key, but if it specifies a fixed value for the first column, it can perform an efficient range scan over the other columns of the key.

Bottom of DDIA 203
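
A rough Python sketch of the compromise described above (the hash function, partition count, and storage layout are simplified stand-ins, not Cassandra internals): only the first column of the compound key is hashed to pick the partition, and rows within a partition stay sorted by the remaining columns, so range scans work once the first column is fixed.

import bisect
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8
partitions = defaultdict(lambda: defaultdict(list))   # pid -> partition key -> sorted rows

def partition_of(partition_key):
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def insert(partition_key, clustering_key, value):
    rows = partitions[partition_of(partition_key)][partition_key]
    bisect.insort(rows, (clustering_key, value))       # kept sorted by clustering key

def range_scan(partition_key, lo, hi):
    # Efficient: only one partition is touched, and its rows are already sorted.
    rows = partitions[partition_of(partition_key)][partition_key]
    return [(ck, v) for ck, v in rows if lo <= ck <= hi]

insert("user_123", 1700000100, "post A")
insert("user_123", 1700000200, "post B")
print(range_scan("user_123", 1700000000, 1700000150))  # -> [(1700000100, 'post A')]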

69
Q

Why does hash range partitioning still not perfectly solve hot spots

A

Because certain keys can be extremely active; for example, one celebrity may attract a huge number of comments on her Instagram posts.

Today, most data systems are not able to automatically compensate for such a highly skewed workload, so it’s the responsibility of the application to reduce the skew.

For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different partitions.

However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it.

This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split.

Perhaps in the future, data systems will be able to automatically detect and compensate for skewed workloads; but for now, you need to think through the trade-offs for your own application.
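
A hedged sketch of the key-splitting technique described above; the set of hot keys is assumed to be tracked by the application, and the key format is made up.

import random

HOT_KEYS = {"celebrity:123"}     # known hot keys, tracked by the application
FANOUT = 100                     # two-digit decimal suffix -> 100 sub-keys

def write_key(key):
    # Writes to a hot key are spread across 100 sub-keys (and thus partitions).
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(FANOUT):02d}"
    return key

def read_keys(key):
    # Reads of a hot key must fan out to all sub-keys and combine the results.
    if key in HOT_KEYS:
        return [f"{key}#{i:02d}" for i in range(FANOUT)]
    return [key]

print(write_key("celebrity:123"))        # e.g. celebrity:123#37
print(len(read_keys("celebrity:123")))   # 100
print(read_keys("user:42"))              # ['user:42'] -- no overhead for cold keys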

70
Q

How to optimize hash range key partitioning for range queries

A

Having secondary indexes.

A secondary index usually doesn’t identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user 123, find all articles containing the word hogwash, find all cars whose color is red, and so on.

71
Q

Local secondary indexes

A

A second, locally sorted index on each partition, covering only that partition's own data. Pro: no extra writes over the network when a document is updated. Con: reads must query every single partition and combine the results.

In this indexing approach, each partition is completely separate: each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions.

Whenever you need to write to the database—to add, remove, or update a document—you only need to deal with the partition that contains the document ID that you are writing. For that reason, a document-partitioned index is also known as a local index (as opposed to a global index, described in the next section).

However, reading from a document-partitioned index requires care: unless you have done something special with the document IDs, there is no reason why all the cars with a particular color or a particular make would be in the same partition. Thus, if you want to search for red cars, you need to send the query to all partitions, and combine all the results you get back. This approach to querying a partitioned database is sometimes known as scatter/gather, and it can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification (see “Percentiles in Practice” on page 16).

Nevertheless, it is widely used: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB all use document-partitioned secondary indexes.

Most database vendors recommend that you structure your partitioning scheme so that secondary index queries can be served from a single partition, but that is not always possible, especially when you’re using multiple secondary indexes in a single query (such as filtering cars by color and by make at the same time).
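
An illustrative scatter/gather over document-partitioned (local) secondary indexes; the partition layout and data are hypothetical. Even with parallel queries, the slowest partition determines latency, which is the tail latency amplification mentioned above.

from concurrent.futures import ThreadPoolExecutor

# Each partition keeps its own local secondary index: color -> [document ids].
partition_indexes = [
    {"red": [101, 103], "blue": [102]},
    {"red": [205]},
    {"blue": [301, 302], "red": [307]},
]

def query_partition(index, color):
    return index.get(color, [])

def scatter_gather(color):
    # The query is sent to every partition and the results are combined.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda idx: query_partition(idx, color), partition_indexes)
    return [doc_id for part in results for doc_id in part]

print(scatter_gather("red"))    # -> [101, 103, 205, 307]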

72
Q

Global secondary indexes

A

Take the aggregated dataset across partitions and we split that into pieces which are stored across nodes.

A global index must also be partitioned, but it can be partitioned differently from the primary key index.

The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).

In an ideal world, the index would always be up to date, and every document written to the database would immediately be reflected in the index. However, in a term-partitioned index, that would require a distributed transaction across all partitions affected by a write, which is not supported in all databases. In practice, updates to global secondary indexes are often asynchronous (that is, if you read the index shortly after a write, the change you just made may not yet be reflected in the index).

For example, Amazon DynamoDB states that its global secondary indexes are updated within a fraction of a second in normal circumstances, but may experience longer propagation delays in cases of faults in the infrastructure.

73
Q

Why do we need 2 phase commits

A

Atomicity for distributed transactions.

A single transaction may involve cross-partition writes, especially when it writes to two different tables that are partitioned differently.

Keeping a global secondary index in sync with the base data is a common example.

74
Q

How is two phase commit implemented

A

Coordinator; prepare request and track responses from all participants; commit or abort requests

2PC uses a new component that does not normally appear in single-node transactions: a coordinator (also known as transaction manager). The coordinator is often implemented as a library within the same application process that is requesting the transaction (e.g., embedded in a Java EE container), but it can also be a separate process or service. Examples of such coordinators include Narayana, JOTM, BTM, or MSDTC.

A 2PC transaction begins with the application reading and writing data on multiple database nodes, as normal. We call these database nodes participants in the transaction. When the application is ready to commit, the coordinator begins phase 1: it sends a prepare request to each of the nodes, asking them whether they are able to commit.

The coordinator then tracks the responses from the participants:
• If all participants reply “yes,” indicating they are ready to commit, then the coordinator persists a new entry into its commit log, then sends out a commit request in phase 2, and the commit actually takes place.
• If any of the participants replies “no,” the coordinator sends an abort request to all nodes in phase 2.

This process is somewhat like the traditional marriage ceremony in Western cultures: the minister asks the bride and groom individually whether each wants to marry the other, and typically receives the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say “yes,” the ceremony is aborted.

DDIA 356
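
A highly simplified in-memory sketch of the two phases (the class names and the commit-log list are illustrative; a real coordinator persists its decision durably and retries commit requests until they succeed).

from dataclasses import dataclass, field

@dataclass
class Transaction:
    tx_id: int
    writes: dict

@dataclass
class Participant:
    name: str
    data: dict = field(default_factory=dict)
    can_commit: bool = True

    def prepare(self, tx):
        return self.can_commit      # "yes" is a promise that commit cannot fail later

    def commit(self, tx):
        self.data.update(tx.writes)

    def abort(self, tx):
        pass                        # discard any provisional state

def two_phase_commit(commit_log, participants, tx):
    # Phase 1: ask every participant whether it is able to commit.
    if all(p.prepare(tx) for p in participants):
        commit_log.append(("commit", tx.tx_id))   # decision is final once logged
        for p in participants:
            p.commit(tx)                          # phase 2: retried until success
        return "committed"
    commit_log.append(("abort", tx.tx_id))
    for p in participants:
        p.abort(tx)
    return "aborted"

log = []
nodes = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(log, nodes, Transaction(1, {"x": 42})))    # -> aborted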

75
Q

Cons of two phase commits

A

Coordinator failure: if the coordinator crashes after participants have voted "yes" but before the decision is broadcast, the participants are left in doubt, holding locks until the coordinator recovers.

Participant failures: if a commit request to a participant fails or times out, the coordinator must retry forever until it succeeds. There is no going back: if the decision was to commit, that decision must be enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the transaction will be committed when it recovers, since the participant voted "yes" and cannot refuse to commit after recovery.

76
Q

Why we should avoid distributed transactions

A

Not fault tolerant, due to the problems of two-phase commit (a crashed coordinator leaves participants in doubt, holding locks), and typically slow.

77
Q

Why do we need consistent hashing

A

Minimize data movement over network. Great for partitioning and load balancing.

Consistent hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring.

This method is particularly useful in distributed systems because it minimizes the number of reassignments when a server is added or removed, thus providing better scalability and manageability.

In consistent hashing, both the data items (keys) and servers (nodes) are hashed onto a hash ring using a hash function. The hash function typically produces a fixed range of numerical outputs (for example, a 32-bit integer space). Here’s a step-by-step explanation of how consistent hashing works:

  1. Hashing Nodes and Data:
    • Each node in the system is assigned a hash value by hashing some characteristic (like its IP address or hostname). These values are treated as positions on a circular hash ring.
    • Data keys are also hashed using the same hash function, mapping them to positions on the same ring.
  2. Assigning Data to Nodes:
    • To determine which node is responsible for a given key, the system finds the nearest node whose hash value is greater than or equal to the key’s hash. This node is often referred to as the “successor” node of the key.
    • If the key’s hash is greater than every node’s hash, the key wraps around and is mapped to the node with the lowest hash value, due to the circular nature of the ring.
  3. Handling Node Changes:
    • Adding a Node: When a new node joins the network, only the keys that are mapped between the new node’s hash and the hash of its immediate predecessor node need to be reassigned to the new node.
    • Removing a Node: When a node leaves or fails, its keys are reassigned to the node that immediately follows it on the hash ring.

To improve the load distribution and reassignment efficiency, consistent hashing often employs a concept called “virtual nodes”:
- Virtual Nodes (or Vnodes): Each node can be represented multiple times on the hash ring by creating multiple virtual nodes (hashes) for each server. These virtual nodes are spread out around the hash ring.
- By using virtual nodes, the system can balance the load more evenly, especially when the hash function doesn’t distribute nodes uniformly across the hash ring.
- When a physical node is added or removed, only a small proportion of keys are reassigned, because each physical node is responsible for multiple points on the ring.

Consistent hashing is widely used in various distributed systems and caching solutions, including:
- Distributed Caches: Systems like Memcached use consistent hashing to distribute cache items across multiple servers.
- Distributed Databases: NoSQL databases like Apache Cassandra and DynamoDB use consistent hashing to distribute data across multiple nodes, ensuring even data distribution and efficient node scaling.

Benefits:
  • Scalability: It significantly reduces the need for reshuffling all keys with every change in the set of servers.
  • Load Balancing: Even distribution of keys helps in load balancing across the servers.
  • Fault Tolerance: Minimizes disruptions when nodes are added or removed, providing better fault tolerance.

Consistent hashing is a cornerstone technology in distributed systems, addressing key challenges related to scalability and performance. Its ability to minimize reorganization and rebalance loads efficiently makes it highly suitable for large-scale, dynamic environments.
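
A compact hash-ring sketch with virtual nodes (the hash function and vnode count are arbitrary choices for illustration, not any particular system's defaults).

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []                           # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def _pos(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Virtual nodes spread each physical server around the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._pos(f"{node}#{i}"), node))

    def remove_node(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def node_for(self, key):
        # Walk clockwise to the first position >= the key's hash, wrapping around.
        idx = bisect.bisect_right(self.ring, (self._pos(key),))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner_before = ring.node_for("user:42")
ring.add_node("node-d")              # only keys near node-d's positions move
owner_after = ring.node_for("user:42")
print(owner_before, owner_after)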

78
Q

Fixed total number of partitions

A

Operationally simpler.
Choosing the right number is difficult.
Too many -> overhead of managing each partition.
Too few -> each partition becomes large, so rebalancing and recovery become expensive.

A fixed number of partitions is operationally simpler, and so many fixed-partition databases choose not to implement partition splitting.

Thus, the number of partitions configured at the outset is the maximum number of nodes you can have, so you need to choose it high enough to accommodate future growth. However, each partition also has management overhead, so it’s counterproductive to choose too high a number.

Choosing the right number of partitions is difficult if the total size of the dataset is highly variable (for example, if it starts small but may grow much larger over time). Since each partition contains a fixed fraction of the total data, the size of each partition grows proportionally to the total amount of data in the cluster. If partitions are very large, rebalancing and recovery from node failures become expensive. But if partitions are too small, they incur too much overhead. The best performance is achieved when the size of partitions is “just right,” neither too big nor too small, which can be hard to achieve if the number of partitions is fixed but the dataset size varies.

DDIA 211

79
Q

Dynamic partitioning and caveat

A

For databases that use key range partitioning, a fixed number of partitions with fixed boundaries would be very inconvenient: if you got the boundaries wrong, you could end up with all of the data in one partition and all of the other partitions empty. Reconfiguring the partition boundaries manually would be very tedious.

For that reason, key range–partitioned databases such as HBase and RethinkDB create partitions dynamically. When a partition grows to exceed a configured size (on HBase, the default is 10 GB), it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This process is similar to what happens at the top level of a B-tree. Each partition is assigned to one node, and each node can handle multiple partitions, like in the case of a fixed number of partitions. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load. In the case of HBase, the transfer of partition files happens through HDFS, the underlying distributed filesystem.

An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume. If there is only a small amount of data, a small number of partitions is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual partition is limited to a configurable maximum.

HBase and MongoDB allow an initial set of partitions to be configured on an empty database (this is called pre-splitting). In the case of key-range partitioning, pre-splitting requires that you already know what the key distribution is going to look like.

Dynamic partitioning is not only suitable for key range–partitioned data, but can equally well be used with hash-partitioned data. MongoDB since version 2.4 supports both key-range and hash partitioning, and it splits partitions dynamically in either case.

DDIA 212

80
Q

What are linearizable databases

A

This is the idea behind linearizability (also known as atomic consistency, strong consistency, immediate consistency, or external consistency).

The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas in reality, the application does not need to worry about them.

In a linearizable system, as soon as one client successfully completes a write, all clients reading from the database must be able to see the value just written. Maintaining the illusion of a single copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and doesn’t come from a stale cache or replica. In other words, linearizability is a recency guarantee. To clarify this idea, let’s look at an example of a system that is not linearizable.

DDIA 324

81
Q

How do we order writes

A

Single-leader replication - the replication log.

Multi-leader replication - version vectors and Lamport clocks.

82
Q

Lamport timestamps

A

DDIA 345

Every node and client keeps track of the maximum counter value it has seen so far, and includes that maximum on every write request.

As long as the maximum counter is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased Lamport timestamp.

————————————-
There is actually a simple method for generating sequence numbers that is consistent with causality. It is called a Lamport timestamp, proposed in 1978 by Leslie Lamport, in what is now one of the most-cited papers in the field of distributed systems.

The use of Lamport timestamps is illustrated in a figure in DDIA. Each node has a unique identifier, and each node keeps a counter of the number of operations it has processed. The Lamport timestamp is then simply a pair of (counter, node ID). Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, each timestamp is made unique.

A Lamport timestamp bears no relationship to a physical time-of-day clock, but it provides total ordering: if you have two timestamps, the one with a greater counter value is the greater timestamp; if the counter values are the same, the one with the greater node ID is the greater timestamp.

The key idea about Lamport timestamps, which makes them consistent with causality, is the following: every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

As long as the maximum counter value is carried along with every operation, this scheme ensures that the ordering from the Lamport timestamps is consistent with causality, because every causal dependency results in an increased timestamp.

Lamport timestamps are sometimes confused with version vectors. Although there are some similarities, they have a different purpose: version vectors can distinguish whether two operations are concurrent or whether one is causally dependent on the other, whereas Lamport timestamps always enforce a total ordering. From the total ordering of Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent. The advantage of Lamport timestamps over version vectors is that they are more compact.

The problem here is that the total order of operations only emerges after you have collected all of the operations. If another node has generated some operations, but you don’t yet know what they are, you cannot construct the final ordering of operations: the unknown operations from the other node may need to be inserted at various positions in the total order.
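
A minimal Lamport clock sketch of the counter-merging rule described above (in-memory objects stand in for nodes; message passing is reduced to handing over the counter value).

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Local event or outgoing request: increment and return the timestamp.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, received_counter):
        # On receiving a request/response, jump to the maximum counter seen so far.
        self.counter = max(self.counter, received_counter)
        return self.tick()

# Total order: compare counters first, break ties by node ID.
a, b = LamportClock(node_id=1), LamportClock(node_id=2)
t1 = a.tick()                 # (1, 1)
t2 = b.observe(t1[0])         # (2, 2): causally after t1, so its timestamp is larger
assert t1 < t2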

83
Q

Total order broadcast

A

DDIA 348

Every node has to agree on the order of writes. In the face of faults, we cannot lose any writes.

Partitioned databases with a single leader per partition often maintain ordering only per partition, which means they cannot offer consistency guarantees (e.g., consistent snapshots, foreign key references) across partitions. Total ordering across all partitions is possible, but requires additional coordination.

Total order broadcast is usually described as a protocol for exchanging messages between nodes. Informally, it requires that two safety properties always be satisfied:

Reliable delivery
No messages are lost: if a message is delivered to one node, it is delivered to all nodes.

Totally ordered delivery
Messages are delivered to every node in the same order. A correct algorithm for total order broadcast must ensure that the reliability and ordering properties are always satisfied, even if a node or the network is faulty. Of course, messages will not be delivered while the network is interrupted, but an algorithm can keep retrying so that the messages get through when the network is eventually repaired (and then they must still be delivered in the correct order)

84
Q

Raft

A

It is a consensus algorithm designed to manage a replicated log across distributed systems.

The replicated log imposes a total order on operations, which can be used to provide linearizable semantics.

85
Q

Raft leader election

A

Raft is a consensus algorithm designed for managing a replicated log across distributed systems. It’s known for its understandability and the way it handles “leader election” as part of ensuring that there’s a consistent system state across all nodes in the cluster. Here’s a detailed explanation of the steps involved in the Raft leader election process:

  • All nodes start as followers: When servers start up, they begin as followers in the Raft protocol. Each node remains in follower state unless it times out (i.e., does not hear from a leader or receive valid heartbeats within a certain period).
  • Timeout and candidacy: If a follower does not receive communication from the leader within a random election timeout period, it assumes there is no active leader and transitions to a candidate state to start an election. The election timeout is randomized for each follower to reduce the chances of two nodes starting their elections at the same time.
  • Increment term: Upon transitioning to the candidate state, the node increments its current term (a continuously growing number representing the current election round).
  • Vote for self: The candidate votes for itself first and then sends out RequestVote RPCs (remote procedure calls) to all other nodes in the cluster, asking them to vote for it.
  • Majority needed: The candidate must gather a majority of votes from the cluster to become the leader. This majority ensures that there is agreement among the cluster majority.
  • Handling split votes: If votes are split and no candidate wins a majority, a new election term begins, and nodes will wait for another election timeout to trigger a new election.
  • Becoming a leader: If a candidate receives a majority of votes from the cluster, it transitions to the leader state. It begins to send regular heartbeats to all followers to assert its authority and prevent new elections.
  • Reset if no majority: If the candidate does not receive a majority (due to a split vote or other issues), the election term ends, and nodes will reset their election timers. The process might repeat until a leader is elected.
  • Heartbeats: Once a leader is established, it regularly sends heartbeat messages to all followers to maintain its authority and prevent followers from triggering new elections. These heartbeats also help in log replication processes, although they carry no log entries themselves.
  • Stepping down: If, at any point, a leader or candidate discovers a higher term number (either from another node’s RPC or during communications), it immediately reverts to follower state. This behavior ensures that the node respects the newer, possibly more up-to-date authority.

This leader election process in Raft ensures that the system can recover from failures and maintain a consistent state across distributed components. The design of Raft, particularly with its strong leader and structured approach to elections and term management, makes it well-suited for a variety of distributed systems that require robust fault tolerance and data consistency.
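
A stripped-down sketch of the candidacy and vote-granting rules above (it omits RPC plumbing, heartbeats, and the log up-to-date check that a real Raft voter also applies; the names are illustrative).

ELECTION_TIMEOUT_MS = (150, 300)    # randomized per node to reduce split votes

class RaftNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None

    def on_election_timeout(self, peers):
        # No heartbeat within the timeout: become a candidate and request votes.
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1 + sum(p.handle_request_vote(self.current_term, self.node_id)
                        for p in peers)
        if votes > (len(peers) + 1) // 2:        # majority of the whole cluster
            self.state = "leader"
        return self.state

    def handle_request_vote(self, term, candidate_id):
        if term > self.current_term:             # newer term seen: step down
            self.current_term, self.voted_for, self.state = term, None, "follower"
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id        # at most one vote per term
            return True
        return False

nodes = [RaftNode(i) for i in range(3)]
print(nodes[0].on_election_timeout(nodes[1:]))   # -> leader (2 of 3 votes incl. self)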

86
Q

Why is RAM short for “random access”?

A

Direct addressing capability: any memory address can be read or written directly in roughly constant time, without scanning through the preceding data (in contrast to sequential-access media such as tape).

87
Q

What exactly is database index

A

An index is an additional data structure that is derived from the primary data. Well-chosen indexes speed up read queries, but every index slows down writes.

88
Q

Operation-based CRDT (CmRDT) downsides

A

When there are causal relationships between operations and their order matters, there can be problems. We need a causally consistent broadcast system where messages do not get dropped and also don’t get duplicated (unless the operations are idempotent).

Idempotent - applying the same operation many times has the same effect as applying it once.

89
Q

How does a state-based CRDT handle causal relationships (e.g., removals that depend on prior additions)

A

Handling both addition and removal operations in a state-based CRDT (CvRDT) can be complex because simple state-based data structures like Grow-Only Sets (G-Sets) or counters do not support the removal of elements directly. To manage this, a more sophisticated type of CRDT is used, typically known as a 2P-Set (Two-Phase Set) or a PN-Counter (Positive-Negative Counter) for scenarios involving counters.

The 2P-Set is a state-based CRDT designed to handle both additions and deletions:

  1. Two Sets: It comprises two sets:
    • Add Set (A-Set): Stores all elements that have been added.
    • Remove Set (R-Set): Stores all elements that have been removed.
  2. Add Operation: When an element is added, it is inserted into the A-Set.
  3. Remove Operation: When an element is removed, it is added to the R-Set. For an element to be considered removed, it must be present in both the A-Set and the R-Set.
  4. Membership Test: To determine if an element is in the 2P-Set, check if it is in the A-Set and not in the R-Set.
  5. Merge Operation: Merging two 2P-Sets involves taking the union of both the A-Sets and the R-Sets of the two CRDT instances.

This approach ensures that once an element has been added and subsequently removed, it cannot be re-added, as its presence in the R-Set permanently excludes it from membership, reflecting the “removal” operation.

For counters that need to handle increments and decrements (additions and removals), a PN-Counter can be used:

  1. Two Counters: Consists of two separate counters:
    • P-Counter: For positive increments.
    • N-Counter: For negative decrements.
  2. Increment Operation: Increments are applied to the P-Counter.
  3. Decrement Operation: Decrements are applied to the N-Counter.
  4. Value Calculation: The actual value of the PN-Counter is computed as the difference between the P-Counter and the N-Counter.
  5. Merge Operation: Merging involves taking the element-wise maximum of both P-Counters and N-Counters from two PN-Counter instances.

The basic 2P-Set does not allow the re-adding of elements once they are removed because the R-Set permanently records their removal. To overcome this limitation and handle more complex scenarios (like re-adding previously removed items), a more advanced type of CRDT called an OR-Set (Observed-Removed Set) can be used:

  • OR-Set: Each element in the OR-Set is associated with a unique tag when added, and instead of a single R-Set, a collection of tags is maintained for removed elements. This setup allows elements to be re-added with a new tag, distinguishing different instances of the same element over time.

These enhancements to the basic state-based CRDT models ensure that they can effectively handle the dynamics of add and remove operations in distributed systems, providing flexibility and robustness needed for real-world applications.
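
A minimal state-based 2P-Set sketch matching the description above; merge is a union of both component sets, which is commutative, associative, and idempotent, so replicas converge regardless of merge order.

class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()      # "tombstones": removal is permanent

    def add(self, element):
        self.added.add(element)

    def remove(self, element):
        if element in self.added:
            self.removed.add(element)

    def __contains__(self, element):
        return element in self.added and element not in self.removed

    def merge(self, other):
        merged = TwoPhaseSet()
        merged.added = self.added | other.added
        merged.removed = self.removed | other.removed
        return merged

replica_a, replica_b = TwoPhaseSet(), TwoPhaseSet()
replica_a.add("x")
replica_b.add("x")
replica_b.remove("x")
assert "x" not in replica_a.merge(replica_b)    # the removal wins after merging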

90
Q

What happens to the WAL when a transaction is partially done, with some writes already logged, and is then aborted because a subsequent write failed?

A

In the scenario where a transaction needs to be aborted after a write has already been recorded in the Write-Ahead Log (WAL) but a subsequent write within the same transaction fails, the typical approach is not to remove the original entry from the WAL. Instead, the system handles this by adding an additional entry to indicate the abort operation. Here’s how this process generally works:

  1. WAL Entries Are Immutable:
    • Once a write operation is recorded in the WAL, it is considered immutable. This is crucial for ensuring the reliability of the log, especially for recovery purposes. Removing entries from the WAL could lead to inconsistencies and complicate the recovery process.
  2. Logging the Abort:
    • If a transaction needs to be aborted, the database system writes an “abort” entry into the WAL. This entry indicates that the transaction has been rolled back.
    • During recovery, or when processing the WAL entries during normal operations, the presence of an abort entry means that the database should disregard the effects of any writes logged by the aborted transaction. If the changes were applied to the database before the abort (due to write-ahead logging), the system will use the WAL to undo those changes to ensure the database reflects a consistent state.
  3. Undo Operations:
    • Most database systems maintain enough information in the WAL to undo changes made by aborted transactions. This typically involves reversing the changes made by the transaction using the before-images or undo records stored in the WAL.
  4. Transaction Rollback:
    • When the abort entry is processed, the system performs a rollback operation. This includes undoing any changes that were applied to the database but not yet made permanent. The rollback operation ensures that the database remains consistent, reflecting only the effects of successfully committed transactions.

Consider a transaction that involves multiple write operations. Suppose the first write succeeds and is logged in the WAL, but the second write fails due to a constraint violation or some other issue:
- The system will log an abort entry into the WAL.
- Any changes made by the first write operation will be undone using the information from the WAL, maintaining database consistency.
- The transaction’s locks are released, and any resources held by the transaction are cleaned up.

  • Reliability and Simplicity: Keeping the WAL entries immutable simplifies the design of the recovery process. The WAL needs to be a complete and accurate record of all attempts to modify the database, whether successful or not.
  • Consistency: By writing abort entries and supporting comprehensive undo operations, the system ensures that it can always return to a consistent state, even after crashes or errors.

In database systems using WAL, the standard practice for handling transaction aborts after some operations have been logged is to add an abort entry to the WAL rather than attempting to remove or alter existing entries. This approach maintains the integrity and consistency of the WAL and by extension, the database itself, ensuring that it can recover correctly from any point of failure.
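
A toy in-memory sketch of the abort path (not a real DBMS’s WAL format): the abort is logged as another append-only record, and the before-images already in the log drive the undo.

wal = []                       # append-only; entries are never removed

def begin(tx_id):
    wal.append(("BEGIN", tx_id))

def write(tx_id, key, before, after, db):
    wal.append(("WRITE", tx_id, key, before, after))   # log intent before applying
    db[key] = after

def abort(tx_id, db):
    # Append an ABORT record, then undo this transaction's writes in reverse
    # order using the before-images stored in the WAL.
    wal.append(("ABORT", tx_id))
    for record in reversed(wal):
        if record[0] == "WRITE" and record[1] == tx_id:
            _, _, key, before, _ = record
            db[key] = before

db = {"balance": 100}
begin(1)
write(1, "balance", before=100, after=50, db=db)
abort(1, db)                   # a later write "failed", so roll the transaction back
assert db["balance"] == 100 and ("ABORT", 1) in wal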

91
Q

The process of a write to the database

A

Summary of the Process

  • Transaction Begins: A transaction starts, making provisional writes in volatile memory.
  • Operations Executed: Various data manipulation operations occur within the context of the transaction.
  • Pre-Commit State: Changes are prepared but not yet permanent, ensuring they can be rolled back if needed.
  • Commit: If all operations are valid, the transaction is committed—changes are written to durable storage, making them permanent. This involves WAL logging. This ensures that once a transaction is complete, its effects are persistently stored and can be recovered in case of system restarts. This process often involves mechanisms like Write-Ahead Logging (WAL), where changes are first recorded in a log file before they are applied to the database.
  • Post-Commit: Once committed, the changes become an integral part of the database, visible and persistent across sessions and system restarts.

These concepts form the backbone of transaction processing systems and are critical for ensuring that databases remain reliable, consistent, and robust against both logical and physical failures.

92
Q

Fixed number of partitions per node

A

A third option, used by Cassandra, is to have a fixed number of partitions per node.

In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again.

Since a larger data volume generally requires a larger number of nodes to store, this approach also keeps the size of each partition fairly stable.

When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. The randomization can produce unfair splits, but when averaged over a larger number of partitions (in Cassandra, 256 partitions per node by default), the new node ends up taking a fair share of the load from the existing nodes.

Cassandra 3.0 introduced an alternative rebalancing algorithm that avoids unfair splits. Picking partition boundaries randomly requires that hash-based partitioning is used (so the boundaries can be picked from the range of numbers produced by the hash function). Indeed, this approach corresponds most closely to the original definition of consistent hashing. Newer hash functions can achieve a similar effect with lower metadata overhead.

93
Q

Shortcomings of dynamic partitioning

A

Data redistribution and rebalancing.
Resource intensive.
If the partitioning algorithm is not good, frequent redistribution could impact the system performance and increase latency.

  1. Complexity of Data Management:
    • Data Redistribution: when nodes are added or removed, dynamic partitioning often requires data to be redistributed across the existing nodes to maintain balance, which is resource-intensive. Also ensuring data consistency during the redistribution process can be challenging, especially in systems that require strong consistency guarantees.
  2. Overhead of Rebalancing:
    • Performance Impact: Rebalancing data to new or existing nodes involves moving large amounts of data across the network, which can significantly impact system performance and increase latency.
    • Operational Disruption: During rebalancing, the system may experience temporary degradation in performance, which could affect application responsiveness and user experience.
  3. Increased Latency:
    • Network Delays: In distributed systems, particularly those spread across geographically dispersed locations, network latency can significantly affect the time it takes to access data from different partitions.
    • Uneven Data Distribution: If the partitioning algorithm does not distribute data evenly, some nodes may end up with a higher load, leading to increased response times for queries that access these nodes.
  4. Hotspots and Skewed Workloads:
    • Dynamic partitioning algorithms may not perfectly distribute data, especially in cases where data access patterns are uneven or change over time. This can lead to “hotspots” where certain nodes handle a disproportionate amount of traffic, leading to performance bottlenecks.
  5. Partitioning Algorithm Sensitivity:
    • The effectiveness of dynamic partitioning heavily depends on the partitioning algorithm used. Poorly designed algorithms can lead to inefficient data distribution, frequent rebalancing, or increased conflict rates in transactional systems.
  6. Handling Node Failures:
    • Fault Tolerance: While dynamic partitioning can improve fault tolerance by distributing data across many nodes, managing the failover process can be complex. Ensuring that data is not lost when nodes fail and that the system can quickly recover and redistribute the load adds to the complexity.
    • Data Recovery: Recovering data and rebalancing the system after a node failure can be a time-consuming process, during which system performance might be compromised.
  7. Cost:
    • The infrastructure and operational costs associated with dynamically partitioned systems can be higher due to the need for more sophisticated management tools, monitoring solutions, and the overhead of managing a distributed architecture.

While dynamic partitioning presents challenges, it is still highly effective for specific types of applications, especially those requiring high scalability, such as large-scale web applications, real-time data processing systems, and distributed databases. The key is to carefully design the partitioning logic and choose suitable tools that can manage the complexities involved effectively.

In conclusion, while dynamic partitioning offers significant advantages for scalability and flexibility, it introduces complexities related to data management, rebalancing overheads, and system performance. Careful design, robust implementation of partitioning algorithms, and efficient operational practices are essential to mitigate these shortcomings and leverage the full potential of dynamic partitioning in distributed systems.

94
Q

Are replication logs and WAL the same thing?

A

In many distributed databases, WAL and replication logs can be integrated into a single mechanism to simplify operations and reduce overhead.

However, some systems might keep them separate, depending on the architectural requirements and specific functionalities needed:

1.	Integrated Approach:
•	Systems like PostgreSQL with streaming replication utilize WAL for both recovery and as a replication log. In such setups, the same WAL files are used to recover the primary database and to stream changes to replicas.
•	This approach ensures consistency and simplifies the logging architecture by using a single unified log.
2.	Separate Logs:
•	In some architectures, especially where different requirements exist for local transaction logging and remote data replication, WAL and replication logs might be kept separate.
•	For example, a system might use WAL for detailed transaction recovery and a separate streamlined replication log optimized for network transmission for updating replicas.

Key Considerations:

•	Performance vs. Consistency: Integrated logs can optimize performance and ensure consistency but may require more bandwidth and processing power for transmitting detailed logs over the network.
•	Scalability and Complexity: Separate logs allow fine-tuning and optimization specific to recovery and replication processes but at the cost of increased system complexity.
95
Q

Real life examples of linearizability

A

Linearizability is a strong form of consistency that ensures all operations on a system appear instantaneously and atomically, preserving the order in which they occur. This model is critical for systems where operations must reflect a single, coherent order across all nodes, ensuring that any read reflects the most recent write. Systems requiring linearizability are often those where data accuracy and consistency are paramount for the correct operation and where users or processes might access the data concurrently from multiple locations.

  1. Financial Systems:
    • Banks and Trading Platforms: These require linearizability to accurately process transactions such as transfers, payments, and trades without discrepancies. For example, when transferring money between accounts, it is crucial that the debit from one account and the credit to another appear instantaneously to ensure that the accounts always reflect a correct and legal state.
    • ATM Transactions: Ensuring that a balance check immediately after a withdrawal reflects the withdrawn amount.
  2. E-commerce Platforms:
    • Inventory Management: Ensuring that the inventory count is accurate when items are purchased online. Linearizability prevents the system from selling more items than are in stock, which is crucial during high-traffic periods like sales or promotions.
  3. Distributed Databases and Data Stores:
    • Systems like Google Spanner, which provide globally-distributed database services, rely on linearizability to ensure that all nodes show the same data at any time. This is crucial for applications that require global reach with consistent data views.
    • NoSQL databases like Apache Cassandra can be configured for linearizable consistency (via Lightweight transactions) to ensure that certain operations, like conditional writes, are safely coordinated.
  4. Real-Time Multiplayer Gaming:
    • Ensuring that all players in a game see the same game state simultaneously. Linearizability is crucial for actions that have immediate consequences on the game state, influencing all players’ views and interactions.
  5. Online Ticketing Systems:
    • Systems used for booking seats at events or on flights must reflect changes universally the moment they occur. This prevents the system from double-booking a given seat.
  6. Distributed Systems Providing Locks and Leader Election:
    • Systems like Apache ZooKeeper use linearizability to manage distributed locks and leader election processes. These mechanisms ensure that, at any given time, all nodes in the system agree on which node holds the lock or who is the leader.

Benefits:
- Consistency and Simplicity: Linearizability simplifies the development and reasoning of distributed applications by providing a model that closely mimics a single, atomic system.
- Correctness: Guarantees the correctness of operations, especially in environments where decisions based on data state are critical.

Challenges:
- Performance and Scalability: Achieving linearizability can come at the cost of increased latency and decreased throughput, particularly in systems distributed across wide geographical areas.
- Complexity: Maintaining linearizability requires complex coordination mechanisms, often involving consensus protocols like Raft or Paxos, which can be challenging to implement and maintain.

Linearizability is crucial in environments where the accuracy and immediate visibility of operations are necessary for system integrity and user experience. While it provides strong consistency guarantees, the trade-offs in terms of performance and system complexity must be carefully considered and balanced against the specific requirements of the application.

96
Q

Implementation of linearizability

A

Implementation Techniques

1.	Consensus Algorithms:
•	Protocols like Raft or Paxos are commonly used to achieve consensus on the order of operations among distributed nodes. These protocols ensure that all nodes agree on a single, linear sequence of operations.
•	Each operation is proposed by a node and must be agreed upon by a majority of nodes before it is considered committed. This prevents split-brain scenarios and ensures a single, consistent order of operations.
2.	Replicated State Machines:
•	Systems implement linearizability by using replicated state machines, where each node in the system runs the same deterministic state machine and processes the same sequence of operations in the same order.
•	Consensus algorithms are used to keep these state machines in sync across nodes.
3.	Timestamps and Versioning:
•	Some systems use logical clocks or timestamps to order operations. Each operation is tagged with a timestamp that represents its position in the global order.
•	Systems like Google Spanner combine timestamps with consensus-based synchronization (using TrueTime, which provides globally synchronized clocks) to ensure linearizability across a distributed database.
4.	Client Interactions and System Responses:
•	To the client, operations must appear instant. This means that once a write operation is acknowledged by the system, any subsequent read must reflect that write or any later writes.
•	Optimistic concurrency control can be employed, where the system checks if the operation would violate linearizability at the point of commit and aborts if necessary.
5.	Write-Ahead Logging (WAL):
•	Systems like databases ensure that every change is recorded in a log before it is applied to the database. This log is crucial for recovering the system to a consistent state in case of failures.

Practical Challenges

•	Network Delays: In a distributed environment, network delays can significantly impact the performance of consensus algorithms and the overall latency of operations.
•	System Overhead: Implementing linearizability involves significant complexity and overhead, as each operation potentially requires coordination and agreement across multiple nodes.
•	Scalability: As the number of nodes increases, the complexity and latency of achieving consensus can also increase, potentially impacting scalability.

Example Use Case: Google Spanner

•	Google Spanner is an example of a globally-distributed database that implements linearizability. It uses a combination of synchronized clocks (TrueTime) and a Paxos-based consensus protocol to ensure that all transactions are linearizable across the globe.
•	This approach allows Spanner to provide not only strong consistency but also globally-distributed transactions, which is particularly challenging given the latency and fault tolerance issues associated with global distribution.

Conclusion

Implementing linearizability in a distributed system is a complex but critical task for systems where the strongest consistency guarantees are required. It requires careful design of data replication, state management, and client interaction protocols. While powerful, the trade-offs in terms of system complexity, performance, and operational overhead must be carefully managed to ensure that the benefits outweigh the costs.

97
Q

How does Raft ensure a node with stale data is not elected as leader

A

Through the election restriction: a RequestVote RPC carries the candidate’s last log index and term, and a node grants its vote only if the candidate’s log is at least as up to date as its own (higher last term, or equal last term and at least as long a log). Because every committed entry is stored on a majority of nodes, a candidate that is missing committed entries cannot gather a majority of votes.

A new leader may still lack some uncommitted entries that exist on a follower; the leader never pulls data from followers, so such conflicting follower entries are simply overwritten to match the leader’s log.

98
Q

How does Raft backfill stale nodes

A

When a leader is elected in Raft, it immediately takes responsibility for managing the log replication process across all nodes (followers). The leader maintains a nextIndex for each follower, which tracks the index of the next log entry the leader will send to that follower.

  • When the leader attempts to replicate new log entries, it sends AppendEntries RPCs that include log entries starting from nextIndex for each follower.
  • Each AppendEntries RPC includes the term and the index of the entry immediately preceding the new entries being sent. This helps the follower to check whether its log is consistent with the leader’s.
  • If a follower’s log does not contain the entry at the index and term specified by the leader (indicating a mismatch), the follower will reject the AppendEntries RPC.
  • Upon receiving a rejection, the leader decrements the nextIndex for that follower and retries sending the AppendEntries RPC, this time including earlier entries.
  • This process continues, with the leader decrementing the nextIndex and resending AppendEntries RPCs, until the follower’s log entries match the leader’s log up to the point of divergence.
  • Once the logs are consistent (i.e., the follower has all the entries up to the current entry that the leader attempted to replicate), the leader then sends all subsequent log entries to the follower.
  • The follower processes and appends each new entry to its log as it receives successful AppendEntries RPCs from the leader.
  • This method of decrementing the nextIndex and retrying ensures that even significantly stale nodes can be brought fully up-to-date by progressively backfilling all missing entries.
  • It’s crucial that during this catch-up process, log entries are not only replicated but also committed. An entry is considered committed when it is safely replicated on a majority of nodes.
  • The leader keeps track of which entries have been successfully replicated on each node, and updates the commitIndex accordingly. The commitIndex is the highest log index known to be committed.
  • To optimize the process of catching up stale nodes, Raft implementations can include optimizations such as sending a snapshot of the entire state up to a certain point if the node is extremely far behind. This is particularly useful when the missing entries are so numerous that sending them individually would be inefficient, or when the leader has already discarded them through log compaction.
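
A toy sketch of the nextIndex back-off loop described above (a single in-memory follower log; real Raft does this through AppendEntries RPCs and also tracks commitIndex).

def replicate_to_follower(leader_log, follower_log, next_index):
    # Logs are lists of (term, command); next_index starts at len(leader_log)
    # for a freshly elected leader and is decremented on every rejection.
    while True:
        prev = next_index - 1
        prev_term = leader_log[prev][0] if prev >= 0 else None
        follower_prev_term = follower_log[prev][0] if 0 <= prev < len(follower_log) else None
        if prev >= 0 and follower_prev_term != prev_term:
            next_index -= 1                        # mismatch: back off and retry
            continue
        # Consistency point found: overwrite the divergent suffix and append the rest.
        follower_log[next_index:] = leader_log[next_index:]
        return follower_log

leader = [(1, "x=1"), (1, "x=2"), (2, "x=3")]
stale = [(1, "x=1")]
print(replicate_to_follower(leader, stale, next_index=len(leader)))
# -> [(1, 'x=1'), (1, 'x=2'), (2, 'x=3')]
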
99
Q

How does Raft ensure linearizability

A

Raft actually provides strong consistency for single leader replication, but it’s important to clarify what is meant by “strong consistency” in this context and how Raft achieves it.

In the context of distributed systems, strong consistency typically means that any read operation will return the most recent write operation completed by the system to any given client. Raft achieves strong consistency through the following mechanisms:

  1. Leader Exclusivity: Raft enforces that all changes to the replicated state machine are driven through the leader. This leader is elected by a majority of the cluster and handles all client requests for modifications (writes).
  2. Log Matching Property: Raft ensures that the leader’s log is replicated exactly on a majority of the servers before any log entry is committed. This means that once an entry is marked as committed, it has been replicated on the majority of the nodes.
  3. Commitment Rules: An entry from the leader’s current term that is stored on a majority of servers is considered committed. This is critical because it means any committed entry will be present in any future leaders’ logs, thereby providing continuity and consistency through leader changes.
  4. Reads Are Handled by the Leader: By default, all reads are handled by the leader. This ensures that any read request returns the latest committed state. Since the leader is the only node that drives the appending of new entries and commits entries to the state machine, it always has the most up-to-date and consistent view of the state.

While Raft ensures that the leader commits entries only after these entries are replicated on a majority of nodes, it doesn’t require all nodes to be up-to-date for the following reasons:

  • Fault Tolerance: Raft is designed to handle failures. Requiring all nodes to acknowledge before committing would diminish the system’s availability and fault tolerance. If even one node were to fail or become unreachable, the system would not be able to proceed with any updates.
  • Efficiency: Communicating with a majority rather than all nodes reduces the coordination overhead and allows the system to make progress even in the presence of network delays or node failures.

Nodes that do not form part of the majority during a particular commit phase may lag temporarily but are eventually brought up to date through Raft’s log replication process. The leader continuously attempts to replicate all log entries to all followers. If a follower is behind, the leader will send it the missing entries to catch up. This ensures that even if a node missed earlier commit rounds, it will eventually reflect the same state as the leader.

Raft does provide strong consistency by ensuring that once an operation is committed, all future operations will see the results of that operation. The distinction in Raft’s approach is its reliance on a majority for commitment rather than unanimity, trading some availability for consistency and partition tolerance, in line with the trade-offs described by the CAP theorem. This design makes Raft suitable for systems that require strong consistency and partition tolerance and can accept reduced availability when a majority of nodes is unreachable.

100
Q

What is a coordination service?

A

In distributed systems, a coordination service is a specialized service designed to manage and facilitate the coordination of processes, tasks, and data across multiple nodes in a distributed environment.

  1. Locking and Synchronization:
    • Distributed Locks: Provides mechanisms to synchronize access to shared resources across the network, ensuring that only one process can execute a critical section at a time.
    • Barriers: Allow multiple processes to wait at a certain point until all have reached this barrier, facilitating synchronized start or completion of operations.
  2. Leadership Election:
    • Many distributed applications require a leader that coordinates actions among participants. Coordination services can manage the election process, ensuring that there is always a designated leader, and handling automatic failover if the leader fails.
  3. Metadata and Configuration Management:
    • Coordination services often manage configuration information that must be distributed and synchronized across various components in the system, such as service discovery data, configuration settings, and cluster membership information.
  4. State Coordination:
    • Maintain and coordinate state information that needs to be consistent across the system, such as user sessions, application state, or distributed counters.
Common implementations:
  1. Apache ZooKeeper:
    • A high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don’t have to write them from scratch. ZooKeeper can maintain configuration information, provide distributed synchronization, and group services. All these kinds of services are used in some form or another by distributed applications.
  2. etcd:
    • A distributed reliable key-value store that is often used for shared configuration and service discovery. etcd is built on the Raft consensus algorithm, which helps nodes in the etcd cluster to agree on updates to the store, ensuring strong consistency across all nodes.
  3. Consul:
    • Consul offers capabilities for service discovery, health checking, and key/value storage. It provides a full-featured control plane with service discovery, configuration, and segmentation functionality.
Common use cases:
  • Service Discovery: These services can manage a registry of services, helping applications to discover each other over the network dynamically.
  • Configuration Management: Dynamic management of configuration settings across many servers, allowing changes to be propagated quickly and consistently.
  • Distributed Semaphores/Counters: Managing counters and semaphores that must operate across multiple nodes, ensuring they remain consistent despite failures or concurrency issues.
  • Coordination of Clustered Jobs and Tasks: Managing and coordinating batch processing or real-time computation jobs across a cluster of machines.

Coordination services play a pivotal role in building reliable, scalable, and maintainable distributed systems by providing essential mechanisms for managing the complexities of distributed computing. They help in reducing the development complexity of distributed applications by offering out-of-the-box solutions to common problems such as synchronization, state management, and configuration distribution, thus enabling developers to focus more on application-specific logic rather than the intricacies of the underlying system coordination.
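Returning to the distributed-lock capability listed above, here is a minimal sketch using the kazoo client for Apache ZooKeeper. The host string, lock path, lock identifier, and the rebalance_inventory() function are placeholders for illustration only.

from kazoo.client import KazooClient

def rebalance_inventory():
    print("doing work that must only run on one node at a time")   # stand-in for real logic

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")                # placeholder host string
zk.start()

lock = zk.Lock("/locks/inventory-rebalance", "worker-42")           # path and identifier are placeholders
with lock:                                                          # blocks until this client holds the lock
    rebalance_inventory()

zk.stop()

Only one client across the whole cluster can be inside the with-block at a time, which is exactly the distributed-lock guarantee described above.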

101
Q

What is stored in a coordination service?

A
  1. Configuration Store:
    • Purpose: Stores configuration data - settings for different nodes, services, and possibly client-specific configurations.
    • Functionality: Allows for dynamic changes to configuration without needing to restart services or nodes, enabling on-the-fly reconfiguration of the distributed system.
  2. State Store:
    • Purpose: This could include session states, application states, or states of various distributed algorithms (e.g., leader election, consensus states).
    • Functionality: Provides mechanisms to retrieve and update state consistently across nodes, often utilizing consensus algorithms to ensure that all changes are agreed upon across the cluster.
  3. Metadata Store:
    • Purpose: Keeps metadata about the system itself, such as information about the nodes in the cluster, network topology, service locations, and health status.
    • Functionality: Essential for service discovery mechanisms, load balancing, and managing network topology changes dynamically.
  4. Lock Store:
    • Purpose: Manages locks and other synchronization primitives that are used by distributed applications to coordinate access to shared resources.
    • Functionality: Implements distributed locking protocols that ensure that only one client or node can access a critical section at a time, preventing race conditions and data corruption.
Examples:
  • Apache ZooKeeper: Uses a znode tree (hierarchical namespace) to store data. Each znode can store data up to a certain size (default is 1MB) and can also have children znodes, making it suitable for configuration, metadata, and state storage. ZooKeeper ensures that this data is replicated across all nodes in the ensemble, providing high availability and consistency.
  • etcd: Provides a reliable key-value store that is used primarily for shared configuration and service discovery. It uses the Raft consensus algorithm to keep data consistent across a cluster of nodes. The data stored in etcd can be anything from URLs of services, configuration parameters, to scheduled jobs and leader election tokens.
  • Consul: Offers a key-value store as part of its service discovery system, along with health checking. The store is used for configuration data and can be accessed via HTTP API, providing a flexible and easy mechanism for storing and retrieving configuration across services.
102
Q

Give me some examples of linearizable databases

A

Linearizability in databases refers to a consistency model where once a write operation completes, any subsequent read operation will reflect that write or others that followed. This strong form of consistency ensures that the database behaves as if all operations are occurring instantaneously at some point between their start and finish times. Here are some examples of databases that support or can be configured to support linearizable consistency:

  1. Google Spanner:
    • Overview: Google Spanner combines the benefits of traditional relational databases with the scalability of NoSQL databases. It supports globally-distributed databases and applications, providing a unique feature of external consistency across globally distributed data.
    • Linearizability: Spanner uses TrueTime API which gives it an advantage in implementing external consistency (a stricter form of linearizability). TrueTime relies on GPS and atomic clocks to provide precise time stamps, which Spanner uses to ensure that transactions are executed in a strictly serial order.
  2. Apache Cassandra (with Lightweight Transactions):
    • Overview: Apache Cassandra is a highly scalable NoSQL database known for its high write availability and partition tolerance. It traditionally offers eventual consistency but can be configured for stronger consistency levels.
    • Linearizability: By using Lightweight Transactions which are based on the Paxos consensus protocol, Cassandra can achieve linearizable consistency for writes. These transactions ensure that, for any given piece of data, only one write succeeds if multiple clients are writing simultaneously.
  3. etcd:
    • Overview: etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. It’s particularly well-suited for managing configuration and state in clustered systems.
    • Linearizability: etcd guarantees that its reads and writes are linearizable. It uses the Raft consensus algorithm to ensure that all operations are agreed upon by a majority of nodes before they are committed, providing strong consistency.
  4. RethinkDB:
    • Overview: RethinkDB is an open-source, scalable database that stores JSON documents and provides real-time push capabilities.
    • Linearizability: It offers linearizable consistency for immediate consistency needs on single documents via its “majority” write acknowledgments and read settings.
  5. FoundationDB:
    • Overview: FoundationDB is a distributed database designed to handle large volumes of structured data across clusters of commodity servers. It integrates the NoSQL capabilities with the transactional integrity of SQL.
    • Linearizability: FoundationDB supports fully ACID transactions and can be configured to provide linearizable consistency, ensuring that all clients see a consistent and up-to-date view of the database.
  6. CockroachDB:
    • Overview: CockroachDB is a SQL database that automates scaling and recovery, supports ACID transactions, and provides a strong consistency model.
    • Linearizability: It offers serializable transactions by default (the strongest isolation level); reads and writes on a single key are linearizable, although CockroachDB does not guarantee strict serializability across transactions that touch disjoint keys.

These databases each provide mechanisms to achieve linearizability, either as a core feature of their design or as an optional configuration, making them suitable for applications where strong consistency is necessary to ensure data integrity and correct application behavior.

103
Q

How to read from followers in a Raft cluster?

A

Reading from followers in a Raft cluster can help distribute the read load and improve the scalability of the system, especially useful when the leader node might become a bottleneck due to heavy read requests. However, allowing reads from followers introduces challenges related to ensuring that the data read is consistent and up-to-date. To achieve this, several strategies can be implemented, each with its own trade-offs in terms of consistency, latency, and complexity.

  1. Read-Index Protocol:
    • Concept: The Read-Index protocol is a way to allow linearizable reads on followers without reading from the leader directly. It ensures that the follower’s log is up-to-date by making use of the leader’s knowledge about the committed log index.
    • Configuration:
      • The follower node that receives a read request sends a ReadIndex request to the leader.
      • The leader confirms that it is still the leader (typically by exchanging a round of heartbeats with a majority of the cluster) and returns its current commit index.
      • The follower waits until its log is at least as up-to-date as the commit index it received. Once it is, the follower knows that it reflects all entries committed in the cluster and can safely serve the read request.
  2. Lease-Based Read:
    • Concept: In this approach, the leader grants leases to the followers, allowing them to serve reads for a specified duration during which the leader promises not to revoke its leadership (unless it loses contact with a majority of the cluster).
    • Configuration:
      • The leader periodically sends heartbeat messages to maintain its authority and renew the lease terms.
      • Followers are configured to consider the lease valid only if they keep receiving these heartbeats within the expected intervals.
      • During the validity of the lease, followers can serve read requests knowing that they are seeing up-to-date data.
  3. Stale Reads:
    • Concept: If the application can tolerate stale or eventually consistent reads, followers can serve reads directly from their current state without extra coordination with the leader.
    • Configuration:
      • No additional specific configuration on the Raft protocol is required, but the application logic needs to be aware that the data may be out of date.
Considerations:
  • Ensuring Linearizability: The Read-Index and Lease-Based Read strategies provide mechanisms to ensure linearizable reads from followers. This is crucial for applications where reading stale data could lead to incorrect operations or inconsistencies.
  • Handling Network Partitions: In cases of network partitions, the lease mechanism helps prevent split-brain scenarios where more than one node might believe it is the leader. This is crucial for maintaining the integrity of the data.
  • Performance and Latency: While reading from followers can offload the leader, the mechanisms to ensure consistency (like Read-Index or leases) might introduce additional latency due to the round-trip requests to the leader or waiting for the log to catch up.

In practice, deploying these configurations in a Raft-based system requires careful planning and testing, especially concerning the consistency requirements of your application and the expected read/write load. Monitoring and tuning may be necessary to balance between consistency, availability, and performance. Tools and systems built on Raft, such as etcd or Consul, often provide built-in support for configuring and managing these read strategies.
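As a rough illustration of the Read-Index flow described above, here is a hedged Python sketch. The leader and follower objects and their methods are hypothetical stand-ins, not the API of etcd, Consul, or any real Raft library.

import time

def linearizable_read_from_follower(follower, leader, key, poll_interval=0.01):
    # 1. Ask the leader for its commit index; the leader answers only after
    #    confirming it is still the leader (e.g., via a heartbeat round).
    read_index = leader.confirm_leadership_and_get_commit_index()

    # 2. Wait until this follower has applied at least up to that index, so its
    #    state reflects every entry committed before the read began.
    while follower.last_applied_index() < read_index:
        time.sleep(poll_interval)

    # 3. Now it is safe to serve the read locally.
    return follower.state_machine_get(key)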

104
Q

How does Raft achieve distributed consensus on writes?

A

Raft replicates changes to followers through a structured log replication process, which is a core component of ensuring consistency and durability in a distributed system. The replication mechanism in Raft ensures that all committed entries are consistently stored across a majority of nodes, maintaining the integrity and availability of data even in the face of failures. Here’s a detailed breakdown of how Raft replicates changes to followers:

  • Initial Step: Raft begins by electing a leader. This happens whenever a cluster starts up or if the existing leader fails or becomes unreachable. The election process uses randomized election timeouts to prevent split votes and ensure that a single node is designated as the leader.
  • Client Requests: When clients make requests that involve changes (e.g., writing data), these requests are sent directly to the leader. The leader then appends the new entries representing these changes to its local log.
  • AppendEntries RPC: The leader begins the replication process by sending AppendEntries RPCs to all follower nodes. This RPC contains:
    • Log Entries: The entries that need to be replicated, which have not yet been stored on the followers.
    • PrevLogIndex and PrevLogTerm: Information about the log entry immediately preceding the new entries, which is used by followers to ensure log consistency.
    • Leader’s Commit Index: The index of the highest log entry known to be committed. This helps followers know up to which entry they can safely apply log entries to their state machines.
  • Follower Response: Each follower receives the AppendEntries RPC and first checks if its log contains an entry at PrevLogIndex whose term matches PrevLogTerm.
    • If it matches, the follower appends any new entries sent by the leader to its log.
    • If it does not match, the follower rejects the AppendEntries RPC and informs the leader of the discrepancy.
  • Handling Inconsistencies: If a follower’s log is inconsistent with the leader’s log, the leader adjusts the PrevLogIndex it sends in subsequent AppendEntries RPCs to find the point of agreement. This may involve the leader decrementing the index until the logs match.
  • Majority Acknowledgement: An entry is considered committed when the leader has replicated it on a majority of nodes (including itself). The leader keeps track of the highest index that has been replicated on each follower to determine when it has safely achieved a majority.
  • Applying to State Machine: Once an entry is committed, the leader broadcasts a new AppendEntries RPC with an updated commit index. Followers apply entries up to the commit index to their state machines.
  • Confirmation of Operations: After entries are committed and applied to the state machine, the leader sends responses back to the clients confirming the execution of their requests.
  • Robust Against Failures: Raft’s replication process is designed to handle node failures. If a follower fails during replication, the leader continually retries sending AppendEntries RPCs until the follower is restored and catches up.
  • Leader Changes: If a leader fails, the election process creates a new leader, which then resumes the task of log replication, ensuring that the system continues to operate smoothly.

This log replication mechanism forms the backbone of Raft’s ability to provide a robust, consistent, and fault-tolerant distributed consensus. By ensuring that all committed changes are replicated across a majority of nodes, Raft guarantees that the system can recover from failures without losing data.
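A small sketch of the majority-acknowledgement rule described above: given each node’s highest replicated index, the leader advances its commitIndex to the highest index stored on a majority. The match_index map and names are illustrative, and a real implementation must also check that the candidate entry is from the leader’s current term, which this sketch omits.

def compute_commit_index(match_index: dict, cluster_size: int, current_commit: int) -> int:
    """Return the highest log index replicated on a majority of the cluster."""
    majority = cluster_size // 2 + 1
    # Consider every index some node has reached, highest first.
    for candidate in sorted(set(match_index.values()), reverse=True):
        replicas = sum(1 for idx in match_index.values() if idx >= candidate)
        if replicas >= majority and candidate > current_commit:
            return candidate
    return current_commit

# Example: 5-node cluster, leader "a" counts itself.
match_index = {"a": 7, "b": 7, "c": 6, "d": 4, "e": 2}
print(compute_commit_index(match_index, cluster_size=5, current_commit=5))  # -> 6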

105
Q

Relational database versus non-relational database

A

Many-to-many relationships;
Joins: cross-node/partition reads and writes;
Data locality; data denormalization.

Traditionally most NoSQL databases are non-relational, but that’s not always the case.

106
Q

Challenges with cross-partition/node joins in a relational database

A

Join operations across different partitions or nodes in a relational database can indeed be slower compared to joins that occur within a single node or partition. This decrease in performance is due to several factors associated with the distribution of data across a networked environment. Here’s a detailed exploration of why these joins are typically slower and the challenges involved:

  1. Data Distribution:
    • Location Transparency: In distributed databases, data is spread across different nodes or partitions, often based on sharding or partitioning criteria that optimize for data locality and access patterns. When a join operation involves tables located on different nodes, the database system must locate and access this data across the network.
    • Non-Colocated Joins: If the joining keys or the related data are not colocated on the same node, the database must perform additional steps to bring the data together, which typically involves significant network I/O.
  2. Network Overhead:
    • Data Transfer Costs: Moving large datasets across network links for the purpose of joins is costly in terms of bandwidth and time. This data movement can introduce significant latency.
    • Network Latency and Reliability: Network speed and reliability also play a crucial role. Inconsistent network behavior can exacerbate the performance issues, making the joins not only slower but also less predictable in performance.
  3. Serialization and Deserialization:
    • Data Processing: Data must be serialized (converted into a format suitable for transfer) before being sent over the network and then deserialized on the receiving end. These operations add computational overhead and delay.

Despite the inherent challenges, several strategies can help mitigate the impact of these performance issues in distributed joins:

  1. Query Optimization:
    • Join Algorithms: Modern distributed databases employ advanced join algorithms like hash joins, merge joins, or nested loop joins that are optimized for distributed environments.
    • Query Planning: SQL query planners in distributed databases can optimize how and where joins are executed, sometimes rearranging joins or pushing down join operations to where the data resides.
  2. Data Locality Improvements:
    • Colocation of Related Data: Whenever possible, data that is frequently joined together can be colocated on the same node. This practice reduces the need to transfer data between nodes during join operations.
    • Partitioning Strategy: Effective partitioning strategies that group related data together based on common join keys or query patterns can significantly reduce cross-node joins.
  3. Caching Mechanisms:
    • Local Caching of Frequently Accessed Data: Caching mechanisms can store recently or frequently accessed data locally. This can reduce the need for repeated data fetches across the network for common join operations.
  4. Network Enhancements:
    • High-Speed Networks: Deploying high-speed networking infrastructure like InfiniBand or upgrading to faster Ethernet standards can reduce data transfer times significantly.
    • Dedicated Network Resources: Allocating dedicated network resources for database communication can help avoid network contention and variability.

While joins across different partitions or nodes in a relational database are inherently slower due to the complexities of distributed data management, careful design choices regarding data placement, network architecture, and query optimization can significantly improve performance. Understanding and addressing these factors are crucial for maintaining efficient operations in distributed database environments.
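One common distributed join strategy that follows from the data-movement concerns above is a broadcast (replicated) join: when one side of the join is small, ship it to every node so each partition joins locally instead of shuffling the large table across the network. A toy Python sketch with illustrative data shapes:

def broadcast_join(large_table_partitions, small_table, key_large, key_small):
    """large_table_partitions: list of per-node partitions (lists of dicts);
    small_table: list of dicts small enough to copy to every node."""
    # Build a hash map of the small table once; conceptually this map is what
    # gets broadcast to every node holding a partition of the large table.
    lookup = {row[key_small]: row for row in small_table}

    joined = []
    for partition in large_table_partitions:        # runs locally on each node
        for row in partition:
            match = lookup.get(row[key_large])
            if match is not None:
                joined.append({**row, **match})
    return joined

orders = [[{"order_id": 1, "cust_id": "a"}], [{"order_id": 2, "cust_id": "b"}]]
customers = [{"cust_id": "a", "name": "Ada"}, {"cust_id": "b", "name": "Bo"}]
print(broadcast_join(orders, customers, "cust_id", "cust_id"))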

107
Q

Diff btw non-repeatable read and phantom read

A
Non-Repeatable Read:
  • Definition: A non-repeatable read occurs when a transaction reads the same row twice and finds different data each time. This happens because another transaction modifies the data after the first read but before the second read, and commits the change.
  • Example: Consider a transaction ( T1 ) that reads a row from a table. If another transaction ( T2 ) modifies or updates that row and commits the change after ( T1 )’s first read but before ( T1 ) reads the row again, ( T1 ) will see different data in its second read. This inconsistency in ( T1 )’s view of the data during the same transaction leads to a non-repeatable read.
  • Cause: This type of read anomaly is typical in isolation levels that allow concurrent writes, such as the Read Committed isolation level in SQL databases.
Phantom Read:
  • Definition: A phantom read occurs when a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently committed transaction. This anomaly is a result of new rows being added or existing rows being deleted by another transaction after the initial read.
  • Example: Suppose a transaction ( T1 ) performs a query to select rows where the value of a column is less than 10. ( T1 ) then repeats the query later in the transaction. If another transaction ( T2 ) adds a new row or deletes an existing row that would appear in ( T1 )’s query results in between ( T1 )’s queries and commits, then the subsequent execution of the same query by ( T1 ) will include or exclude different rows, thus leading to a phantom read.
  • Cause: Phantom reads are generally observed under isolation levels such as Read Committed and Repeatable Read, but can be prevented by using Serializable isolation, which effectively locks the relevant range of the table.
How isolation levels relate:
  • Read Committed: Prevents dirty reads but allows non-repeatable reads and phantom reads. Only committed data can be read, and each read operation retrieves the most recently committed version.
  • Repeatable Read: Ensures that data read once in a transaction can be re-read repeatedly and will remain unchanged for the duration of the transaction, thereby preventing non-repeatable reads. However, it does not prevent phantom reads because it typically only locks individual rows.
  • Serializable: This is the highest level of isolation. It prevents dirty reads, non-repeatable reads, and phantom reads by performing range locking where necessary, thus serializing the transactions.
108
Q

A summary of dirty reads, read skew and phantom reads and how to prevent them

A

Dirty Reads:

Definition:
- Dirty reads occur when a transaction reads data that has been modified by another transaction but not yet committed. If the other transaction is rolled back, the read data becomes invalid.

Prevention Mechanism:
- Read Committed Isolation Level: This isolation level ensures that transactions can only read data that has been committed by other transactions, preventing dirty reads.

Implementation:
- Implemented by locking the data until the transaction modifying it is committed. This prevents other transactions from reading uncommitted changes.

Read Skew (Non-Repeatable Reads):

Definition:
- Read skew occurs when a transaction reads the same row twice and gets different values because another transaction has modified and committed the row in the meantime.

Prevention Mechanism:
- Repeatable Read Isolation Level: This isolation level ensures that if a transaction reads a row, it will always see the same data for that row if it reads it again within the same transaction.

Implementation:
- Implemented by locking the data when it is read, preventing other transactions from modifying it until the current transaction is complete.

Phantom Reads:

Definition:
- Phantom reads occur when a transaction re-executes a query returning a set of rows that satisfies a condition and finds that the set of rows has changed due to another recently committed transaction.

Prevention Mechanism:
- Serializable Isolation Level: This isolation level ensures complete isolation from other transactions, preventing any other transactions from inserting, updating, or deleting rows that would match the query’s condition until the current transaction is complete.

Implementation:
- Implemented by locking the range of rows being queried, ensuring no other transactions can alter the dataset that satisfies the condition of the initial query.

  1. Dirty Reads Prevention:
    • Transactions read only committed data.
    • When a transaction modifies data, it locks the data until the transaction is committed.
    • Other transactions attempting to read this data must wait until the lock is released, ensuring they only read committed data.
  2. Read Skew (Non-Repeatable Reads) Prevention:
    • When a transaction reads data, it places a shared lock on the data.
    • This lock ensures that other transactions can read the data but cannot modify it until the initial transaction completes.
    • This ensures that if the same transaction reads the data again, it will see the same values, preventing non-repeatable reads.
  3. Phantom Reads Prevention:
    • Transactions operate in a completely isolated manner.
    • When a transaction queries a set of rows, it locks the range of data that satisfies the query conditions.
    • This prevents other transactions from inserting, updating, or deleting rows that would alter the result set of the query until the initial transaction is completed, preventing phantom reads.

By using these isolation levels, databases ensure data consistency and integrity by preventing various types of read anomalies.

109
Q

A summary of dirty writes, write skew and phantom writes

A

Dirty Writes:

Definition:
- Dirty writes occur when a transaction overwrites data that has been modified by another transaction but not yet committed. This can lead to inconsistent data if the first transaction is rolled back.

Prevention Mechanism:
- Write Locking: Ensure that once a transaction modifies data, other transactions cannot write to the same data until the first transaction is committed.

Implementation:
- Implemented by using exclusive locks on the data when a transaction writes to it. Other transactions attempting to write to the same data must wait until the lock is released.

Write Skew:

Definition:
- Write skew occurs in a multi-transaction system where two transactions read overlapping data and then proceed to write based on their read, potentially leading to inconsistent states.

Prevention Mechanism:
- Serializable Isolation Level: This isolation level ensures transactions are executed in a way that the outcome is equivalent to executing them sequentially.

Implementation:
- Implemented by checking the read and write sets of transactions and ensuring that any potential conflicts are detected and managed, usually by aborting one of the conflicting transactions and retrying it.

Phantom Writes:

Definition:
- Phantom writes occur when a transaction inserts, updates, or deletes rows that would alter the result set of a previously executed query by another transaction.

Prevention Mechanism:
- Serializable Isolation Level: This isolation level ensures complete isolation from other transactions, preventing any operations that would affect the dataset of a query until the transaction is completed.

Implementation:
- Implemented by locking the range of rows affected by the query condition, ensuring no other transactions can insert, update, or delete rows that would change the result set until the transaction completes.

  1. Dirty Writes Prevention:
    • When a transaction writes to data, it places an exclusive lock on the data.
    • This exclusive lock ensures that no other transactions can write to the same data until the initial transaction is committed or rolled back.
    • By preventing concurrent writes, it ensures data consistency and avoids dirty writes.
  2. Write Skew Prevention:
    • When a transaction reads data, it notes the data’s state and places appropriate locks.
    • Before committing, the system checks if any other transaction has written to the same data.
    • If a conflict is detected (i.e., another transaction has modified the data), one of the transactions is aborted and must retry.
    • This ensures the final state is consistent and equivalent to some sequential execution of the transactions.
  3. Phantom Writes Prevention:
    • When a transaction queries a set of rows, it locks the range of data that satisfies the query conditions.
    • These locks prevent other transactions from inserting, updating, or deleting rows that would alter the result set of the initial query.
    • By locking the relevant data range, it ensures that the result set remains consistent and prevents phantom writes.

Using these mechanisms and isolation levels, databases can ensure data consistency and integrity by preventing various types of write anomalies.
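The write-skew prevention described above (recording read and write sets and aborting on conflict) can be sketched as optimistic validation against record versions. This is a simplified illustration of the idea, not a production concurrency-control implementation.

class VersionedStore:
    def __init__(self):
        self.data = {}                      # key -> (value, version)

    def read(self, key, read_set):
        value, version = self.data.get(key, (None, 0))
        read_set[key] = version             # remember the version this transaction saw
        return value

    def commit(self, read_set, writes):
        # Validate: abort if anything we read has since been overwritten.
        for key, seen_version in read_set.items():
            _, current_version = self.data.get(key, (None, 0))
            if current_version != seen_version:
                raise RuntimeError("conflict detected: abort and retry the transaction")
        # Apply writes, bumping each record's version.
        for key, value in writes.items():
            _, version = self.data.get(key, (None, 0))
            self.data[key] = (value, version + 1)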

110
Q

MySQL versus PostgreSQL on isolation

A

The major differences between PostgreSQL and MySQL in terms of isolation levels primarily revolve around their implementation of isolation mechanisms, default behaviors, and supported features. Here’s a comparison:

PostgreSQL:

  1. Default Isolation Level:
    • PostgreSQL uses Read Committed as its default isolation level.
  2. Supported Isolation Levels:
    • Read Uncommitted: PostgreSQL does not directly support this isolation level.
    • Read Committed: Ensures that any data read during a transaction is committed at the moment it is read. This is the default.
    • Repeatable Read: Ensures that if a transaction reads a row, it will see the same data if it reads that row again within the same transaction, preventing non-repeatable reads.
    • Serializable: Provides full serializability using a technique called Serializable Snapshot Isolation (SSI). This level prevents all read and write anomalies, ensuring transactions are serializable without using traditional two-phase locking.
  3. Concurrency Control:
    • PostgreSQL primarily uses Multiversion Concurrency Control (MVCC), which allows readers to access a snapshot of the database at a point in time without being blocked by writers. Each transaction sees a consistent snapshot of the database, providing high concurrency and performance.
  4. Serializable Snapshot Isolation (SSI):
    • An advanced form of serializable isolation that detects conflicts and ensures serializable transactions without requiring extensive locking, reducing the likelihood of deadlocks and enhancing performance.

MySQL:

  1. Default Isolation Level:
    • MySQL uses Repeatable Read as its default isolation level.
  2. Supported Isolation Levels:
    • Read Uncommitted: Allows transactions to read uncommitted changes made by other transactions, which can lead to dirty reads.
    • Read Committed: Ensures that any data read during a transaction is committed at the moment it is read, preventing dirty reads.
    • Repeatable Read: Ensures that if a transaction reads a row, it will see the same data if it reads that row again within the same transaction, preventing non-repeatable reads. This is the default.
    • Serializable: The highest isolation level, ensuring full serializability by locking the entire range of data being accessed, thus preventing all read and write anomalies. This can lead to higher contention and reduced concurrency.
  3. Concurrency Control:
    • MySQL also uses MVCC, particularly in the InnoDB storage engine, to provide non-blocking reads and support high concurrency.
  4. Differences in Implementation:
    • MySQL’s implementation of Repeatable Read differs from PostgreSQL’s as it maintains consistent snapshots of the database for the duration of a transaction, ensuring repeatable reads without phantom reads.
    • MySQL’s Serializable isolation level is more traditional, using extensive locking to ensure serializable transactions, which can lead to higher contention and more frequent deadlocks compared to PostgreSQL’s SSI.

Key Differences:

  1. Default Isolation Level:
    • PostgreSQL: Read Committed.
    • MySQL: Repeatable Read.
  2. Serializable Isolation:
    • PostgreSQL uses Serializable Snapshot Isolation (SSI) to provide serializable transactions without heavy locking.
    • MySQL uses traditional locking mechanisms, which can result in higher contention and deadlocks.
  3. Support for Read Uncommitted:
    • PostgreSQL does not support Read Uncommitted directly.
    • MySQL supports Read Uncommitted, which allows for higher concurrency at the cost of potential dirty reads.
  4. MVCC Implementation:
    • Both databases use MVCC for high concurrency, but PostgreSQL’s MVCC implementation is known for providing better performance and consistency in complex transactional workloads.

By understanding these differences, you can make an informed decision on which database system better suits your specific application requirements and concurrency needs.
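As a small illustration of choosing an isolation level in practice, here is a hedged Python sketch that issues the standard SQL statements through an already-open DB-API cursor. The connection setup is omitted, the accounts table is hypothetical, and drivers differ in exactly when the statement must be issued.

def read_with_serializable_postgres(cursor):
    # PostgreSQL: applies to the current transaction; must run before its first query.
    cursor.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
    cursor.execute("SELECT balance FROM accounts WHERE id = %s", (1,))
    return cursor.fetchone()

def use_repeatable_read_mysql(cursor):
    # MySQL/InnoDB: applies to transactions started after this statement in the session.
    cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")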

111
Q

New types of Relational DB - VoltDB

A

Initially released in 2010.

  1. In-memory hash index by default - not durable on its own; WAL is optional; a tree index is optional. Less data per node means more partitions, and therefore more cross-partition reads and writes (which require distributed coordination such as two-phase commit).
  2. Actual serial execution with a single thread per partition; the bottleneck is the CPU.
  3. Minimizes network round-trips by running transactions as stored procedures.

——————

In-memory relational database designed for real-time analytics and high-velocity transactional workloads.

It combines in-memory storage with ACID compliance, providing fast data processing and strong consistency.

VoltDB supports distributed architecture, enabling horizontal scaling and fault tolerance. It is particularly suited for applications requiring low-latency operations, such as financial services, telecommunications, and IoT.

112
Q

Google spanner

A

Optimizes read-only transactions by avoiding two-phase commit while still satisfying linearizability.

Linearizability depends on real time, not just logical order, so Lamport clocks are not sufficient.

Clock uncertainty period: Spanner waits out the uncertainty interval (commit wait) so that the commit timestamps of causally ordered transactions do not overlap.

TrueTime is backed by GPS and atomic clocks.
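The "ensure no overlap" idea can be sketched as commit wait: take the commit timestamp from the upper bound of the clock-uncertainty interval and wait until that instant has certainly passed before acknowledging the commit. The TrueTime class below is a hypothetical stand-in for illustration, not Google's API.

import time

class TrueTime:
    EPSILON = 0.004  # assumed clock uncertainty in seconds (illustrative)

    @staticmethod
    def now():
        t = time.time()
        return (t - TrueTime.EPSILON, t + TrueTime.EPSILON)   # (earliest, latest)

def commit_with_wait(apply_changes):
    earliest, latest = TrueTime.now()
    commit_ts = latest                   # a timestamp at or after the true current time
    apply_changes(commit_ts)
    # Commit wait: block until every clock agrees commit_ts is in the past, so any
    # transaction that starts afterwards observes a strictly later timestamp.
    while TrueTime.now()[0] < commit_ts:
        time.sleep(0.001)
    return commit_ts

commit_with_wait(lambda ts: print("applied at", ts))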

113
Q

Diff btw making a change and committing a change in a database

A
Making a Change (Logical Level):
  • Definition: At the logical level, making a change refers to executing a command that modifies the data in the database. This could be an insert, update, or delete operation.
  • Temporary State: The change is made in the context of a transaction and is only visible to the transaction that made it. Other transactions cannot see this change until it is committed.
  • Isolation: The change is isolated from other transactions. If the transaction making the change is rolled back, the change will be discarded, and the database will remain unchanged.

Example:
- BEGIN TRANSACTION;
- UPDATE Accounts SET balance = balance + 100 WHERE account_id = 1;
- The balance is increased by 100 for account_id 1, but this change is not yet visible to other transactions.

Committing a Change (Logical Level):
  • Definition: Committing a change means making all the modifications made in the current transaction permanent and visible to all other transactions. This finalizes the transaction.
  • Permanent State: Once a transaction is committed, all the changes are permanently applied to the database.
  • Visibility: The changes become visible to all other transactions, ensuring consistency and durability.

Example:
- COMMIT;
- All changes made since the beginning of the transaction are now permanent and visible to other transactions.

Making a Change (Physical Level):
  • In-Memory Modifications: Changes are initially made in the database’s in-memory structures. These are often buffers or caches.
  • Write-Ahead Logging (WAL): Before making changes to the actual data files, databases typically write the change to a log file to ensure that changes can be recovered in case of a crash. This log entry is not yet considered committed.
  • Buffer Cache: The changes are kept in the buffer cache (RAM) and not yet written to the disk. This allows for quick access and manipulation.

Example:
- The updated balance is recorded in memory and a log entry is created but not yet marked as committed.

Committing the Change (Physical Level):
  • Log Flush: The database flushes the log entries to disk to ensure that all changes are durable. This step is crucial for crash recovery.
  • Data Page Flush: The actual data pages modified in memory are written to disk. This ensures that the data files on disk reflect the changes made by the transaction.
  • Transaction Marking: The transaction is marked as committed in the transaction log. This marking ensures that during recovery, the system can differentiate between committed and uncommitted transactions.

Example:
- The WAL is flushed to disk, ensuring durability.
- The modified data pages are written to disk.
- The transaction is marked as committed in the log.

Summary:
  • Logical Level:
    • Making a Change: Temporary and isolated from other transactions. Visible only within the transaction.
    • Committing the Change: Makes changes permanent and visible to all transactions. Finalizes the transaction.
  • Physical Level:
    • Making a Change: Changes are made in memory and logged in the WAL but not yet marked as committed.
    • Committing the Change: Ensures all changes are written to disk, flushes the WAL, and marks the transaction as committed in the log.

Understanding these distinctions is crucial for appreciating how databases maintain consistency, durability, and isolation in the face of concurrent transactions and potential system failures.
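A toy Python sketch of the distinction above: "making a change" only updates the in-memory buffer and appends an uncommitted WAL record, while "commit" appends a commit marker and flushes the log. This is purely illustrative and not how any particular engine is implemented.

class ToyDatabase:
    def __init__(self):
        self.buffer_cache = {}    # in-memory page/row versions
        self.wal = []             # append-only log; durable only once flushed
        self.flushed_upto = 0

    def make_change(self, txn_id, key, value):
        self.wal.append(("update", txn_id, key, value))   # logged, but not yet committed
        self.buffer_cache[(txn_id, key)] = value          # visible only inside txn_id

    def commit(self, txn_id):
        self.wal.append(("commit", txn_id))
        self.flush_wal()                                  # durability point: log is on disk
        # Data pages can be written back lazily; recovery can redo from the WAL.

    def flush_wal(self):
        self.flushed_upto = len(self.wal)                 # stand-in for fsync of the log file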

114
Q

Difference Between Phantom Reads and Non-Repeatable Reads

A

Difference Between Phantom Reads and Non-Repeatable Reads

Non-Repeatable Reads

Definition:
A non-repeatable read occurs when a transaction reads a row, then another transaction modifies or deletes that row, and the first transaction reads the same row again, getting a different result.

Characteristics:

•	Row-Level Change: The anomaly involves changes to existing rows (updates or deletes).
•	Inconsistent Reads: A transaction reads a row, another transaction modifies or deletes the row, and the first transaction reads the row again, seeing different data or no data.

Example:

•	Transaction T1:
1.	Reads row R1 (value = 100).
•	Transaction T2:
1.	Updates row R1 (value = 200).
2.	Commits the change.
•	Transaction T1:
2.	Reads row R1 again (value = 200).

Impact:

•	T1 sees different values for the same row within a single transaction, leading to inconsistency.

Phantom Reads

Definition:
A phantom read occurs when a transaction executes a query that retrieves a set of rows that satisfy a condition, then another transaction inserts or deletes rows that satisfy the condition, and the first transaction re-executes the query, getting a different set of rows.

Characteristics:

•	Set-Level Change: The anomaly involves changes to the set of rows returned by a query (inserts or deletes).
•	Inconsistent Result Sets: A transaction executes a query that returns a set of rows, another transaction inserts or deletes rows that meet the query’s criteria, and the first transaction re-executes the query, seeing a different set of rows.

Example:

•	Transaction T1:
1.	Executes SELECT * FROM employees WHERE salary > 50000; (returns 5 rows).
•	Transaction T2:
1.	Inserts a new row into the employees table with salary = 60000.
2.	Commits the change.
•	Transaction T1:
2.	Executes SELECT * FROM employees WHERE salary > 50000; again (returns 6 rows).

Impact:

•	T1 sees a different set of rows for the same query within a single transaction, leading to inconsistency.

Summary of Differences

1.	Scope:
•	Non-Repeatable Reads: Involves re-reading the same row and getting different results due to updates or deletes by other transactions.
•	Phantom Reads: Involves re-executing a query and getting a different set of rows due to inserts or deletes by other transactions.
2.	Type of Change:
•	Non-Repeatable Reads: Affects the data within the rows already read (updates or deletes).
•	Phantom Reads: Affects the set of rows returned by a query (inserts or deletes).
3.	Example Scenarios:
•	Non-Repeatable Reads: Reading the same row twice and seeing different values because of an update or delete.
•	Phantom Reads: Running a query twice and seeing a different number of rows because of new inserts or deletes that meet the query criteria.

Preventing These Anomalies

To prevent these anomalies, databases use different isolation levels:

Non-Repeatable Reads

•	Repeatable Read Isolation Level:
•	Ensures that if a transaction reads a row, it will see the same data if it reads that row again within the same transaction, thus preventing non-repeatable reads.

Phantom Reads

•	Serializable Isolation Level:
•	Ensures complete isolation from other transactions, preventing all types of read anomalies, including phantom reads, by locking the range of data being queried and ensuring no other transactions can insert or delete rows that would match the query’s condition until the transaction is complete.
115
Q

Quick overview of MongoDB

A

Document-oriented data model; flexible schema.

1. Indexing in MongoDB:

  • B-tree Indexes: MongoDB primarily uses B-tree indexes for efficient query processing. These indexes are structured to allow quick data retrieval by keeping the data sorted and balanced.
  • Types of Indexes:
    • Single Field Index: Indexes a single field in a document.
    • Compound Index: Indexes multiple fields within a document.
    • Multikey Index: Supports indexing array fields, allowing efficient querying of documents containing arrays.
    • Geospatial Index: Supports queries for geospatial data.
    • Text Index: Enables text search queries.
    • Hashed Index: Distributes data across a shard key, used in sharding.

2. Transactions in MongoDB:

  • Single Document Transactions: MongoDB has always supported atomic operations on a single document, ensuring consistency and isolation at the document level.
  • Multi-Document ACID Transactions: Starting from MongoDB 4.0, multi-document transactions are supported, allowing ACID (Atomicity, Consistency, Isolation, Durability) guarantees across multiple documents and collections within a replica set.
  • Distributed Transactions: MongoDB 4.2 introduced distributed transactions, enabling ACID transactions across multiple shards in a sharded cluster.

3. Replication Model in MongoDB:

  • Replica Sets: MongoDB uses replica sets for replication, ensuring high availability and data redundancy.
    • Primary: The main node that receives all write operations.
    • Secondary: Nodes that replicate data from the primary and can take over if the primary fails.
    • Arbiter: A lightweight node that participates in elections for primary but does not hold data.
  • Automatic Failover: If the primary node fails, an election is held among the secondary nodes to choose a new primary.
  • Read Preference: Allows applications to specify where read operations should be directed (e.g., primary, secondary, nearest).
Summary:
  • Indexes: B-tree, Single Field, Compound, Multikey, Geospatial, Text, and Hashed indexes for efficient data retrieval.
  • Transactions: Single document transactions, multi-document ACID transactions within replica sets, and distributed transactions across shards.
  • Replication Model: Replica sets with primary, secondary, and arbiter nodes, supporting automatic failover and read preferences for high availability and data redundancy.

This system design approach allows MongoDB to be highly flexible, scalable, and resilient, making it suitable for a wide range of applications from simple single-server setups to complex, distributed systems.
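A hedged sketch of a multi-document transaction using pymongo's session API; it assumes MongoDB 4.0+ running as a replica set, and the connection string, database, and collection names are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")   # placeholder URI
accounts = client.bank.accounts                                     # placeholder db/collection

with client.start_session() as session:
    with session.start_transaction():
        # Both updates commit atomically or not at all.
        accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"_id": "bob"}, {"$inc": {"balance": +100}}, session=session)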

116
Q

Cassandra

A
  1. Partition key + clustering (sort) key; everything else is optional
  2. The partition key is used for partitioning
  3. Cluster configuration is shared via gossip
  4. The clustering key acts as a local sort/index within a partition
  5. All reads and writes should go to ONE partition; very little support for distributed transactions
  6. Uses leaderless replication - read repair; anti-entropy
  7. Write conflicts? Last write wins - can lose writes.
  8. LSM tree + SSTables: write-optimized storage
  9. Only row-level locking; roughly read-committed isolation.
117
Q

Riak

A

Uses CRDTs (conflict-free replicated data types) so that concurrent writes on different replicas can be merged automatically without losing updates.
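As a concrete illustration, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs: each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge without losing updates. This is a generic sketch, not Riak's actual data-type implementation.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}                       # replica_id -> count

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)                 # concurrent updates on different replicas
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5             # replicas converge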

118
Q

Comparison btw Apache Thrift and Apache Avro

A

Apache Thrift and Apache Avro are both tools used for data serialization and RPC (Remote Procedure Call) frameworks. They are commonly used in distributed systems for efficient data exchange and service communication. Here’s a detailed comparison of Apache Thrift and Avro:

Overview (Thrift):
  • Purpose: Originally developed by Facebook, Apache Thrift is designed to facilitate communication across different programming languages through efficient serialization and RPC.
  • Features: Supports multiple serialization formats, a wide range of programming languages, and includes a complete RPC framework.
  • Language Support: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and more.

Overview (Avro):
  • Purpose: Developed by the Apache Hadoop project, Avro is primarily designed for data serialization and is particularly well-suited for data-intensive applications and storage in Hadoop.
  • Features: Schema evolution, dynamic typing, and a compact binary format.
  • Language Support: Java, C, C++, C#, Python, Ruby, JavaScript, and more.

Serialization and Schemas (Thrift):
  • Serialization: Uses a compact binary format for efficient serialization, with optional support for JSON.
  • IDL (Interface Definition Language): Thrift defines data types and services in a Thrift IDL file, which is used to generate code for different programming languages.
  • Schema Evolution: Handles schema changes but requires careful management of IDL files to avoid breaking changes.

Serialization and Schemas (Avro):
  • Serialization: Uses a compact binary format optimized for performance and space efficiency. Also supports JSON encoding.
  • Schema: Avro schemas are defined in JSON format. The schema is stored with the serialized data, allowing for schema evolution without the need for separate schema files.
  • Schema Evolution: Avro has strong support for schema evolution, allowing for changes to data structures without breaking compatibility.

RPC (Thrift):
  • RPC Support: Provides a full-featured RPC framework, including client and server libraries for various programming languages.
  • Service Definition: Services are defined in the Thrift IDL, including methods, parameters, and return types.
  • Protocol and Transport: Supports multiple protocols (e.g., binary, compact, JSON) and transports (e.g., sockets, HTTP, in-memory).

RPC (Avro):
  • RPC Support: Avro includes support for RPC, but it is less mature and feature-rich compared to Thrift.
  • Service Definition: Services are defined using Avro protocols, which include request and response message formats.
  • Protocol and Transport: Supports a binary protocol over HTTP, with customizable transport mechanisms.

Ecosystem and Integration (Thrift):
  • Language Flexibility: Extensive support for many programming languages, making it ideal for heterogeneous environments.
  • Integration: Well-suited for applications requiring cross-language communication and integration.

Ecosystem and Integration (Avro):
  • Language Flexibility: Good support for several programming languages, but not as extensive as Thrift.
  • Integration: Commonly used in big data ecosystems, particularly with Hadoop and Kafka, where schema evolution and storage efficiency are critical.

Typical Use Cases (Thrift):
  • Microservices: Ideal for building microservices that need to communicate across different programming languages.
  • Cross-Language RPC: Suitable for systems requiring cross-language RPC with support for various serialization protocols and transport mechanisms.

Typical Use Cases (Avro):
  • Big Data: Well-suited for big data applications, particularly within the Hadoop ecosystem, where schema evolution and efficient data serialization are important.
  • Data Serialization: Excellent for applications needing efficient and compact data serialization with robust schema evolution.

Summary:
  • Apache Thrift: Offers a robust RPC framework and supports a wide range of programming languages, making it ideal for heterogeneous environments requiring cross-language communication. It handles data serialization and RPC through a compact binary format and allows defining services in a Thrift IDL file.
  • Apache Avro: Focuses on data serialization with strong support for schema evolution, making it a preferred choice for big data applications, particularly in the Hadoop ecosystem. Avro uses JSON for schema definitions and stores schemas with the serialized data, facilitating easy schema changes.

Choosing between Thrift and Avro depends on the specific needs of your application:
- Use Apache Thrift if you need extensive cross-language RPC support and a robust framework for defining and managing services.
- Use Apache Avro if your primary requirement is efficient data serialization with strong schema evolution capabilities, especially in a big data context.

119
Q

What is Hadoop

A
  1. Useful for data storage - HDFS
  2. Useful for computing - MapReduce & Spark
120
Q

HDFS

A

Name node - metadata store.
* In-memory info about which data nodes hold the replicas of each file/block.
* WAL for recovery;
* periodic snapshots of the metadata plus the WAL position/index, so recovery does not need to replay the whole log.

When the name node starts up, it asks each data node which blocks it contains and re-replicates files if necessary. The number of replicas per file is configurable.

Read: the client asks the name node for the file location; the name node responds with the best replica (e.g., the closest one) for lower latency.

Write: when picking the location of replicas, the name node tries to be rack-aware and picks a primary node. The name node also coordinates replication; the block data itself is then pipelined between data nodes.

121
Q

What is HBase

A

Database built on top of Hadoop.

Allows for quick writes/key updates via LSM trees.

Allows for good batch processing abilities due to data locality via column oriented storage or range based partitioning.

Master node (HMaster); region servers; HDFS data nodes.
Replication pipeline.

122
Q

Comparison btw HBase and Cassandra

A

Both HBase and Cassandra are distributed NoSQL databases designed to handle large-scale data, but they have different architectures, strengths, and use cases. Here’s a detailed comparison to help determine when to use each:

Data Model

HBase:
- Column-Oriented: HBase is a column-family store, where data is organized into tables, rows, and column families.
- Schema Flexibility: Allows dynamic addition of columns, but schema changes at the column family level can be more rigid.
- Data Model: Suitable for wide tables with many columns and sparse data.

Cassandra:
- Wide-Column Store: Similar to HBase, but with a more flexible schema that allows dynamic addition of columns and column families.
- Schema Flexibility: Schema changes are more flexible compared to HBase.
- Data Model: Designed for wide rows and can handle more complex data structures within rows.

Architecture

HBase:
- Master-Slave Architecture: HBase has a master node (the HMaster) and worker nodes (RegionServers).
- HDFS Integration: Built on top of Hadoop’s HDFS, providing fault tolerance and high throughput.
- Single Point of Failure: The HBase master can be a single point of failure, although high availability setups can mitigate this.

Cassandra:
- Peer-to-Peer Architecture: All nodes are equal, with no master node, providing high availability and no single point of failure.
- Built-in Fault Tolerance: Data is automatically replicated across multiple nodes.
- Tunable Consistency: Allows configurable consistency levels for reads and writes, offering a trade-off between consistency, availability, and partition tolerance (CAP theorem).

Performance

HBase:
- Write-Optimized: Designed for high write throughput with append-only writes to HDFS.
- Read Performance: Can be slower than Cassandra for some use cases due to its dependence on HDFS and the need for reading from multiple blocks.

Cassandra:
- Balanced Read/Write Performance: Provides high write throughput and low-latency reads.
- Read Performance: Generally faster than HBase for read-heavy workloads due to its design and in-memory caching.

Typical Use Cases

HBase:
- Time-Series Data: Efficient for storing and querying time-series data due to its column-family design.
- Real-Time Analytics: Suitable for applications requiring real-time analytics on large datasets.
- Batch Processing: Integrates well with Hadoop ecosystem tools for batch processing.

Cassandra:
- Geographically Distributed Data: Ideal for applications requiring data distribution across multiple data centers.
- High Availability: Suitable for applications needing high availability and fault tolerance with tunable consistency.
- Real-Time Data: Performs well for real-time data applications such as recommendation engines, IoT data, and messaging systems.

Operational Complexity

HBase:
- Complex Setup: Requires integration with Hadoop and Zookeeper, making it more complex to set up and manage.
- Operational Overhead: Needs careful management of RegionServers, splits, and compactions.

Cassandra:
- Simpler Setup: Easier to set up with fewer dependencies and a more straightforward operational model.
- Automatic Management: Automatically handles data distribution and replication, reducing operational overhead.

Example Scenarios

HBase:
- IoT Data Ingestion and Analysis: Ingesting and analyzing data from millions of IoT devices with high write throughput and batch processing capabilities.
- Log Management: Storing and querying log data with real-time access for monitoring and troubleshooting.

Cassandra:
- Global User Base: Managing user data for a globally distributed application, ensuring high availability and low latency access.
- E-Commerce Platform: Handling product catalog and user sessions with high availability and low-latency reads and writes.

When to Use HBase:
- When you need tight integration with the Hadoop ecosystem for batch processing.
- For applications that require high write throughput and can benefit from HBase’s column-family design.
- For time-series data, real-time analytics, and applications with wide tables and sparse data.

When to Use Cassandra:
- For applications that need high availability and fault tolerance across multiple data centers.
- When you need balanced read/write performance and tunable consistency.
- For real-time data applications, geographically distributed data, and scenarios requiring easy setup and management.

By understanding the strengths and limitations of HBase and Cassandra, you can choose the right database for your specific use case, ensuring optimal performance, scalability, and reliability for your applications.

123
Q

Map reduce

A

Early 2000s; built to be super resilient to failures.

DDIA 399

MapReduce is a programming model and processing technique that allows for the scalable and efficient processing of large data sets across a distributed cluster of computers. Here are the key advantages of using MapReduce:

  1. Scalability:
    • Horizontal Scaling: MapReduce can easily scale to handle petabytes of data by distributing tasks across a large number of commodity machines. As data volumes grow, additional nodes can be added to the cluster to maintain performance.
  2. Fault Tolerance:
    • Automatic Recovery: MapReduce provides robust fault tolerance. If a node fails during processing, the system automatically reassigns the failed tasks to other nodes, ensuring that the job completes successfully.
    • Data Replication: Input data is typically stored in a distributed file system like HDFS, which replicates data blocks across multiple nodes, further enhancing fault tolerance.
  3. Simplified Programming Model:
    • Abstraction: MapReduce abstracts the complexity of distributed computing, allowing developers to focus on the core business logic through the map and reduce functions without worrying about the underlying infrastructure.
    • Ease of Use: The model simplifies the implementation of parallel processing tasks, making it easier for developers to write distributed applications.
  4. Parallel Processing:
    • Concurrent Execution: MapReduce processes data in parallel by dividing it into smaller chunks and distributing these chunks across multiple nodes. This leads to significant improvements in processing speed and efficiency.
  5. Data Locality:
    • Optimized Data Access: MapReduce moves computation to the data, meaning tasks are executed on the nodes where the data resides. This reduces network I/O and improves performance by leveraging data locality.
  6. Flexibility and Versatility:
    • Broad Use Cases: MapReduce can be used for a wide variety of data processing tasks, including ETL (Extract, Transform, Load) processes, data mining, machine learning, and log analysis.
    • Language Support: Implementations of MapReduce (like Hadoop) support multiple programming languages, including Java, Python, and others, making it accessible to a wide range of developers.
  7. Load Balancing:
    • Efficient Resource Utilization: The MapReduce framework distributes tasks across the cluster in a balanced manner, ensuring optimal utilization of resources and preventing bottlenecks.
  8. Integration with Big Data Ecosystem:
    • Hadoop Ecosystem: MapReduce integrates seamlessly with other components of the Hadoop ecosystem, such as HDFS for storage, YARN for resource management, and tools like Hive and Pig for data querying and scripting.

Consider a scenario where a company needs to analyze log files generated by web servers to identify popular pages and user behavior patterns.

  1. Map Phase:
    • Each log file is split into lines.
    • The map function processes each line, extracting relevant information such as the page URL and user ID.
    • The map function emits key-value pairs, with the page URL as the key and a count of 1 as the value.
  2. Shuffle and Sort:
    • The framework automatically shuffles and sorts the key-value pairs based on the keys (page URLs).
    • This phase groups all counts for the same URL together.
  3. Reduce Phase:
    • The reduce function aggregates the counts for each URL.
    • It sums the values associated with each key, producing the total count of accesses for each page.
  4. Output:
    • The final output is a list of URLs with their corresponding access counts, stored in HDFS or another storage system for further analysis.
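
As a rough, framework-agnostic illustration of these phases, here is a minimal in-process sketch in JavaScript; the log-line format and helper names are assumptions made for the example, not part of any real MapReduce API:

```javascript
// Minimal sketch of the map/shuffle/reduce phases for counting page hits.
// Assumes each log line looks like "<userId> <pageUrl>" (hypothetical format).

function map(logLine) {
    const [, pageUrl] = logLine.split(' ');
    return [pageUrl, 1]; // emit (key, value) = (URL, 1)
}

function shuffle(pairs) {
    // Group all values by key, as the framework does between map and reduce.
    const groups = new Map();
    for (const [key, value] of pairs) {
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(value);
    }
    return groups;
}

function reduce(key, values) {
    return [key, values.reduce((sum, v) => sum + v, 0)]; // total hits per URL
}

const logLines = ['u1 /home', 'u2 /home', 'u1 /about'];
const grouped = shuffle(logLines.map(map));
const results = [...grouped].map(([key, values]) => reduce(key, values));
console.log(results); // [ [ '/home', 2 ], [ '/about', 1 ] ]
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, but the data flow is the same.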

MapReduce offers several advantages for processing large data sets in a distributed and parallel manner:

  • Scalability: Easily handles growing data volumes by adding more nodes.
  • Fault Tolerance: Automatically recovers from node failures.
  • Simplified Programming: Abstracts the complexities of distributed computing.
  • Parallel Processing: Efficiently processes data in parallel.
  • Data Locality: Reduces network I/O by processing data where it resides.
  • Flexibility: Supports a wide range of data processing tasks.
  • Load Balancing: Ensures efficient resource utilization.
  • Integration: Works well with other tools in the big data ecosystem.

These advantages make MapReduce a powerful and versatile tool for big data processing and analytics.

124
Q

Batch job data joins

A

DDIA 404

When we talk about joins in the context of batch processing, we mean resolving all occurrences of some association within a dataset. For example, we assume that a job is processing the data for all users simultaneously, not merely looking up the data for one particular user (which would be done far more efficiently with an index).

  • Sort-merge joins: can be extremely slow; all records have to be sorted by the join key, and at least one whole dataset has to be sent over the network.
  • Broadcast hash joins: the entire small dataset is sent to every partition and hash-joined there (see the sketch below). The big dataset doesn't need to be sorted, but the small dataset must fit in memory.
  • Partitioned hash joins: used when both datasets are too big to fit in memory but are partitioned the same way, so that each pair of corresponding partitions fits in memory.
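
To make the broadcast hash join concrete, here is a minimal single-process sketch in JavaScript; the dataset shapes and field names are invented for the example. The small dataset is loaded into an in-memory hash table and the large dataset is streamed past it, which is what each partition of a batch job would do.

```javascript
// Broadcast hash join sketch: join a large "orders" dataset against a small
// "users" table that fits in memory (both datasets are hypothetical).

const users = [            // small dataset: broadcast to every partition
    { userId: 1, name: 'Alice' },
    { userId: 2, name: 'Bob' },
];

const orders = [           // large dataset: streamed, never sorted or held in memory
    { orderId: 'a', userId: 1, amount: 30 },
    { orderId: 'b', userId: 2, amount: 15 },
    { orderId: 'c', userId: 1, amount: 99 },
];

// Build a hash table on the join key from the small dataset.
const usersById = new Map(users.map(u => [u.userId, u]));

// Probe the hash table for each record of the big dataset.
const joined = orders.map(o => ({ ...o, name: usersById.get(o.userId)?.name }));
console.log(joined);
```
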
125
Q

Why map reduce sucks

A

See in the picture

126
Q

Spark

A
  1. Spark is faster than MapReduce because it uses RDDs (Resilient Distributed Datasets) and keeps intermediate results in memory instead of writing them to disk between every stage.
  2. Spark fault tolerance: narrow dependencies (a partition depends on a single parent partition) can be recomputed cheaply from lineage; wide dependencies rely on data from other nodes to complete the computation, so their shuffle outputs are written to disk.
127
Q

Trade-offs between TCP connections and a message broker for event streaming

A

When designing a system for producer and consumer event streaming, you might choose between using direct TCP connections or a message broker. Each approach has its own trade-offs in terms of complexity, performance, scalability, reliability, and more. Here’s a detailed comparison:

Direct TCP Connections:

Advantages:

  1. Lower Latency:
    • Direct Communication: TCP connections offer direct communication between producers and consumers, reducing the latency introduced by intermediaries.
    • Fewer Hops: Since messages don’t pass through a broker, the path is shorter, leading to potentially faster message delivery.
  2. Simplicity in Small Scale:
    • Direct Implementation: For small-scale systems, direct TCP connections can be simpler to implement and manage without the overhead of deploying and maintaining a message broker.
  3. Resource Efficiency:
    • No Intermediary: Avoids the additional resource overhead that comes with running a message broker, such as CPU, memory, and disk I/O.

Disadvantages:

  1. Scalability:
    • Connection Management: Managing a large number of TCP connections can become complex and resource-intensive. Each producer-consumer pair requires a dedicated connection.
    • Network Load: High numbers of connections can put significant load on the network infrastructure.
  2. Reliability and Resilience:
    • Failure Handling: Handling producer or consumer failures requires additional mechanisms for retrying and ensuring message delivery.
    • Data Loss: Without a broker, ensuring that messages are not lost due to consumer downtime or network issues requires more complex handling.
  3. Load Balancing:
    • Manual Balancing: Distributing load among consumers must be managed manually or through custom logic, which can be error-prone and inefficient.

Message Broker:

Advantages:

  1. Scalability:
    • Connection Pooling: A message broker can handle many producers and consumers efficiently, managing connections and distributing messages without a direct link between every producer-consumer pair.
    • Horizontal Scaling: Brokers like Kafka can scale horizontally to handle increased load by adding more broker nodes.
  2. Reliability and Fault Tolerance:
    • Persistent Storage: Brokers can persist messages to disk, ensuring no data loss even if consumers are temporarily unavailable.
    • Acknowledgements and Retries: Brokers typically provide mechanisms for message acknowledgements and retries, ensuring reliable delivery.
  3. Load Balancing and Distribution:
    • Automatic Distribution: Brokers can distribute messages to consumers based on configurable policies, ensuring balanced load and efficient processing.
    • Consumer Groups: In systems like Kafka, consumer groups allow multiple consumers to share the load of processing messages from a topic.
  4. Decoupling:
    • Loose Coupling: Producers and consumers are decoupled, allowing each to evolve independently. Producers need not know about the consumers, and vice versa.

Disadvantages:

  1. Increased Latency:
    • Intermediate Hop: Introducing a broker adds an extra hop for messages, which can increase latency compared to direct TCP connections.
  2. Complexity and Overhead:
    • Setup and Maintenance: Deploying, configuring, and maintaining a message broker adds complexity. It requires additional infrastructure and operational overhead.
    • Resource Usage: Brokers consume CPU, memory, and disk resources, which can be significant, especially under high load.
  3. Single Point of Failure:
    • Broker Failure: If the broker fails or becomes a bottleneck, it can affect the entire system. High availability setups and redundancy are required to mitigate this risk.

TCP Connections:
- Pros: Lower latency, simpler for small-scale systems, and more resource-efficient without intermediary overhead.
- Cons: Poor scalability, complex reliability handling, manual load balancing, and increased connection management complexity.

Message Broker:
- Pros: Excellent scalability, built-in reliability and fault tolerance, automatic load balancing and distribution, and decoupling of producers and consumers.
- Cons: Higher latency due to intermediate hop, increased setup and maintenance complexity, and additional resource overhead.

  • TCP Connections: Suitable for small-scale, low-latency applications where simplicity and direct communication are key, and the overhead of managing many connections is minimal.
  • Message Broker: Ideal for large-scale, distributed systems requiring high scalability, reliability, fault tolerance, and loose coupling between producers and consumers. Examples include event-driven architectures, real-time data processing pipelines, and systems requiring high availability and persistent message storage.

By understanding these trade-offs, you can choose the approach that best fits your system’s requirements and constraints.

128
Q

Use cases of event streaming

A
  1. Time windows (metrics and alarms)
  2. Change data capture - write first to the database, then propagate changes to derived databases
  3. Event sourcing - write directly to the broker; database-agnostic events allow us to build new types of derived data in the future
129
Q

What are the common solutions to ensure exactly-once message delivery in an event streaming system?

A

Ensuring exactly-once message delivery in an event streaming system involves a combination of techniques and careful system design:

•	Idempotent Producers: Ensure that message retries do not result in duplicates.
•	Transactions: Provide atomicity for producing and consuming operations.
•	Stream Processing Frameworks: Leverage built-in exactly-once processing guarantees.
•	Idempotent Consumers: Handle duplicates at the consumer level.

By combining these strategies, you can achieve exactly-once delivery, ensuring that each message is processed exactly once, even in the presence of failures and retries.
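
As an illustration of the idempotent-consumer piece, here is a minimal sketch in JavaScript. The message shape and the in-memory set of processed IDs are assumptions for the example; in practice the deduplication state would live in a durable store alongside the business data.

```javascript
// Idempotent consumer sketch: skip messages whose ID was already processed,
// so redelivery after a retry does not apply the same effect twice.

const processedIds = new Set(); // assumption: would be durable in production

function applyBusinessLogic(message) {
    console.log('Processing', message.id, message.payload); // hypothetical side effect
}

function handleMessage(message) {
    if (processedIds.has(message.id)) {
        return; // duplicate delivery: safely ignore
    }
    applyBusinessLogic(message);
    processedIds.add(message.id); // record the ID only after successful processing
}

// Simulate an at-least-once broker redelivering the same message.
handleMessage({ id: 'm-1', payload: 'charge $10' });
handleMessage({ id: 'm-1', payload: 'charge $10' }); // ignored as a duplicate
```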

130
Q

What are the common types of message brokers?

A

Memory based
Disk based

131
Q

Memory based message broker implementation

A

Use a linked list (in-memory queue) per topic.
Round-robin delivery to consumers.
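
A minimal sketch of this idea in JavaScript; the array stands in for the linked list, and all names are invented for the example. Note that nothing is persisted, so a crash loses every message still queued.

```javascript
// In-memory broker sketch: messages are appended to a per-topic queue and
// delivered round-robin across that topic's consumers.

class InMemoryBroker {
    constructor() {
        this.queues = new Map();       // topic -> pending messages
        this.consumers = new Map();    // topic -> consumer callbacks
        this.nextConsumer = new Map(); // topic -> round-robin cursor
    }

    subscribe(topic, callback) {
        if (!this.consumers.has(topic)) this.consumers.set(topic, []);
        this.consumers.get(topic).push(callback);
    }

    publish(topic, message) {
        if (!this.queues.has(topic)) this.queues.set(topic, []);
        this.queues.get(topic).push(message);
        this.deliver(topic);
    }

    deliver(topic) {
        const queue = this.queues.get(topic) || [];
        const consumers = this.consumers.get(topic) || [];
        while (queue.length > 0 && consumers.length > 0) {
            const message = queue.shift();                   // take from the head
            const cursor = this.nextConsumer.get(topic) || 0;
            consumers[cursor % consumers.length](message);   // round-robin pick
            this.nextConsumer.set(topic, cursor + 1);
        }
    }
}

const broker = new InMemoryBroker();
broker.subscribe('metrics', m => console.log('consumer A got', m));
broker.subscribe('metrics', m => console.log('consumer B got', m));
broker.publish('metrics', { cpu: 0.7 }); // delivered to consumer A
broker.publish('metrics', { cpu: 0.9 }); // delivered to consumer B
```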

132
Q

Fan out.

A

In the context of computing and distributed systems, fan-out refers to the process of distributing or replicating data from a single source to multiple destinations. It involves taking a single input or a set of inputs and spreading them out to multiple receivers or systems. This concept is widely used in various areas such as messaging systems, data processing, and network design.

Key Contexts and Examples of Fan-Out

1.	Messaging Systems:
•	Definition: Fan-out in messaging systems involves a single message being sent from a producer to multiple consumers.
•	Example: In a publish-subscribe (pub-sub) messaging system, a producer publishes a message to a topic, and all subscribers to that topic receive the message. Systems like Apache Kafka, RabbitMQ, and Amazon SNS support this model.

Producer → Topic → Consumer 1
→ Consumer 2
→ Consumer 3

2.	Microservices:
•	Definition: In microservices architectures, fan-out refers to one service making requests to multiple downstream services.
•	Example: A user service might need to fetch user profile information, recent orders, and notifications from three different services.

User Service → Profile Service
→ Orders Service
→ Notifications Service

3.	Data Processing:
•	Definition: Fan-out in data processing involves a single data source feeding multiple processing pipelines or tasks.
•	Example: In an ETL (Extract, Transform, Load) pipeline, data might be extracted from a source and then simultaneously processed by multiple transformation steps.

Data Source → Transformation Task 1
→ Transformation Task 2
→ Transformation Task 3

4.	Network Design:
•	Definition: Fan-out can refer to the distribution of network traffic from a single point to multiple endpoints.
•	Example: A load balancer receiving incoming traffic and distributing it to multiple backend servers.

Load Balancer → Server 1
→ Server 2
→ Server 3

5.	Database Replication:
•	Definition: Fan-out in databases involves replicating data from a primary database to multiple replicas.
•	Example: A primary database might replicate its data to several read replicas to distribute read load.

Primary Database → Replica 1
→ Replica 2
→ Replica 3

Benefits of Fan-Out

1.	Scalability:
•	Distributing workload across multiple consumers or services can help scale the system to handle more load.
2.	Redundancy and Fault Tolerance:
•	Having multiple copies or paths for data can improve fault tolerance and ensure system reliability.
3.	Performance:
•	Parallel processing of data across multiple consumers or services can significantly improve performance and reduce processing time.
4.	Flexibility:
•	Allows different parts of a system to handle specific tasks independently, improving modularity and manageability.

Challenges of Fan-Out

1.	Complexity:
•	Managing multiple downstream systems or services increases system complexity, requiring more sophisticated coordination and monitoring.
2.	Consistency:
•	Ensuring data consistency across multiple consumers or replicas can be challenging, especially in distributed systems.
3.	Latency:
•	Fan-out can introduce additional latency, especially if the downstream systems or services have variable response times.
4.	Resource Utilization:
•	Distributing data to multiple destinations can lead to increased resource consumption, such as network bandwidth and processing power.
133
Q

Log based message broker implementation

A

Consumer offset
Durability
Replay
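
A minimal sketch of a log-based topic in JavaScript, showing per-consumer-group offsets and replay; the class and method names are invented for the example, and a real broker would persist and replicate the log rather than keep it in memory.

```javascript
// Log-based broker sketch: an append-only log plus per-consumer-group offsets.
// Consumers advance their own offset; rewinding the offset replays old events.

class LogTopic {
    constructor() {
        this.log = [];            // append-only list of records
        this.offsets = new Map(); // consumer group -> next index to read
    }

    append(record) {
        this.log.push(record);
        return this.log.length - 1; // the record's offset
    }

    poll(group, maxRecords = 10) {
        const start = this.offsets.get(group) || 0;
        const records = this.log.slice(start, start + maxRecords);
        this.offsets.set(group, start + records.length); // commit the new offset
        return records;
    }

    seek(group, offset) {
        this.offsets.set(group, offset); // rewind (or skip ahead) for replay
    }
}

const topic = new LogTopic();
topic.append({ type: 'click', page: '/home' });
topic.append({ type: 'click', page: '/about' });

console.log(topic.poll('analytics')); // both records
topic.seek('analytics', 0);           // replay from the beginning
console.log(topic.poll('analytics')); // both records again
```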

134
Q

How Flink Ensures Exactly-Once Processing

A
  1. Checkpointing and State Management:
    • Checkpoint Barriers: As mentioned earlier, checkpoint barriers are inserted into the data stream and flow through the operators. These barriers ensure that the state captured during a checkpoint is consistent across all operators.
    • State Snapshot: Each operator takes a snapshot of its state when it encounters a checkpoint barrier and stores this state in the state backend.
  2. Replaying Events from Sources:
    • Kafka Source Example: In the case of a Kafka source, Flink commits offsets only when a checkpoint is successfully completed. This means that if a failure occurs and Flink resumes processing from the last completed checkpoint, it will start consuming messages from the offsets corresponding to that checkpoint. Any messages processed after the checkpoint and before the failure will be replayed.
  3. Idempotent Processing:
    • Idempotent Operations: Operators in Flink should ideally be idempotent, meaning that processing the same message multiple times does not change the outcome. This helps in scenarios where messages might be replayed after a failure.
    • Deduplication: In some cases, deduplication logic can be implemented within operators to ensure that duplicate messages are not processed more than once.
  4. Transactional Sinks:
    • Two-Phase Commit Protocol: Flink uses a two-phase commit protocol for transactional sinks. During the first phase (pre-commit), the sink receives messages but does not commit them. Once a checkpoint is completed, the sink moves to the second phase (commit), ensuring that the messages are durably written and acknowledged.
    • Exactly-Once Guarantees: This approach ensures that even if messages are replayed after a failure, they are only committed once, providing exactly-once semantics.
135
Q

Search indexes

A

Tokenize the documents.
Inverted index - maps each token to the list of document IDs that contain it.
Prefix/suffix searching - keeping the tokens sorted allows logarithmic-time (binary search) lookup of prefix ranges.
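
A minimal sketch of building and querying an inverted index in JavaScript; the tokenizer here is just a regex split, which is a simplification of real analyzers.

```javascript
// Inverted index sketch: token -> set of document IDs containing that token.

function tokenize(text) {
    return text.toLowerCase().split(/\W+/).filter(Boolean);
}

function buildIndex(documents) {
    const index = new Map();
    for (const [docId, text] of Object.entries(documents)) {
        for (const token of tokenize(text)) {
            if (!index.has(token)) index.set(token, new Set());
            index.get(token).add(docId);
        }
    }
    return index;
}

const docs = {
    d1: 'the quick brown fox',
    d2: 'the lazy brown dog',
};

const index = buildIndex(docs);
console.log([...(index.get('brown') || [])]); // ['d1', 'd2']
console.log([...(index.get('fox') || [])]);   // ['d1']

// Keeping the distinct tokens sorted is what allows a prefix range to be
// located with binary search; the linear filter below is just for brevity.
const sortedTokens = [...index.keys()].sort();
console.log(sortedTokens.filter(t => t.startsWith('br'))); // ['brown']
```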

136
Q

Apache Lucene

A

Most popular open-source search index, started in 1999.

Supports many types of indexes for complicated variants of search: text, numbers, coordinates, etc.

Uses an LSM-tree variant to support fast document ingestion; writes first go to memory.

137
Q

Elasticsearch

A

A wrapper around Apache Lucene; adds APIs, query syntax, and monitoring tools.

138
Q

Time series DB

A

TimescaleDB; InfluxDB; Apache Druid.

Column-oriented tables.

As opposed to one large table with one big index, use many small indexes (e.g., one index per time chunk).

Hypertable: a virtual table that is automatically partitioned by time into many smaller chunk tables.

139
Q

Graph database

A

Neo4j implementation

140
Q

Geospatial indexes

A

Geohashes / quadtrees

Geo-sharding

141
Q

Cache

A

Local cache - faster (no network hop), but capacity is bounded by each instance's memory and the cache is lost if the instance goes down.
Global cache - can scale independently of the application instances; requires a network call.

142
Q

Write around cache

A

A write-around cache is a caching strategy used in computer systems to manage how data is written to and read from the cache and the backing storage (such as a disk or main memory). In this strategy, write operations bypass the cache and are directly written to the backing storage. The cache is only populated on read misses. This approach is often used to reduce the write load on the cache and ensure that the cache contains only frequently accessed data.

How It Works:

  1. Write Operations:
    • When a write operation occurs, the data is written directly to the backing storage.
    • The cache is not updated or populated during write operations.
  2. Read Operations:
    • If the data is already in the cache, it is read from the cache, providing fast access.
    • If the data is not in the cache (a read miss), it is read from the backing storage and then cached for future access.
  3. Cache Population:
    • The cache is populated only on read misses, meaning that only data that has been read at least once will be stored in the cache.
    • This can help ensure that the cache is filled with frequently accessed data, which can improve read performance.

Advantages:

  1. Reduced Write Load:
    • By bypassing the cache on write operations, the write load on the cache is reduced, which can improve the overall performance of write operations.
    • This can also reduce the wear on the cache in systems where the cache has limited write endurance (such as in flash memory).
  2. Improved Cache Efficiency:
    • The cache is more likely to contain frequently read data because it is populated only on read misses.
    • This can improve read performance, as the cache will be optimized for frequently accessed data.
  3. Simplicity:
    • The implementation of write-around caching can be simpler compared to other caching strategies that require more complex coherence mechanisms for managing cache updates on writes.

Disadvantages:

  1. Potential for Read Misses:
    • Since write operations do not update the cache, there is a higher likelihood of read misses if the recently written data is accessed soon after the write.
    • This can lead to increased read latency for data that is frequently written and then read.
  2. Increased Latency for Write-Read Patterns:
    • For workloads where data is frequently written and then read shortly afterward, the write-around cache can introduce additional latency because the data has to be fetched from the slower backing storage on the first read after a write.

Use Cases:

  1. Data Warehousing:
    • In data warehousing applications where large amounts of data are written during batch updates but read more frequently during query operations, write-around caching can help optimize read performance while reducing the write load on the cache.
  2. Content Delivery Networks (CDNs):
    • In CDNs, where content is updated infrequently but read frequently by users, write-around caching can ensure that the cache is populated with popular content that is accessed often.

Related Caching Strategies:

  1. Write-Through Cache:
    • In a write-through cache, data is written to both the cache and the backing storage simultaneously. This ensures data consistency between the cache and the storage but can increase write latency.
  2. Write-Back Cache:
    • In a write-back cache, data is written to the cache first and then written to the backing storage at a later time. This can improve write performance but requires more complex mechanisms to ensure data consistency and manage cache coherency.

A write-around cache is a caching strategy where write operations bypass the cache and are written directly to the backing storage, while the cache is only populated on read misses. This approach reduces the write load on the cache and ensures that the cache contains frequently accessed data, which can improve read performance. However, it may lead to increased read latency for data that is frequently written and then read shortly afterward. Write-around caching is suitable for applications where read operations are more frequent than write operations and where data consistency can be managed without complex coherence mechanisms.
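
A minimal sketch of write-around caching in JavaScript; the two Maps stand in for the cache and the backing store, which in a real system would be something like a cache tier and a database.

```javascript
// Write-around cache sketch: writes go straight to backing storage and the
// cache is only populated on read misses.

const storage = new Map(); // stand-in for the backing store (e.g., a database)
const cache = new Map();   // stand-in for the cache tier

function write(key, value) {
    storage.set(key, value); // bypass the cache entirely
    cache.delete(key);       // drop any stale cached copy (a common invalidation
                             // choice; pure write-around may even skip this)
}

function read(key) {
    if (cache.has(key)) {
        return cache.get(key);      // cache hit
    }
    const value = storage.get(key); // cache miss: go to backing storage
    if (value !== undefined) {
        cache.set(key, value);      // populate the cache on the read path
    }
    return value;
}

write('user:1', { name: 'Alice' });
console.log(read('user:1')); // miss -> loaded from storage, then cached
console.log(read('user:1')); // hit  -> served from the cache
```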

143
Q

Write through cache

A

Data consistency between cache and database.

Can have correctness issues if not using two phase commit; slowest write method.

144
Q

Write back cache

A

Lowest-latency writes.
Risk of data loss and staleness/correctness issues if the cache fails before data is flushed to the database.

145
Q

Cache eviction policies

A
  1. First in first out - easy to implement
  2. Least recently used - most commonly used in practice
  3. Least frequently used
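
A minimal LRU cache sketch in JavaScript, relying on the fact that a Map iterates keys in insertion order; the capacity and keys are arbitrary for the example.

```javascript
// LRU cache sketch: re-inserting a key on access moves it to the
// "most recently used" end; eviction removes the oldest (first) key.

class LruCache {
    constructor(capacity) {
        this.capacity = capacity;
        this.map = new Map();
    }

    get(key) {
        if (!this.map.has(key)) return undefined;
        const value = this.map.get(key);
        this.map.delete(key);      // move the key to the most-recent position
        this.map.set(key, value);
        return value;
    }

    put(key, value) {
        if (this.map.has(key)) this.map.delete(key);
        this.map.set(key, value);
        if (this.map.size > this.capacity) {
            const oldestKey = this.map.keys().next().value; // least recently used
            this.map.delete(oldestKey);
        }
    }
}

const cache = new LruCache(2);
cache.put('a', 1);
cache.put('b', 2);
cache.get('a');    // 'a' becomes most recently used
cache.put('c', 3); // evicts 'b'
console.log(cache.get('b')); // undefined
console.log(cache.get('a')); // 1
```
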
146
Q

Redis versus Memcache

A

Memcached - consistent hash ring; LRU eviction; multi-threaded.

Redis - feature-rich (hash maps, sorted sets, geo indexes). Fixed number of partitions (hash slots) coordinated via a gossip protocol. Write-ahead log (append-only file) for persistence; supports transactions. Single-threaded, actual serial execution. Single-leader replication.
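
A minimal consistent-hash-ring sketch in JavaScript, roughly how a Memcached client picks a server for a key; md5 is used only as a convenient hash, the node names are invented, and real clients usually add many virtual nodes per server to smooth the distribution.

```javascript
const crypto = require('crypto');

// Consistent hashing sketch: servers and keys are hashed onto the same ring;
// a key belongs to the first server clockwise from the key's position.

function hashToRing(value) {
    // Use the first 8 hex chars of md5 as a 32-bit ring position.
    return parseInt(crypto.createHash('md5').update(value).digest('hex').slice(0, 8), 16);
}

class HashRing {
    constructor(nodes) {
        this.ring = nodes
            .map(node => ({ node, position: hashToRing(node) }))
            .sort((a, b) => a.position - b.position);
    }

    nodeFor(key) {
        const position = hashToRing(key);
        // First node at or after the key's position, wrapping around the ring.
        const entry = this.ring.find(e => e.position >= position) || this.ring[0];
        return entry.node;
    }
}

const ring = new HashRing(['cache-1:11211', 'cache-2:11211', 'cache-3:11211']);
console.log(ring.nodeFor('user:42'));
console.log(ring.nodeFor('session:abc'));
```

Adding or removing one server only remaps the keys that fall between it and its neighbor on the ring, rather than reshuffling everything.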

147
Q

Push versus Pull CDN

A

Push CDN - the origin pushes content to the CDN ahead of time; use it when we know the content is going to be popular.
Pull CDN - the CDN fetches and caches content from the origin on the first request.

148
Q

Why Hadoop HDFS is not good as binary object storage?

A

Hadoop Distributed File System (HDFS) and object storage serve different purposes and have distinct design philosophies. While HDFS is optimized for certain types of workloads, it is not ideally suited for object storage due to several key limitations. Here are the main reasons why HDFS is not good as object storage:

  • NameNode Scalability: In HDFS, the NameNode maintains metadata about the filesystem, including the directory structure and file-to-block mappings. This centralization of metadata can become a bottleneck as the number of files increases, limiting the scalability of HDFS in scenarios with a massive number of small objects.
  • Metadata Overhead: Object storage systems, such as Amazon S3 or Google Cloud Storage, are designed to handle billions of objects efficiently with distributed metadata management. HDFS, with its single NameNode architecture (although High Availability configurations exist), struggles with the metadata overhead when managing numerous small files.
  • Small File Problem: HDFS is designed to handle large files and is optimized for high-throughput access to large datasets. When dealing with a large number of small files, HDFS becomes inefficient because each file generates a separate metadata entry in the NameNode, increasing memory usage and reducing performance.
  • Block Size: HDFS uses a block-based storage system with a default block size (often 128 MB or 256 MB). Storing small files that are much smaller than the block size results in significant storage inefficiency and increased metadata burden on the NameNode.
  • Sequential Access: HDFS is optimized for sequential read/write access patterns, which is ideal for big data analytics workloads where large datasets are processed in bulk. However, object storage systems are designed to provide efficient random access to objects, which is crucial for applications that need to retrieve individual objects quickly.
  • Latency: Object storage systems often have lower latency for accessing individual objects, whereas HDFS may exhibit higher latency for such operations due to its design focus on throughput rather than latency.
  • File System Semantics: HDFS provides a POSIX-like file system interface, which includes a hierarchical directory structure and file-based operations. Object storage systems, on the other hand, use a flat namespace and provide RESTful APIs for object operations, making them more suitable for modern cloud-native applications.
  • RESTful API: Object storage services like Amazon S3 offer RESTful APIs that are easy to integrate with web applications and other cloud services. HDFS, while it does provide APIs, is more complex to interface with compared to the simple and standardized APIs of object storage.
  • Replication: HDFS uses a replication mechanism to ensure data durability, typically replicating data blocks three times across different nodes. This approach, while effective, can be less storage-efficient compared to the erasure coding techniques used by many object storage systems.
  • Geo-Redundancy: Object storage systems often provide built-in geo-redundancy, automatically replicating data across multiple geographic regions to ensure high availability and disaster recovery. HDFS requires additional configuration and management to achieve similar levels of redundancy and availability.
  • Scalability Limits: While HDFS can scale horizontally by adding more DataNodes, the scalability of the NameNode remains a limiting factor, particularly in large clusters with a high number of files. Object storage systems are designed to scale infinitely by distributing both data and metadata across a highly distributed architecture.
  • Elasticity: Object storage services in the cloud offer elastic scalability, allowing you to scale storage capacity up or down based on demand without manual intervention. HDFS requires more manual management and planning to scale the storage infrastructure effectively.

While HDFS is a robust distributed file system designed for high-throughput data processing and big data analytics, it is not well-suited for use as an object storage system due to limitations in handling metadata, inefficiency with small files, higher latency for random access, differences in API interfaces, and challenges in scalability and elasticity. Object storage systems are specifically designed to address these challenges, making them a better choice for use cases that require efficient, scalable, and low-latency access to a large number of objects.

149
Q

Data lake versus data warehouse

A

Data lakes and data warehouses are both storage solutions for managing large volumes of data, but they serve different purposes and have distinct characteristics. Here’s a detailed comparison between data lakes and data warehouses:

  • Data Lake:
    • A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store data as-is, without structuring it first, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
  • Data Warehouse:
    • A data warehouse is a centralized repository designed to store structured data that has been processed and formatted for a specific purpose. It supports fast query performance and is optimized for reporting and analysis.
  • Data Lake:
    • Supports structured, semi-structured, and unstructured data, such as logs, XML, JSON, images, videos, and more.
  • Data Warehouse:
    • Primarily handles structured data, which is cleaned, processed, and formatted before being loaded into the warehouse. It supports semi-structured data to some extent, but not unstructured data.
  • Data Lake:
    • Schema-on-read: The schema is applied to the data when it is read. This allows for flexible and dynamic data ingestion.
  • Data Warehouse:
    • Schema-on-write: The schema is defined before data is written to the warehouse. This ensures data consistency and structure but requires upfront planning and processing.
  • Data Lake:
    • Can store raw data without any preprocessing. This raw data can later be processed and analyzed using various tools and frameworks.
  • Data Warehouse:
    • Requires data to be cleaned, transformed, and structured before being loaded (ETL process: Extract, Transform, Load).
  • Data Lake:
    • Generally cheaper to store large volumes of data because it uses low-cost storage solutions and does not require extensive preprocessing.
  • Data Warehouse:
    • More expensive due to the cost of processing, storage optimizations, and the need for high-performance hardware and software.
  • Data Lake:
    • Can handle large volumes of data and is scalable. However, query performance can be slower because the data is not pre-processed or indexed.
  • Data Warehouse:
    • Optimized for fast query performance, with indexing, partitioning, and other optimizations applied to the structured data. Suitable for complex queries and analysis.
  • Data Lake:
    • Ideal for storing vast amounts of raw data from various sources, data exploration, data science, and machine learning. Suitable for unstructured and semi-structured data.
  • Data Warehouse:
    • Best for business intelligence, reporting, and structured data analysis. Suitable for historical data analysis, operational reporting, and data integration from multiple structured sources.
  • Data Lake:
    • Data can be accessed and processed using various big data tools and frameworks like Hadoop, Spark, and other analytics engines.
  • Data Warehouse:
    • Data is accessed using SQL-based tools and applications optimized for complex queries and business intelligence tools.
  • Data Lake:
    • Amazon S3, Azure Data Lake, Google Cloud Storage, Hadoop Distributed File System (HDFS).
  • Data Warehouse:
    • Amazon Redshift, Google BigQuery, Snowflake, Microsoft Azure SQL Data Warehouse, Oracle Exadata.
  • Data Lake:
    • Data governance can be more challenging due to the variety of data types and the lack of enforced schema. Security measures need to be tailored to handle raw and potentially sensitive data.
  • Data Warehouse:
    • Typically has more robust data governance and security features in place due to the structured nature of the data and the need for regulatory compliance and data integrity.
  • Data Lake:
    • Strengths: Flexible storage for a variety of data types, cost-effective, suitable for big data analytics and machine learning.
    • Weaknesses: Potentially slower query performance, complex data governance and security, requires specialized tools for data processing.
  • Data Warehouse:
    • Strengths: Optimized for fast query performance, robust data governance and security, suitable for business intelligence and structured data analysis.
    • Weaknesses: More expensive, requires data to be pre-processed and structured, less flexible in handling unstructured data.

The choice between a data lake and a data warehouse depends on the specific needs of your organization, the nature of your data, and the types of analytics you want to perform. Many organizations use both in a complementary manner, where a data lake serves as a repository for raw data and a data warehouse is used for structured data analysis and reporting.

150
Q

Reverse proxy versus load balancer

A

Load balancers and reverse proxies are both crucial components in modern web infrastructure, often used to distribute and manage network traffic. While they share some similarities, they serve different primary purposes and have distinct functionalities. Here’s a detailed comparison:

Purpose:
- A load balancer is designed to distribute incoming network traffic across multiple servers to ensure no single server becomes overwhelmed, enhancing the availability and reliability of applications.

Key Functions:
1. Traffic Distribution:
- Distributes client requests among multiple backend servers to balance the load and prevent any single server from becoming a bottleneck.

  1. Improved Availability:
    • Ensures high availability by redirecting traffic to healthy servers if one or more servers fail.
  2. Scalability:
    • Allows scaling out of resources by adding more servers to the backend pool, handling increased traffic seamlessly.
  3. Health Monitoring:
    • Continuously checks the health of backend servers to route traffic only to operational servers.

Types of Load Balancers:
1. Layer 4 Load Balancer:
- Operates at the transport layer (OSI Layer 4). It makes routing decisions based on IP address and TCP/UDP port.
- Examples: AWS Elastic Load Balancer (ELB), HAProxy (in Layer 4 mode).

  1. Layer 7 Load Balancer:
    • Operates at the application layer (OSI Layer 7). It makes routing decisions based on HTTP headers, cookies, URL paths, and other content data.
    • Examples: Nginx, Apache HTTP Server, AWS Application Load Balancer (ALB).

Purpose:
- A reverse proxy is designed to act as an intermediary for requests from clients seeking resources from backend servers. It provides various functionalities such as load balancing, caching, SSL termination, and more.

Key Functions:
1. Load Balancing:
- Distributes incoming traffic among multiple backend servers (a common feature of many reverse proxies).

  1. SSL Termination:
    • Handles SSL/TLS encryption and decryption, offloading this resource-intensive process from backend servers.
  2. Caching:
    • Caches static and dynamic content to reduce the load on backend servers and improve response times.
  3. Security:
    • Acts as a security barrier, hiding the backend server details from clients and protecting against DDoS attacks, web threats, and other vulnerabilities.
  4. Compression:
    • Compresses outbound responses to reduce bandwidth usage and improve load times.
  5. Content Routing:
    • Routes requests based on URL paths, headers, and other request attributes.

Examples of Reverse Proxies:
- Nginx
- Apache HTTP Server
- HAProxy (in Layer 7 mode)
- Varnish

Load Balancer:
- Ideal for scenarios where the primary need is to distribute incoming requests across multiple servers to balance the load, ensure availability, and improve scalability. Common in high-traffic websites, enterprise applications, and cloud services.

Reverse Proxy:
- Suitable for scenarios requiring additional functionalities like SSL termination, caching, compression, security enhancements, and content-based routing. Often used in combination with load balancers for comprehensive traffic management and optimization.

While load balancers and reverse proxies can overlap in their functionalities, especially in traffic distribution, they serve different primary purposes. Load balancers focus on distributing traffic to ensure high availability and scalability, whereas reverse proxies provide a broader range of features including security, caching, SSL termination, and more. Depending on the specific needs of your infrastructure, you may use one or both in conjunction to optimize performance and manage traffic effectively.

151
Q

Active-active load balancing versus active-passive load balancing

A

Active-active and active-passive load balancing are two strategies for distributing traffic among servers or other resources to ensure high availability and reliability of applications. Here’s a detailed comparison of these two approaches:

Active-Active Load Balancing

Definition:

•	In an active-active load balancing configuration, all servers or resources are active and handle traffic simultaneously. The load balancer distributes incoming requests across all available servers.

Key Characteristics:

1.	High Availability:
•	Since all servers are active, there is no single point of failure. If one server fails, the load balancer redistributes the traffic among the remaining servers.
2.	Scalability:
•	Active-active configurations can handle increased traffic by adding more servers. The load is evenly distributed, which can improve performance and capacity.
3.	Resource Utilization:
•	All resources are utilized effectively, ensuring that none of the servers remain idle.
4.	Load Distribution:
•	Traffic is balanced across multiple servers using algorithms such as round-robin, least connections, or hash-based distribution.
5.	Complexity:
•	Active-active setups can be more complex to manage, especially when it comes to data consistency and synchronization between servers.

Pros:

•	Improved performance and capacity.
•	High availability with no single point of failure.
•	Efficient use of resources.

Cons:

•	More complex to configure and manage.
•	Requires robust mechanisms for data consistency and synchronization.

Use Cases:

•	High-traffic websites and applications.
•	Scenarios where performance and high availability are critical.
•	Systems that can handle concurrent data processing without consistency issues.

Active-Passive Load Balancing

Definition:

•	In an active-passive load balancing configuration, one or more servers (active) handle all the traffic while the other servers (passive) remain idle and act as backups. The passive servers are only activated if the active server(s) fail.

Key Characteristics:

1.	High Availability:
•	Provides high availability by having standby servers ready to take over in case of failure of the active server(s).
2.	Failover:
•	In the event of a failure, the load balancer detects the failure and redirects traffic to the passive server(s), which then become active.
3.	Resource Utilization:
•	Passive servers remain idle until needed, leading to underutilization of resources in normal operation.
4.	Simplicity:
•	Easier to configure and manage compared to active-active setups. There are fewer concerns about data consistency and synchronization since only one server (or a set of active servers) handles traffic at a time.
5.	Latency During Failover:
•	There may be a brief period of downtime or increased latency during the failover process as the passive server(s) are brought online and start handling traffic.

Pros:

•	Simpler configuration and management.
•	Easier to maintain data consistency since only one server is handling traffic at any given time.
•	Reliable failover mechanism for high availability.

Cons:

•	Underutilization of passive servers.
•	Possible latency or downtime during failover.

Use Cases:

•	Applications where simplicity and ease of management are more important than maximum resource utilization.
•	Systems where data consistency is critical and easier to maintain with a single active server.
•	Scenarios where failover mechanisms need to be reliable but not necessarily instantaneous.

Conclusion

•	Active-Active Load Balancing: Suitable for high-traffic and performance-critical applications that require high availability and efficient resource utilization. It is more complex to manage but provides better performance and scalability.
•	Active-Passive Load Balancing: Ideal for applications where simplicity and ease of management are prioritized, and where data consistency is crucial. It ensures high availability through reliable failover mechanisms, though it may result in underutilized resources and potential latency during failover.

The choice between active-active and active-passive load balancing depends on the specific needs of the application, including performance requirements, complexity, resource utilization, and the importance of data consistency.

153
Q

UDP versus TCP

A

User Datagram Protocol (UDP) - simple, fast, connectionless; checksums to detect corruption. Unreliable, unordered delivery.

Transmission Control Protocol (TCP) - reliable, ordered, two-way byte stream; 3-way handshake; sequence numbers and acknowledgements.
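
A minimal Node.js sketch contrasting the two; the ports and payloads are arbitrary. UDP via the dgram module is fire-and-forget datagrams, while TCP via the net module is a connection-oriented byte stream.

```javascript
const dgram = require('dgram');
const net = require('net');

// UDP: connectionless datagrams, no delivery or ordering guarantee.
const udpServer = dgram.createSocket('udp4');
udpServer.on('message', (msg, rinfo) => {
    console.log(`UDP datagram from ${rinfo.address}:${rinfo.port}:`, msg.toString());
});
udpServer.bind(41234, () => {
    const udpClient = dgram.createSocket('udp4');
    udpClient.send('hello over UDP', 41234, '127.0.0.1', () => udpClient.close());
});

// TCP: 3-way handshake, then a reliable, ordered, two-way byte stream.
const tcpServer = net.createServer(socket => {
    socket.on('data', data => {
        console.log('TCP bytes received:', data.toString());
        socket.end('ack'); // reply on the same connection, then close it
    });
});
tcpServer.listen(41235, () => {
    const tcpClient = net.createConnection({ port: 41235 }, () => {
        tcpClient.write('hello over TCP');
    });
    tcpClient.on('data', data => {
        console.log('TCP reply:', data.toString());
        tcpClient.end();
        tcpServer.close();
        udpServer.close();
    });
});
```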

154
Q

Flow and congestion control

A

Flow control: avoid overloading the receiver's buffer (the receiver advertises a window size).
Congestion control: avoid overloading the network itself (e.g., slow start and a congestion window that shrinks on packet loss).

155
Q

HTTP long polling

A

HTTP long polling is a technique used to maintain a persistent connection between a client and a server to enable real-time updates. It’s a way to emulate server push notifications in environments where WebSockets or other advanced communication protocols are not available. Here’s a detailed explanation of how HTTP long polling works and its key characteristics:

How It Works:

  1. Client Request:
    • The client sends an HTTP request to the server, typically a GET request, asking for new information or updates.
  2. Server Response:
    • Instead of responding immediately, the server holds the request open until it has new information to send back to the client. This can result in the server keeping the connection open for a long time (hence “long polling”).
  3. Data Availability:
    • When the server has new data or an update, it sends the response to the client with the new information.
  4. Client Handling:
    • Upon receiving the response, the client processes the new information.
  5. Reconnection:
    • The client immediately sends another request to the server, and the process repeats. This ensures that the client is always ready to receive new data as soon as it’s available.

Key Characteristics:

  1. Persistent Connection:
    • The connection between the client and server is kept open for a long period until the server has new data to send.
  2. Real-Time Updates:
    • Enables real-time updates by reducing the latency between the time new data is available on the server and when the client receives it.
  3. Compatibility:
    • Compatible with HTTP/1.1 and works with most web servers and clients without requiring special configurations.
  4. Resource Consumption:
    • Can be resource-intensive on the server because it needs to maintain many open connections simultaneously. Proper server resource management and connection handling are essential.

Advantages:

  1. Near Real-Time Communication:
    • Provides a way to achieve near real-time communication without the need for more complex protocols like WebSockets.
  2. Simplicity:
    • Simpler to implement and more compatible with existing HTTP infrastructure compared to WebSockets.
  3. Fallback Mechanism:
    • Often used as a fallback mechanism for environments where WebSockets are not supported or available.

Disadvantages:

  1. Server Resource Usage:
    • Holding connections open can consume significant server resources, especially when handling many clients.
  2. Latency:
    • While it reduces latency compared to regular polling, it can still introduce some delay compared to WebSockets because a new connection is established with each request.
  3. Scalability:
    • Scaling applications using long polling can be challenging due to the high number of open connections that need to be managed.

Use Cases:

  1. Chat Applications:
    • Long polling is commonly used in chat applications to receive new messages in real time.
  2. Notifications:
    • Suitable for systems that need to push notifications or updates to clients as soon as they are available.
  3. Live Data Feeds:
    • Used in applications that need to provide live data updates, such as sports scores, stock prices, or news feeds.

Here’s a simplified example of how long polling might be implemented in JavaScript and a server-side language like Node.js:

Client-side (JavaScript):

```javascript
function longPoll() {
    fetch('/poll-endpoint')
        .then(response => response.json())
        .then(data => {
            console.log('New data:', data);
            // Process the new data...

            // Immediately send another long poll request
            longPoll();
        })
        .catch(error => {
            console.error('Error:', error);
            // Retry after a delay
            setTimeout(longPoll, 5000);
        });
}

// Start the long polling process
longPoll();
```

Server-side (Node.js with Express):

```javascript
const express = require('express');
const app = express();
const port = 3000;

let clients = [];

// Endpoint for long polling
app.get('/poll-endpoint', (req, res) => {
    // Add the client response to the list of waiting clients
    clients.push(res);

    // Close the connection after a timeout to avoid holding it open forever
    setTimeout(() => {
        const index = clients.indexOf(res);
        if (index !== -1) {
            clients.splice(index, 1);
            res.json({ message: 'No new data' });
        }
    }, 30000); // 30 seconds timeout
});

// Simulate new data generation
setInterval(() => {
    if (clients.length > 0) {
        const data = { message: 'New data available' };
        // Send new data to all waiting clients
        clients.forEach(res => res.json(data));
        // Clear the clients array
        clients = [];
    }
}, 10000); // New data every 10 seconds

app.listen(port, () => {
    console.log(`Server running at http://localhost:${port}/`);
});
```

HTTP long polling is a technique to achieve near real-time communication between a client and a server by holding a connection open until new data is available. It is simple to implement and compatible with existing HTTP infrastructure but can be resource-intensive and less efficient compared to more modern techniques like WebSockets. Long polling is often used in chat applications, notification systems, and live data feeds where real-time updates are essential.

156
Q

Web sockets

A

WebSockets is a communication protocol that provides full-duplex communication channels over a single TCP connection. Unlike traditional HTTP requests which follow a request-response model, WebSockets allow for persistent, two-way communication between the client and server. This makes WebSockets particularly suitable for applications that require real-time updates, such as live chat applications, online gaming, and real-time data feeds.

Key Characteristics:

  1. Full-Duplex Communication:
    • WebSockets support full-duplex communication, allowing both the client and server to send and receive messages independently at any time.
  2. Single TCP Connection:
    • WebSockets use a single, long-lived TCP connection that remains open, enabling continuous data exchange without the overhead of establishing new connections.
  3. Low Latency:
    • By maintaining a persistent connection, WebSockets reduce the latency typically associated with establishing new connections for each message, resulting in faster communication.
  4. Bidirectional Data Flow:
    • Both the client and server can initiate communication, allowing for more interactive and responsive applications.

How It Works:

  1. Handshake:
    • The WebSocket connection begins with a handshake between the client and server. This handshake is initiated by the client using an HTTP request that includes an Upgrade header, requesting to switch the connection to the WebSocket protocol.
    • If the server supports WebSockets, it responds with an HTTP 101 status code and an Upgrade header, establishing the WebSocket connection.
  2. Data Frames:
    • Once the connection is established, data can be exchanged in the form of frames. These frames are lightweight and can carry both text and binary data.
    • WebSockets define a frame format that includes metadata such as the opcode (indicating the type of data) and payload length, allowing efficient data transfer.
  3. Persistent Connection:
    • The WebSocket connection remains open until either the client or server decides to close it. This persistent connection enables continuous, real-time communication.

Handshake Example:

  1. Client Initiates Handshake:
    • The client sends an HTTP request to the server with an Upgrade header:
      GET /chat HTTP/1.1
      Host: server.example.com
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
      Sec-WebSocket-Version: 13
  2. Server Responds:
    • The server responds with an HTTP 101 status code, upgrading the connection to WebSockets:
      HTTP/1.1 101 Switching Protocols
      Upgrade: websocket
      Connection: Upgrade
      Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

Messaging Example:

  1. Client Sends a Message:
    • The client sends a message to the server using the WebSocket connection:
      javascript
      const socket = new WebSocket('ws://server.example.com/chat');
      socket.onopen = () => {
          socket.send('Hello, Server!');
      };
  2. Server Receives and Responds:
    • The server receives the message and can respond back through the same WebSocket connection:
      javascript
      socket.onmessage = (event) => {
          console.log('Message from server:', event.data);
      };

Use Cases:

  1. Live Chat Applications:
    • WebSockets enable real-time messaging between users in chat applications, providing a seamless and interactive user experience.
  2. Online Gaming:
    • In online multiplayer games, WebSockets facilitate real-time communication between players and the game server, ensuring low-latency interactions.
  3. Real-Time Data Feeds:
    • WebSockets are used to push real-time updates for stock prices, sports scores, and other continuously updating data streams.
  4. Collaborative Tools:
    • Tools like collaborative document editing and real-time drawing applications use WebSockets to synchronize changes between multiple users in real time.
  5. IoT Devices:
    • WebSockets provide a lightweight and efficient way for IoT devices to communicate with servers and other devices, supporting real-time data transfer and control.

Advantages:

  • Low Latency: Maintains a persistent connection, reducing the overhead of establishing new connections.
  • Efficient Communication: Supports full-duplex communication, enabling real-time data exchange.
  • Reduced Server Load: Minimizes the number of HTTP requests, reducing server load and bandwidth usage.
  • Scalability: Suitable for applications that require handling a large number of concurrent connections.

Disadvantages:

  • Complexity: Requires more complex server and client implementations compared to traditional HTTP.
  • Firewall and Proxy Issues: Some firewalls and proxies may not support or may interfere with WebSocket connections.
  • Resource Management: Persistent connections can consume more server resources, requiring efficient management to handle large numbers of connections.

WebSockets provide a powerful and efficient protocol for real-time, bidirectional communication over a single TCP connection. They are well-suited for applications requiring low-latency interactions, such as live chat, online gaming, real-time data feeds, and collaborative tools. By maintaining a persistent connection, WebSockets reduce latency and improve the performance of interactive applications, making them an essential technology for modern web development.

157
Q

Server sent events

A

Server-Sent Events (SSE) is a standard for server-to-client streaming of events. It allows servers to push real-time updates to web clients over a single HTTP connection. SSE is built on top of the HTTP protocol and is simpler to implement compared to WebSockets for certain use cases that require one-way communication from the server to the client.

Key Characteristics:

  1. One-Way Communication:
    • SSE supports one-way communication from the server to the client. The server can push updates to the client, but the client cannot send data back to the server through the same connection.
  2. Persistent Connection:
    • The client establishes a persistent connection to the server, which remains open. The server can continuously send updates over this connection.
  3. Text-Based Protocol:
    • SSE uses a text-based protocol where events are sent as text messages. Each message can include fields such as event, data, id, and retry.
  4. Automatic Reconnection:
    • If the connection is lost, the client automatically attempts to reconnect to the server. The server can include a retry field to specify the reconnection delay.
  5. Event Stream:
    • Events are sent as a continuous stream of text data. Each event is separated by two newline characters, and each field within an event is separated by a newline character.

How It Works:

  1. Client Initiates Connection:
    • The client initiates the connection by sending an HTTP GET request to the server with the Accept header set to text/event-stream.
    • Example:
      http
      GET /events HTTP/1.1
      Host: server.example.com
      Accept: text/event-stream
  2. Server Responds:
    • The server responds with an HTTP 200 status code and sets the Content-Type header to text/event-stream.
    • Example:
      http
      HTTP/1.1 200 OK
      Content-Type: text/event-stream
  3. Server Sends Events:
    • The server sends events as text data over the open connection. Each event can contain multiple fields.
    • Example:
      ```text
      data: This is a message

      event: customEvent
      data: This is a custom event

      id: 12345
      data: This is another message with an ID

      retry: 5000
      data: This is a message with a custom retry interval
      ```
  4. Client Handles Events:
    • The client receives the events and processes them using JavaScript. The EventSource interface is used to establish the connection and handle incoming events.

```javascript
// Create a new EventSource instance
const eventSource = new EventSource('/events');

// Listen for messages
eventSource.onmessage = function(event) {
  console.log('Message:', event.data);
};

// Listen for custom events
eventSource.addEventListener('customEvent', function(event) {
  console.log('Custom event:', event.data);
});

// Handle errors
eventSource.onerror = function(event) {
  console.error('Error:', event);
};
```

```javascript
const express = require('express');
const app = express();
const port = 3000;

app.get('/events', (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  // Function to send an event
  const sendEvent = (data, event = null, id = null, retry = null) => {
    if (event) res.write(`event: ${event}\n`);
    if (id) res.write(`id: ${id}\n`);
    if (retry) res.write(`retry: ${retry}\n`);
    res.write(`data: ${data}\n\n`);
  };

  // Send initial message
  sendEvent('Connected to server');

  // Send periodic updates
  const timer = setInterval(() => {
    sendEvent('Periodic update', 'update', Date.now().toString());
  }, 10000);

  // Handle client disconnect: stop the timer and close the response
  req.on('close', () => {
    console.log('Client disconnected');
    clearInterval(timer);
    res.end();
  });
});

app.listen(port, () => {
  console.log(`Server running at http://localhost:${port}/`);
});
```

  1. Real-Time Notifications:
    • SSE is suitable for applications that require real-time notifications, such as social media updates, chat messages, or system alerts.
  2. Live Data Feeds:
    • Applications that need to display live data, such as stock tickers, sports scores, or news feeds, can benefit from SSE.
  3. Monitoring Dashboards:
    • SSE can be used to push updates to monitoring dashboards, providing real-time metrics and status information.
  4. Event Logs:
    • SSE is useful for streaming event logs from servers to clients for real-time monitoring and analysis.
  • Simplicity: SSE is simpler to implement compared to WebSockets, especially for one-way communication scenarios.
  • Automatic Reconnection: The client automatically reconnects if the connection is lost, improving reliability.
  • Text-Based Protocol: Easy to debug and understand, as events are sent as plain text.
  • One-Way Communication: SSE only supports server-to-client communication. For bidirectional communication, WebSockets may be a better choice.
  • Limited Browser Support: While widely supported, some older browsers may not support SSE.
  • Message Size Limitations: SSE may have limitations on the size of individual messages, depending on the server and client implementations.

Server-Sent Events (SSE) is a protocol for server-to-client streaming of events over a single HTTP connection. It supports real-time updates, automatic reconnection, and a simple text-based format, making it ideal for applications that need continuous, one-way data flow from the server to the client. While it is easier to implement than WebSockets for certain use cases, it is limited to one-way communication and may have some compatibility and message size constraints.

158
Q

Docker and Kubernetes

A

Overview:
Docker is a platform designed to make it easier to create, deploy, and run applications by using containers. Containers allow developers to package applications with all their dependencies and configurations, ensuring that the application runs consistently across different environments.

Key Features:

  1. Containerization:
    • Docker allows you to package an application and its dependencies into a single container. This container can run on any machine that has Docker installed, ensuring consistency and eliminating environment-related issues.
  2. Isolation:
    • Each Docker container runs in its own isolated environment. This isolation ensures that applications do not interfere with each other, enhancing security and stability.
  3. Efficiency:
    • Containers are lightweight and share the host system’s kernel, making them more efficient than traditional virtual machines (VMs). This allows for faster startup times and better resource utilization.
  4. Portability:
    • Docker containers can run on any system that supports Docker, whether it’s a developer’s laptop, a testing server, or a production environment. This portability simplifies the development and deployment process.
  5. Version Control:
    • Docker images (blueprints for containers) can be versioned, making it easy to track changes and roll back to previous versions if necessary.

Use Cases:
- Microservices architecture.
- Continuous Integration and Continuous Deployment (CI/CD).
- Development and testing environments.
- Running legacy applications.

Example:
A Dockerfile is used to define the environment and steps to build a Docker image.

```Dockerfile
# Use an official Python runtime as a parent image
FROM python:3.8-slim

# Set the working directory
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install the dependencies listed in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Make port 80 available outside the container
EXPOSE 80

# Define an environment variable
ENV NAME World

# Run app.py when the container launches
CMD ["python", "app.py"]
```
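
With this Dockerfile at the root of the project, the image can be built and run with standard Docker commands such as `docker build -t my-webapp .` and `docker run -p 4000:80 my-webapp` (the image name and host port here are arbitrary examples).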

Overview:
Kubernetes (often abbreviated as K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It was originally developed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF).

Key Features:

  1. Orchestration:
    • Kubernetes automates the deployment and management of containers, ensuring that the desired state of your application is maintained. It handles the distribution of containers across a cluster of machines, scaling them up or down as needed.
  2. Scalability:
    • Kubernetes can automatically scale your applications up or down based on demand, ensuring that your application can handle varying levels of traffic without manual intervention.
  3. Self-Healing:
    • Kubernetes continuously monitors the health of your application and can automatically replace failed containers, restart them if they crash, and reschedule them if a node in the cluster fails.
  4. Load Balancing and Service Discovery:
    • Kubernetes provides built-in load balancing to distribute traffic across multiple instances of your application. It also includes service discovery mechanisms to ensure that containers can find and communicate with each other.
  5. Declarative Configuration:
    • Kubernetes uses declarative configuration files (usually written in YAML or JSON) to define the desired state of your application. This makes it easier to manage complex deployments and version control your infrastructure.

Components:

  • Pods: The smallest deployable units in Kubernetes, representing a single instance of a running process in a container.
  • Services: Abstractions that define a logical set of pods and a policy to access them.
  • Deployments: Controllers that manage the deployment and scaling of a set of pods.
  • Nodes: Worker machines in a Kubernetes cluster, which can be physical or virtual.

Use Cases:
- Managing large-scale microservices architectures.
- Automating deployment and scaling of containerized applications.
- Ensuring high availability and fault tolerance.
- Simplifying DevOps workflows.

Example:
A simple Kubernetes deployment configuration for a web application.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: my-webapp-image:latest
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer
```
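
Assuming the manifest above is saved as, for example, `webapp.yaml`, it can be applied with `kubectl apply -f webapp.yaml` and inspected with `kubectl get pods` and `kubectl get service webapp-service`.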

Docker:
- A platform for containerization, providing lightweight, isolated environments for running applications.
- Focuses on packaging applications and their dependencies into portable containers.
- Ensures consistency across different environments.

Kubernetes:
- A platform for orchestrating and managing containerized applications across a cluster of machines.
- Automates deployment, scaling, and operations of application containers.
- Ensures high availability, scalability, and fault tolerance.

Both Docker and Kubernetes are essential tools in modern DevOps practices, enabling the efficient development, deployment, and management of applications. Docker provides the foundation for containerization, while Kubernetes provides the tools to orchestrate and manage these containers at scale.

159
Q

Riak versus Cassandra

A

Cassandra uses last-write-wins (LWW) to resolve conflicts, while Riak uses version vectors, stores conflicting versions as siblings, and lets the application merge them in a sensible way.

Cassandra and Riak are both distributed NoSQL databases designed to handle large amounts of data across many commodity servers, providing high availability and scalability. However, they have different design philosophies and implementations. Here’s a detailed comparison:

Cassandra:
- Column-Family Store: Cassandra uses a column-family data model, which is similar to a tabular format but more flexible. Data is stored in rows that can have different columns.
- Schema: Supports a schema with predefined tables, but rows can have dynamic columns.
- Primary Use Cases: Time-series data, logging, and write-heavy applications.

Riak:
- Key-Value Store: Riak uses a key-value model where data is stored as opaque values accessed via unique keys.
- Schema: Schema-less; data is stored as blobs (Binary Large Objects), which can be serialized/deserialized as needed by the application.
- Primary Use Cases: Scenarios requiring high availability and fault tolerance, such as session storage and caching.

Cassandra:
- Masterless Architecture: Uses a peer-to-peer architecture with a ring topology, where all nodes are equal. Data is distributed using consistent hashing.
- Replication: Supports configurable replication across multiple data centers.
- Scalability: Scales linearly by adding more nodes to the cluster.
- Consistency: Tunable consistency (you can adjust the consistency level for reads and writes).

Riak:
- Masterless Architecture: Also uses a peer-to-peer architecture and consistent hashing for data distribution.
- Replication: Automatically replicates data across nodes. Riak supports eventual consistency.
- Scalability: Easily scalable by adding more nodes.
- Consistency: Primarily eventually consistent, but offers tunable consistency settings (via the N, R, and W parameters).

Cassandra:
- Consistency: Tunable, allowing you to choose between strong consistency and eventual consistency based on your needs. Options like QUORUM, ONE, and ALL can be set for reads and writes.
- Availability: Designed for high availability with no single point of failure. Data is replicated across multiple nodes.

Riak:
- Consistency: Primarily eventually consistent but offers tunable consistency. Uses the concept of vector clocks to manage conflicts.
- Availability: Highly available, designed to be resilient to node failures. Data is automatically replicated across the cluster.
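
To illustrate Cassandra's tunable consistency mentioned above, here is a small sketch using the DataStax Python driver; the cluster address, keyspace, and table are placeholders, not part of the original comparison.

```python
# Sketch: setting a per-query consistency level with the DataStax Python driver.
# Cluster address, keyspace, and table are placeholders.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo_keyspace")

# Require a quorum of replicas to acknowledge this read.
stmt = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
rows = session.execute(stmt, (1,))
```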

Cassandra:
- CQL (Cassandra Query Language): Similar to SQL, making it easier for users familiar with relational databases to learn and use.
- Secondary Indexes: Supports secondary indexes for more complex queries.

Riak:
- Riak Query System: Uses a RESTful HTTP API or Protocol Buffers for interaction. Also supports MapReduce for more complex queries.
- Secondary Indexes: Supports secondary indexes but querying capabilities are generally less sophisticated compared to Cassandra.

Cassandra:
- Time-Series Data: Excellent for use cases like monitoring, logging, and IoT data.
- Write-Heavy Applications: Optimized for high write throughput.
- Geographically Distributed Data: Suitable for applications that require multi-data center deployments.

Riak:
- High Availability Needs: Ideal for use cases where high availability and fault tolerance are critical, such as session storage and caching.
- Eventual Consistency: Suitable for scenarios where eventual consistency is acceptable and high availability is more important.

Cassandra:
- Write Performance: Highly optimized for write-heavy workloads due to its append-only log-structured storage.
- Read Performance: Can be tuned based on consistency requirements, but read performance may be slower than some other databases for certain queries.

Riak:
- Write Performance: Generally good, but may not match Cassandra in write-heavy scenarios.
- Read Performance: Designed for high availability and can provide good read performance, but may suffer from increased latency due to conflict resolution mechanisms.

Cassandra:
- Community: Large and active community with strong support from the Apache Software Foundation.
- Ecosystem: Rich ecosystem with many tools and integrations available for data modeling, monitoring, and management.

Riak:
- Community: Smaller community compared to Cassandra, but still active.
- Ecosystem: Good support and tools available, but not as extensive as Cassandra’s ecosystem.

Cassandra:
- Strengths: High write performance, tunable consistency, large ecosystem, and strong community support. Excellent for time-series data and geographically distributed deployments.
- Weaknesses: Read performance can be less predictable, and complex querying capabilities are limited compared to traditional SQL databases.

Riak:
- Strengths: High availability, fault tolerance, and simplicity in key-value operations. Excellent for scenarios where high availability is critical.
- Weaknesses: Querying capabilities are less advanced, and the community and ecosystem are smaller compared to Cassandra.

The choice between Cassandra and Riak will depend on your specific use case, including your requirements for consistency, availability, performance, and the complexity of your data model and queries.

160
Q

The diff btw commit log and WAL

A

Commit logs and write-ahead logs (WAL) are both mechanisms used in database systems to ensure data durability and integrity. While they share similar purposes, they are used in slightly different contexts and have distinct characteristics. Here’s a detailed comparison of the two:

Purpose:
- The primary purpose of a commit log is to ensure data durability. It records every change to the database so that in the event of a failure, the database can be restored to its last consistent state.

Characteristics:
1. Data Durability:
- Ensures that all data changes are recorded before they are applied to the database.
2. Sequential Writes:
- Typically uses sequential writes to a log file, which can be efficiently written and read from disk.
3. Replication:
- In distributed databases, the commit log is often used to replicate changes across nodes to ensure consistency and fault tolerance.

Usage Example:
- Cassandra: Uses a commit log to record every write operation. Data is first written to the commit log and then to an in-memory structure called a memtable. The commit log ensures that no data is lost in case of a crash.

Workflow:
1. A write operation is initiated.
2. The write is recorded in the commit log.
3. The data is then written to the database (often in-memory first, then persisted).
4. Upon recovery, the database can replay the commit log to restore to the last consistent state.

Purpose:
- The primary purpose of a write-ahead log is to ensure atomicity and durability in database transactions. WAL ensures that all changes are logged before they are applied, which allows the database to roll back incomplete transactions and recover from crashes.

Characteristics:
1. Transaction Safety:
- Ensures that all changes within a transaction are recorded before they are applied, providing a way to roll back changes if necessary.
2. Sequential Writes:
- Similar to commit logs, WAL uses sequential writes to a log file, making it efficient for disk operations.
3. Recovery:
- Provides a mechanism to replay or roll back transactions during recovery to maintain database consistency.

Usage Example:
- PostgreSQL: Uses WAL to record changes before they are applied to the main database files. This ensures that transactions are atomic and durable, allowing the database to recover to a consistent state after a crash.

Workflow:
1. A transaction begins and modifies data.
2. The changes are written to the WAL before being applied to the database.
3. Once the transaction is committed, the changes are applied to the database.
4. If a crash occurs, the database can replay the WAL to apply committed transactions or roll back incomplete ones.
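
To make the log-first-then-apply ordering concrete, here is a minimal, illustrative sketch of a write-ahead log for a key-value map. This is not how any particular database implements WAL; the file name and JSON encoding are arbitrary choices.

```python
# Minimal WAL sketch: log the change durably, then apply it to the in-memory
# state; replay the log on restart to rebuild the state.
import json
import os

class SimpleWAL:
    def __init__(self, path="wal.log"):
        self.path = path
        self.state = {}
        self._replay()                       # recover logged changes on startup

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def put(self, key, value):
        record = {"key": key, "value": value}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())             # force the log record to disk first
        self.state[key] = value              # only then apply to the data
```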

  • Commit Log:
    • Focuses on ensuring data durability by recording every write operation. Commonly used in distributed databases like Cassandra to ensure no data is lost during crashes and to facilitate data replication across nodes.
  • Write-Ahead Log (WAL):
    • Ensures transaction atomicity and durability by logging changes before they are applied to the database. Used in transactional databases like PostgreSQL to ensure that transactions can be rolled back if necessary and to maintain consistency after crashes.

Both commit logs and WALs are essential for maintaining data integrity and durability, but they serve slightly different roles based on the requirements of the database system they are used in.

| Aspect | Commit Log | Write-Ahead Log (WAL) |
|---|---|---|
| Primary Purpose | Ensure data durability | Ensure transaction atomicity and durability |
| Usage Context | Distributed databases (e.g., Cassandra) | Transactional databases (e.g., PostgreSQL) |
| Data Recording | Records each write operation | Records changes before applying them |
| Transaction Handling | May not track transaction boundaries | Tracks transaction boundaries, ensuring atomicity |
| Recovery | Replays log to restore data | Replays or rolls back transactions to maintain consistency |
| Sequential Writes | Yes | Yes |
| Replication | Often used for data replication | Generally not used for replication |

161
Q

Kafka streams, Apache Spark Streaming, Apache Flink

A

Choosing between Kafka Streams, Apache Spark Streaming, and Apache Flink for stream processing depends on several factors, including the specific requirements of your application, performance needs, ease of use, scalability, and integration capabilities. Here are the trade-offs associated with each of these stream processing frameworks:

Kafka Streams

Pros:

1.	Tight Integration with Kafka:
•	Kafka Streams is part of the Kafka ecosystem and is designed to work seamlessly with Kafka. This tight integration simplifies the development of stream processing applications that use Kafka as the data source and sink.
2.	Simplicity and Ease of Use:
•	Kafka Streams provides a simple API for processing streams of data, which can be easier to learn and use compared to more complex frameworks like Spark and Flink.
•	It operates as a library within your application, requiring no separate cluster or infrastructure.
3.	Low Latency:
•	Designed for low-latency processing, making it suitable for real-time applications where quick response times are critical.
4.	Scalability:
•	Kafka Streams scales with the Kafka cluster. As Kafka partitions data, Kafka Streams can process these partitions in parallel, leveraging Kafka’s inherent scalability.

Cons:

1.	Feature Limitations:
•	Compared to Spark and Flink, Kafka Streams has fewer built-in features for complex stream processing tasks like advanced windowing, stateful processing, and complex event processing (CEP).
2.	Operational Overhead:
•	Since Kafka Streams runs within your application, it requires more attention to operational aspects such as scaling, fault tolerance, and monitoring.

Apache Spark Streaming

Pros:

1.	Unified Batch and Stream Processing:
•	Spark provides a unified framework for batch and stream processing, allowing you to use the same APIs and execution engine for both types of workloads.
2.	Rich Ecosystem:
•	Spark’s ecosystem includes powerful libraries for SQL (Spark SQL), machine learning (MLlib), and graph processing (GraphX), making it a versatile choice for various data processing tasks.
3.	Scalability and Fault Tolerance:
•	Spark Streaming is built on the Spark engine, which provides robust scalability and fault tolerance features.
4.	Advanced Analytics:
•	Supports advanced analytics and machine learning integration, enabling sophisticated data processing pipelines.

Cons:

1.	Higher Latency:
•	Spark Streaming originally used micro-batching, which can result in higher latencies compared to true stream processing frameworks. However, Spark Structured Streaming aims to address this with continuous processing, though it still might not match the low latency of Kafka Streams and Flink in some scenarios.
2.	Complexity:
•	Spark’s complexity can be higher compared to Kafka Streams, requiring more effort to set up, manage, and optimize.
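
For a feel of the Spark side, here is a minimal Structured Streaming sketch that consumes a Kafka topic and prints records to the console. The broker address and topic name are placeholders, and the job assumes the Spark Kafka connector package is available on the classpath.

```python
# Sketch: Spark Structured Streaming reading from Kafka and writing to the console.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka rows expose key/value as binary; cast them to strings for display.
decoded = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query = decoded.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```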

Apache Flink

Pros:

1.	True Stream Processing:
•	Flink offers true stream processing with event-time processing and sophisticated windowing capabilities, providing low-latency processing.
2.	Advanced Features:
•	Flink provides advanced state management, exactly-once semantics, and complex event processing (CEP), making it suitable for complex real-time applications.
3.	Scalability and Performance:
•	Highly scalable with efficient resource management and support for large-scale data processing.
4.	Fault Tolerance:
•	Flink’s checkpointing mechanism ensures strong fault tolerance and state consistency.

Cons:

1.	Complexity:
•	Flink can be more complex to learn and use, with a steeper learning curve compared to Kafka Streams.
2.	Operational Overhead:
•	Requires managing a Flink cluster, which can introduce additional operational overhead compared to a library like Kafka Streams.

Choosing the Right Framework

•	Kafka Streams: Best for applications that require tight integration with Kafka, low latency, and simpler stream processing needs. Ideal for applications where Kafka is the primary data source and sink.
•	Apache Spark Streaming: Suitable for applications that need to process both batch and streaming data, leverage advanced analytics, or integrate with Spark’s rich ecosystem. Good for use cases where micro-batching latency is acceptable.
•	Apache Flink: Best for complex stream processing applications requiring true low-latency processing, advanced stateful processing, and exactly-once semantics. Ideal for applications with complex event processing requirements and real-time analytics.

The choice between these frameworks depends on the specific requirements of your application, including latency needs, processing complexity, scalability, and the existing technology stack.

162
Q

How does two-phase locking prevent lost updates

A

A critical detail: in practice, a transaction like T2 would not inherently know that it needs to re-read X after being unblocked. This is where proper transaction isolation levels (such as repeatable read) and validation become essential.

The steps below show how T2 ends up reading the correct, up-to-date value of X.

  1. Growing Phase:
    • T1 acquires a shared lock on X.
    • T2 acquires a shared lock on X.
    • Both T1 and T2 read X (X = 100).
  2. Transition to Exclusive Lock:
    • T1 needs to write to X, so it attempts to upgrade to an exclusive lock on X.
    • T2 also needs to write to X, so it attempts to upgrade to an exclusive lock on X.
  3. Conflict Resolution:
    • T1 acquires the exclusive lock first; T2 is blocked.
    • T1 writes X (X = 110).
    • T1 releases the exclusive lock on X.
  4. T2 Proceeds:
    • T2, which was blocked, now attempts to acquire the exclusive lock on X.
    • At this point, since T2 was blocked and it’s retrying the operation, the database system must ensure T2 re-reads the value of X to maintain consistency.
    • Database Management System’s Role:
      • The DBMS should enforce that T2 re-reads the value of X when it finally acquires the exclusive lock.
      • This is typically achieved by validating T2’s read again before the write (lost-update detection), rather than by trusting the value T2 read before it was blocked.

This process typically involves mechanisms at the DBMS level that go beyond basic two-phase locking:

  1. Snapshot Isolation:
    • T2 operates under snapshot isolation where it reads a consistent snapshot of the data as of the start of its transaction.
    • Upon retrying after being unblocked, T2 detects that the data it read initially has been changed and must re-read.
  2. Validation Phase:
    • When T2 acquires the exclusive lock, it enters a validation phase where it re-checks the value of X to ensure it is consistent with the current state before proceeding with the write.
  1. Initial Reads:
    • T1 reads X = 100.
    • T2 reads X = 100.
  2. T1’s Update:
    • T1 upgrades to an exclusive lock, writes X = 110, and releases the lock.
  3. T2’s Update Attempt:
    • T2, now unblocked, re-acquires the lock and reads X again.
    • T2 detects that X has changed from 100 to 110 since its initial read.
    • Validation: T2 must re-validate its read against the current state of X.
    • T2 now reads X = 110 and proceeds with its operation, writing X = 130.
    • T2 releases the lock.

Two-phase locking prevents lost updates by ensuring exclusive access for writes. However, for transactions to correctly handle concurrent modifications, additional mechanisms such as snapshot isolation and validation phases are necessary. These mechanisms ensure that when a transaction like T2 is unblocked and acquires an exclusive lock, it re-reads and validates the data to reflect the latest state, maintaining consistency and preventing anomalies like lost updates.
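
A minimal sketch of this re-read-and-validate pattern follows; it is illustrative only (a plain exclusive lock rather than a shared-to-exclusive upgrade, and a dict standing in for the database), not any particular DBMS's locking implementation.

```python
# Sketch: preventing a lost update by re-reading X under an exclusive lock.
import threading

db = {"X": 100}
x_lock = threading.Lock()

def add_to_x(amount):
    with x_lock:                 # exclusive access: no other writer can interleave
        current = db["X"]        # re-read AFTER acquiring the lock, not before
        db["X"] = current + amount

t1 = threading.Thread(target=add_to_x, args=(10,))
t2 = threading.Thread(target=add_to_x, args=(20,))
t1.start(); t2.start()
t1.join(); t2.join()
print(db["X"])                   # always 130; neither update is lost
```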

163
Q

3 level of issues for isolation

A

Check the diagram

164
Q

What is MVCC

A

Multi-Version Concurrency Control (MVCC) is a method used by database management systems to handle concurrent transactions, ensuring consistency and isolation. MVCC allows multiple versions of data to coexist, which helps manage read and write operations without locking mechanisms that could otherwise significantly hinder performance.

  1. Versioning:
    • Snapshots: Each transaction works with a consistent snapshot of the database taken at the time the transaction starts. This snapshot includes the state of all data as it existed at that time.
    • Versions: When a transaction modifies data, it creates a new version of the data without overwriting the existing version. This allows other transactions to continue reading the old version while the new version is being prepared.
  2. Transaction Timestamps:
    • Start Timestamp: Each transaction is assigned a timestamp at the start, which helps determine the snapshot of data it will operate on.
    • Commit Timestamp: When a transaction commits, it gets a commit timestamp, which marks the version of data it has produced as the latest.
  3. Visibility Rules:
    • Read: Transactions can see data versions that were committed before the transaction started. They cannot see versions created by other transactions that started after them or versions that are uncommitted.
    • Write: Transactions create new versions of data without immediately affecting other transactions’ views. When a transaction commits, its changes become visible to subsequent transactions.
  1. Read Operation:
    • A transaction reads the version of data that was committed at the time the transaction started.
    • This ensures a consistent view of the data throughout the transaction, regardless of other concurrent writes.
  2. Write Operation:
    • When a transaction writes data, it creates a new version of the data.
    • The old version remains available for other transactions that started before the writing transaction committed.
    • This new version becomes visible to other transactions only after the writing transaction commits.
  • High Concurrency: MVCC allows multiple transactions to read and write data concurrently without blocking each other.
  • Consistent Reads: Transactions always read a consistent snapshot of the database, avoiding the need for locks on read operations.
  • Reduced Lock Contention: By allowing reads to proceed without locks, MVCC reduces contention between transactions, improving overall performance.
  • Storage Overhead: MVCC requires additional storage to maintain multiple versions of data.
  • Garbage Collection: Periodic cleanup (or vacuuming) is necessary to remove old, unneeded versions of data, which can add overhead.

Let’s consider an example using PostgreSQL, which implements MVCC.

Initial State:
- A table users with one row: id = 1, name = 'Alice', age = 30.

Transactions:
- T1 starts and reads the users table.
- T2 starts, updates age for id = 1 to 31, and commits.
- T1 continues and tries to read the users table again.

  1. T1 Begins:
    • T1 takes a snapshot of the database at the start.
    • T1 reads users where id = 1: sees name = 'Alice', age = 30.
  2. T2 Begins:
    • T2 takes a snapshot of the database at the start.
    • T2 updates users where id = 1 to age = 31.
    • T2 commits, making this change visible to subsequent transactions.
  3. T1 Continues:
    • T1 re-reads users where id = 1.
    • Since T1 operates on the snapshot taken at its start, it still sees name = 'Alice', age = 30.

Final State:
- For T1: name = 'Alice', age = 30 (snapshot view).
- For new transactions: name = 'Alice', age = 31.

PostgreSQL implements MVCC using system columns for each row:
- xmin: Transaction ID of the transaction that inserted the row.
- xmax: Transaction ID of the transaction that deleted the row (or null if not deleted).

When a transaction reads a row, it checks these columns to determine if the row is visible according to the transaction’s snapshot.
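
A highly simplified sketch of such a visibility check is shown below; real PostgreSQL visibility rules also account for in-progress and aborted transactions, hint bits, and more, so treat this purely as an illustration of the idea.

```python
# Sketch of simplified MVCC visibility: a row version is visible to a snapshot
# if it was inserted by a transaction that committed before the snapshot was
# taken and has not been deleted by such a transaction.
def is_visible(row, snapshot_xid, committed_before):
    """row carries xmin/xmax transaction IDs; committed_before(xid, snapshot_xid)
    reports whether xid committed before the snapshot was taken."""
    inserted_visible = committed_before(row["xmin"], snapshot_xid)
    deleted = row["xmax"] is not None and committed_before(row["xmax"], snapshot_xid)
    return inserted_visible and not deleted

# Example: a row inserted by xid 100 and deleted by xid 205.
row = {"xmin": 100, "xmax": 205, "data": {"id": 1, "name": "Alice", "age": 30}}
committed_before = lambda xid, snap: xid < snap   # toy rule: lower xids committed earlier
print(is_visible(row, snapshot_xid=200, committed_before=committed_before))  # True
print(is_visible(row, snapshot_xid=300, committed_before=committed_before))  # False
```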

Multi-Version Concurrency Control (MVCC) is a concurrency control mechanism that allows multiple transactions to read and write data simultaneously by maintaining multiple versions of data. It provides high concurrency and consistent reads without the need for locks on read operations. MVCC is widely used in modern databases like PostgreSQL and MySQL’s InnoDB engine to balance performance and data consistency.

165
Q

How is serializability different from linearizability

A

Serializability and linearizability are both consistency models used in distributed systems and databases to ensure correctness of concurrent operations, but they have distinct definitions and use cases. Here’s a detailed explanation of the differences between them:

Definition:
- Serializability is a property of a schedule (sequence of operations) where the outcome is the same as if the transactions were executed in some serial order, one after another, without any overlap in time.
- It is the highest level of isolation in database systems and ensures that concurrent transactions do not interfere with each other in a way that would produce incorrect results.

Use Case:
- Database transactions, where ensuring the correctness of concurrent operations is critical, such as financial transactions or inventory management.

Key Points:
1. Transactional Context:
- Serializability applies to a sequence of transactions, ensuring that the interleaved execution of transactions is equivalent to some serial execution of those transactions.

  1. Anomalies Prevented:
    • Prevents anomalies such as dirty reads, non-repeatable reads, and phantom reads.
  2. Implementation:
    • Can be implemented using two-phase locking, timestamp ordering, or other concurrency control mechanisms.

Example:
- Transactions T1 and T2:
- T1: Read(X), Write(X)
- T2: Read(X), Write(X)
- A schedule is serializable if the final state of the database is the same as if T1 and T2 were executed sequentially, either T1 followed by T2 or T2 followed by T1.
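
One standard way to test whether such an interleaving is (conflict-)serializable is to build a precedence graph and check it for cycles. The sketch below encodes a schedule as (transaction, operation, item) triples; that encoding is an assumption for illustration.

```python
# Sketch: conflict-serializability test via a precedence graph.
# An edge Ti -> Tj is added when an operation of Ti conflicts with a later
# operation of Tj on the same item (at least one of the two is a write).
def is_conflict_serializable(schedule):
    edges = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        for tj, op_j, item_j in schedule[i + 1:]:
            if ti != tj and item_i == item_j and (op_i == "W" or op_j == "W"):
                edges.add((ti, tj))

    nodes = {t for t, _, _ in schedule}

    def has_cycle(node, visiting, done):
        visiting.add(node)
        for a, b in edges:
            if a == node:
                if b in visiting or (b not in done and has_cycle(b, visiting, done)):
                    return True
        visiting.discard(node)
        done.add(node)
        return False

    # The schedule is conflict-serializable iff the precedence graph is acyclic.
    return not any(has_cycle(n, set(), set()) for n in nodes)

# The interleaving T1:R(X), T2:R(X), T1:W(X), T2:W(X) is not serializable.
schedule = [("T1", "R", "X"), ("T2", "R", "X"), ("T1", "W", "X"), ("T2", "W", "X")]
print(is_conflict_serializable(schedule))  # False
```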

Definition:
- Linearizability (or atomic consistency) is a property of individual operations on a single object, ensuring that each operation appears to occur instantaneously at some point between its invocation and its response.
- It provides real-time guarantees about the order of operations.

Use Case:
- Distributed systems and concurrent data structures, where ensuring immediate consistency of operations on shared objects is critical, such as key-value stores, distributed queues, or counters.

Key Points:
1. Real-Time Constraints:
- Linearizability considers the actual time at which operations occur and ensures that the results of operations are consistent with real-time ordering.

  1. Operation-Level Consistency:
    • Applies to individual operations on single objects rather than entire transactions.
  2. Anomalies Prevented:
    • Prevents anomalies such as stale reads and ensures immediate visibility of writes.

Example:
- Operations on a shared register R:
- Operation O1: Write(R, 1)
- Operation O2: Read(R) -> 1
- A system is linearizable if O2 returns 1 only if O1 has been completed, reflecting the real-time ordering of operations.

  • Serializability: Ensures a globally consistent order of transactions in a database. It is concerned with the correctness of interleaved transaction executions.
  • Linearizability: Ensures immediate consistency of operations on a single object. It is concerned with the correctness of individual operations in real-time.
  • Serializability: Ensures that transactions are executed in a way that is consistent with some serial order. It does not impose real-time constraints on the order of transactions.
  • Linearizability: Ensures that operations reflect the real-time order in which they were applied. It provides a stronger guarantee than serializability by considering the exact timing of operations.
  • Serializability: Typically implemented in databases to ensure the correctness of concurrent transactions. Common techniques include two-phase locking, timestamp ordering, and serialization graphs.
  • Linearizability: Often implemented in distributed systems to ensure immediate consistency of operations on shared objects. Common techniques include consensus algorithms like Paxos and Raft.

Consider a distributed key-value store:

  • Serializability:
    • Ensures that if two clients concurrently perform operations (e.g., one writes a value while another reads the value), the outcome is as if the operations were performed sequentially in some order.
    • Does not guarantee that the read operation will immediately see the result of the write operation in real-time.
  • Linearizability:
    • Ensures that if a client writes a value and another client reads the value, the read operation will see the most recent write immediately, reflecting the real-time order of operations.
    • Provides a stronger guarantee that the value read is the latest value written.
  • Serializability: Ensures that the outcome of executing concurrent transactions is equivalent to some serial execution of those transactions. It is used in databases to ensure transaction correctness without considering real-time constraints.
  • Linearizability: Ensures that operations on a single object appear instantaneously and reflect real-time ordering. It is used in distributed systems to provide immediate consistency and real-time guarantees for operations on shared objects.

Understanding the differences between these consistency models helps in choosing the right approach for ensuring correctness and consistency in various applications, whether they are transaction-oriented databases or operation-oriented distributed systems.

166
Q

How do we define that event A happens before event B in a distributed system?

A

Check picture

167
Q

What’s the difference between physical clock and logical clock in distributed system?

A

Logical clocks are designed to capture causal relationships (the happens-before ordering of events), whereas physical clocks track wall-clock time, which can drift between nodes.

168
Q

What is gossip protocol

A

Check picture

169
Q

What’s the way to achieve total order broadcast

A

Check pic

170
Q

What’s idempotency

A

Check pic

171
Q

When should we use which broadcast algorithm

A
172
Q

Atomic commit versus distributed consensus

A

Check pic

173
Q

How is Two phase commit different from Raft for log replication

A

Raft and two-phase commit (2PC) share some similarities in that both are used to achieve consistency in distributed systems, but they are fundamentally different in their design, goals, and mechanisms. Here’s a detailed comparison to highlight their differences and similarities:

Overview:
- 2PC is a coordination protocol used to achieve atomic commitment in distributed systems, ensuring that all participants in a transaction either commit or abort the transaction.
- It is commonly used in distributed databases and transaction processing systems.

Phases:
1. Prepare Phase:
- The coordinator sends a prepare request to all participants, asking if they can commit the transaction.
- Each participant writes the transaction to its local log and responds with yes (ready to commit) or no (cannot commit).

  1. Commit Phase:
    • If all participants respond yes, the coordinator sends a commit request, and all participants commit the transaction.
    • If any participant responds no or if there is a failure, the coordinator sends an abort request, and all participants abort the transaction.

Characteristics:
- Atomicity: Ensures that a transaction is either fully committed or fully aborted across all participants.
- Blocking: If the coordinator crashes during the protocol, participants may be left in an uncertain state, waiting indefinitely for the coordinator to recover.
- Simplicity: Suitable for ensuring atomicity in distributed transactions but does not handle leader election, fault tolerance, or log replication on its own.
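
As a rough illustration of the prepare/commit flow above, here is a minimal coordinator sketch. The participant interface with prepare()/commit()/abort() methods is hypothetical; a real implementation would use RPCs with timeouts, retries, and durable coordinator logging.

```python
# Sketch: coordinator logic for two-phase commit.
def two_phase_commit(participants):
    # Phase 1: ask every participant whether it can commit.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())      # True = "yes, ready to commit"
        except Exception:
            votes.append(False)            # unreachable participant counts as "no"

    # Phase 2: commit only if every participant voted yes; otherwise abort.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    else:
        for p in participants:
            p.abort()
        return "aborted"
```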

Overview:
- Raft is a consensus algorithm designed to manage a replicated log in a distributed system, ensuring strong consistency and fault tolerance.
- It is used for leader election, log replication, and maintaining a consistent state across distributed nodes.

Phases and Components:
1. Leader Election:
- Nodes elect a leader using a majority vote.
- The leader is responsible for managing the log and coordinating updates.

  1. Log Replication:
    • The leader accepts client requests and appends them to its log.
    • The leader sends log entries to followers and waits for a majority (quorum) to acknowledge them.
    • Once a majority acknowledges, the leader commits the entries and informs the followers.
  2. Safety and Fault Tolerance:
    • Raft ensures that committed log entries are durable and won’t be lost.
    • If the leader crashes, a new leader is elected, and the system continues to operate.

Characteristics:
- Strong Consistency: Ensures that all nodes have a consistent view of the log.
- Fault Tolerance: Designed to handle node failures and network partitions gracefully.
- Leadership: Uses a single leader to coordinate updates, which simplifies the protocol and ensures consistency.

Purpose:
- 2PC: Focuses on achieving atomic commitment for distributed transactions.
- Raft: Focuses on managing a replicated log, ensuring strong consistency, and handling leader election and fault tolerance.

Mechanisms:
- 2PC: Uses a two-phase approach (prepare and commit) to coordinate the commitment of transactions across participants.
- Raft: Uses leader election, log replication, and majority voting to maintain a consistent state across nodes.

Blocking vs. Non-Blocking:
- 2PC: Can be blocking if the coordinator fails, leaving participants in an uncertain state.
- Raft: Designed to be non-blocking, with mechanisms for leader election and recovery from failures.

Complexity and Use Cases:
- 2PC: Simpler but can be less robust in handling failures and requires additional mechanisms for leader election and fault tolerance.
- Raft: More complex but provides a comprehensive solution for consensus, leader election, log replication, and fault tolerance.

2PC Example:
- Distributed databases where a transaction spans multiple nodes, requiring atomic commitment.
- Ensuring that all participants in a distributed transaction either commit or abort the changes.

Raft Example:
- Distributed key-value stores (e.g., etcd, Consul) where strong consistency and fault tolerance are required.
- Managing a replicated log in a distributed system, ensuring that all nodes have the same state.

While Raft and two-phase commit share the goal of ensuring consistency in distributed systems, they are designed for different purposes and operate differently:

  • 2PC: A coordination protocol for achieving atomic commitment in distributed transactions. It ensures that all participants in a transaction either commit or abort but can be blocking and does not handle leader election or fault tolerance.
  • Raft: A consensus algorithm designed for managing a replicated log with strong consistency and fault tolerance. It handles leader election, log replication, and ensures that all nodes have a consistent view of the state.

Raft is not just about coordinating transactions but about maintaining a consistent and fault-tolerant distributed system, making it suitable for more complex and resilient distributed systems compared to the simpler 2PC protocol.

174
Q

Does Raft enforce that all reads go to leader?

A

Raft does not strictly enforce that all reads must go to the leader; however, it provides mechanisms to ensure consistency when reads are performed by followers. The primary reason reads often go to the leader is to maintain strong consistency, but Raft allows for optimized read operations under certain conditions.

  1. Leader-Based Reads:
    • By default, clients send read requests to the leader to ensure they get the most up-to-date data.
    • The leader has the latest committed state, so reads from the leader are always consistent.
  2. Read Index:
    • Raft provides a mechanism called the “read index” which allows followers to serve read requests while still ensuring consistency.
    • The leader periodically sends heartbeats (AppendEntries RPCs) to followers. These heartbeats contain the leader’s commit index.
    • A follower can serve a read request by ensuring that it has applied all entries up to a given read index, which corresponds to the latest commit index known by the leader.
  1. Strong Consistency:
    • To guarantee strong consistency, reads are typically routed through the leader.
    • This ensures that clients always see the latest committed data.
  2. Read Index Optimization:
    • To reduce the load on the leader and improve read performance, followers can serve read requests using the read index mechanism.
    • The read index ensures that the follower’s state is at least as up-to-date as the commit index included in the leader’s heartbeat.
  1. Leader Heartbeats:
    • The leader sends periodic heartbeats to all followers, including its current commit index.
    • Followers update their local read index to match the leader’s commit index.
  2. Serving Reads from Followers:
    • When a follower receives a read request, it checks its local read index.
    • If the follower’s applied log index is greater than or equal to the read index, it can safely serve the read request.
  1. Stale Reads Prevention:
    • Followers must ensure they are not serving stale data by verifying their read index.
    • If a follower’s state is not up-to-date, it can forward the read request to the leader or wait until it catches up.
  2. Read Request Redirection:
    • If a follower cannot guarantee that its state is up-to-date, it redirects the read request to the leader.
    • The leader can then serve the read request directly, ensuring consistency.
  1. Trade-offs:
    • Serving reads from followers can improve performance and reduce the leader’s load, but it requires careful handling to ensure consistency.
    • Implementing the read index mechanism adds complexity but provides a balance between performance and consistency.
  2. Cluster Configuration:
    • The use of leader-based reads versus follower-based reads with read index can be configured based on the application’s consistency and performance requirements.
  • Raft does not strictly enforce that all reads go to the leader, but it provides mechanisms to ensure consistent reads.
  • Leader-based reads guarantee strong consistency by always providing the latest committed data.
  • Read index mechanism allows followers to serve reads by ensuring they have applied the latest committed entries, reducing the load on the leader.
  • Stale reads are prevented by ensuring followers only serve reads when their state is up-to-date with the leader’s commit index.

By leveraging these mechanisms, Raft ensures that read operations can be performed efficiently while maintaining the necessary consistency guarantees, offering flexibility in how read requests are handled in a distributed system.
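
To make the read-index check concrete, here is a rough sketch of the follower-side logic; the field names and heartbeat handling are simplifications for illustration, not Raft's actual message format.

```python
# Sketch: a follower deciding whether it can serve a read locally.
# read_index is the commit index taken from the leader's latest heartbeat;
# last_applied is how far this follower has applied its own log.
class Follower:
    def __init__(self):
        self.read_index = 0      # latest commit index seen in a leader heartbeat
        self.last_applied = 0    # index of the last log entry applied locally
        self.state_machine = {}

    def on_heartbeat(self, leader_commit_index):
        self.read_index = max(self.read_index, leader_commit_index)

    def read(self, key):
        if self.last_applied >= self.read_index:
            # Safe: local state is at least as fresh as the leader's commit point.
            return self.state_machine.get(key)
        raise RuntimeError("not caught up; redirect this read to the leader")
```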

175
Q

How is linearizability implemented in distributed system?

A

Linearizability, also known as strong consistency, ensures that all operations appear to execute atomically in some order that is consistent with the real-time ordering of those operations. Implementing linearizability differs between single leader replication and leaderless replication due to their distinct architectures and operational mechanisms.

In single leader replication, one node (the leader) is responsible for handling all write operations and coordinating the replication of these writes to follower nodes. Here’s how linearizability is implemented in this context:

  1. Write Operations:
    • All writes are directed to the leader: Clients send their write requests to the leader node. The leader processes these writes in the order they are received.
    • Log Replication: The leader appends each write to its log and then replicates the log entry to follower nodes. The write is not considered committed until it has been acknowledged by a majority (quorum) of the followers.
    • Commit: Once a quorum of followers acknowledges the log entry, the leader commits the entry and applies the change to its state machine. The leader then informs the client that the write has been successfully completed.
  2. Read Operations:
    • Strong Consistency Reads: To ensure linearizability, read operations are directed to the leader, which has the most up-to-date state.
    • Follower Reads (Optional): In some systems, followers can handle read requests if they have mechanisms to ensure they are reading up-to-date data, such as reading only after a leader’s confirmation or using version numbers.
  3. Handling Failures:
    • Leader Failure: If the leader fails, a new leader is elected through a consensus protocol (e.g., Raft). The new leader must have the most up-to-date log entries to maintain linearizability.
    • Follower Failure: If a follower fails, it can catch up with the leader’s state when it recovers.

Example: Raft, an example of a consensus algorithm that provides linearizability in single leader replication:
- The leader handles all client interactions for reads and writes.
- Followers apply the log entries in the same order as the leader to ensure consistency.
- A new leader is elected only if it has the latest log entries, ensuring no committed writes are lost.

In leaderless replication, there is no single leader node. Instead, clients can send write and read requests to any node, and the system uses consensus mechanisms to achieve consistency. Implementing linearizability in leaderless replication is more complex and involves ensuring that all nodes reach a consistent state despite concurrent operations.

  1. Write Operations:
    • Quorum-Based Writes: Clients send write requests to a set of nodes (typically a majority). Each node that receives the request records the write and responds to the client.
    • Coordination: The client waits for acknowledgments from a quorum of nodes before considering the write operation successful. This quorum ensures that the write is replicated to a sufficient number of nodes to maintain consistency.
    • Versioning: Writes are often tagged with version numbers or timestamps to order them and detect conflicts.
  2. Read Operations:
    • Quorum-Based Reads: To ensure linearizability, read requests are sent to a set of nodes (again, typically a majority). The client waits for responses from a quorum of nodes and uses version numbers to determine the most recent value.
    • Read Repair: If inconsistencies are detected during a read, the client can initiate a read repair process to synchronize the nodes.
  3. Handling Conflicts:
    • Conflict Resolution: When concurrent writes lead to conflicts (e.g., different nodes receive different writes at the same logical time), the system must resolve these conflicts. Common strategies include last-write-wins (LWW), merging values, or application-specific resolution logic.

Example: Amazon DynamoDB, an example of a leaderless replication system:
- Clients use quorum-based protocols for both reads and writes.
- Versioning (vector clocks) is used to track and resolve conflicts.
- Linearizability is ensured by requiring reads and writes to involve a quorum of nodes, ensuring that any read reflects the most recent write.
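
A minimal sketch of the quorum read/write pattern just described appears below; the in-memory replicas, caller-supplied version numbers, and the N/W/R values are all simplifying assumptions.

```python
# Sketch: quorum-based write and read over N replicas with version numbers.
# Each replica maps key -> (version, value); choosing W + R > N makes every
# read quorum overlap the latest write quorum.
N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]

def write(key, value, version):
    acks = 0
    for replica in replicas:
        stored = replica.get(key, (0, None))
        if version > stored[0]:
            replica[key] = (version, value)
        acks += 1
        if acks >= W:                 # stop once a write quorum has acknowledged;
            break                     # real systems update the rest asynchronously
    return acks >= W

def read(key):
    responses = [replica.get(key, (0, None)) for replica in replicas[:R]]
    return max(responses, key=lambda vv: vv[0])[1]   # highest version wins

write("x", "hello", version=1)
print(read("x"))                      # "hello"
```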

  • Single Leader Replication: Linearizability is achieved by directing all writes to a single leader, which orders the writes and replicates them to followers. Reads are typically served by the leader to ensure they reflect the most recent writes.
  • Leaderless Replication: Linearizability is achieved using quorum-based protocols for reads and writes. Writes and reads are sent to a majority of nodes, and versioning is used to ensure that the most recent values are read and written, maintaining consistency across the system.

Each approach has its trade-offs in terms of complexity, performance, and fault tolerance, but both can be designed to ensure that operations appear to execute atomically and in a manner consistent with real-time ordering, providing strong consistency guarantees in distributed systems.

176
Q

Power of 2 to bytes

A

Check pic

177
Q

Latency numbers that you should remember

A

Check out picture

178
Q

Availability numbers

A