Deep Dive - DynamoDB Flashcards
What is used as the primary key of a table in DynamoDB?
The partition key is part of the primary key of a table. It is required.
Optionally, the sort key can also be part of the primary key. This is equivalent to a multi-column (composite) primary key in MySQL.
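As a rough illustration, the two key shapes could be declared like this when creating a table. The table and attribute names (`users`, `orders`, `user_id`, `order_date`) are invented for this sketch; the objects follow the shape of CreateTable parameters but are only built here, not sent to AWS.

```javascript
// Hypothetical table with a partition key only (simple primary key).
const usersTable = {
  TableName: 'users',
  AttributeDefinitions: [{ AttributeName: 'user_id', AttributeType: 'N' }],
  KeySchema: [{ AttributeName: 'user_id', KeyType: 'HASH' }], // HASH = partition key
  BillingMode: 'PAY_PER_REQUEST',
};

// Hypothetical table with partition key + sort key (composite primary key).
const ordersTable = {
  TableName: 'orders',
  AttributeDefinitions: [
    { AttributeName: 'user_id', AttributeType: 'N' },
    { AttributeName: 'order_date', AttributeType: 'S' },
  ],
  KeySchema: [
    { AttributeName: 'user_id', KeyType: 'HASH' },
    { AttributeName: 'order_date', KeyType: 'RANGE' }, // RANGE = sort key
  ],
  BillingMode: 'PAY_PER_REQUEST',
};
```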
What is the max size of a row (item) in a DynamoDB table?
400 kilobytes, which is less than 1 megabyte.
When should you have a sort key?
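- If you want to do range queries. For example, find users with a birthdate between 1980 and 1990. A range query still needs an exact partition key value, so the range applies within one partition key.
- When your queries need sorted results, e.g. return users sorted by last name. Items that share a partition key are stored in sort key order.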
- A Join table. Maybe you want to connect Employees and Departments. You can have a table with composite key (employee_id, department_id)
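A range query on the sort key might look roughly like the sketch below. The table `users_by_country` and its keys (`country` as partition key, `birthdate` as sort key stored as an ISO date string) are assumptions made for the example, and the plain JavaScript values assume a DocumentClient-style API.

```javascript
// Builds Query parameters for a range query on the sort key.
// Hypothetical table: partition key `country`, sort key `birthdate`.
function buildBirthdateRangeQuery(country, from, to) {
  return {
    TableName: 'users_by_country',
    // Equality on the partition key, BETWEEN on the sort key.
    KeyConditionExpression: 'country = :c AND birthdate BETWEEN :from AND :to',
    ExpressionAttributeValues: { ':c': country, ':from': from, ':to': to },
    ScanIndexForward: true, // ascending sort key order; false reverses it
  };
}

const rangeParams = buildBirthdateRangeQuery('US', '1980-01-01', '1990-12-31');
```

ISO date strings sort lexicographically in date order, which is why they work as a sort key here.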
How do you find the node (or cluster) that a primary key belongs to?
Apply a consistent hashing algorithm to the partition key to find the appropriate node/cluster. The sort key does not affect placement.
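Here is a minimal sketch of a consistent hash ring, to show the general idea. It is a generic textbook version, not DynamoDB's internal implementation; the FNV-1a hash, the virtual node count, and the node names are all choices made for the example.

```javascript
// 32-bit FNV-1a hash, used only to place nodes and keys on the ring.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  constructor(nodes, vnodes = 50) {
    this.vnodes = vnodes; // virtual nodes per physical node, to spread load
    this.ring = [];       // entries { pos, node }, sorted by pos
    for (const n of nodes) this.addNode(n);
  }

  addNode(node) {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ pos: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  removeNode(node) {
    this.ring = this.ring.filter((e) => e.node !== node);
  }

  // The owner is the first ring entry at or after the key's hash, wrapping around.
  lookup(key) {
    const h = fnv1a(String(key));
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].pos < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node;
  }
}

const ring = new HashRing(['node-a', 'node-b', 'node-c']);
const owner = ring.lookup('user#101');
```

The useful property is that removing a node only moves the keys that node owned; every other key keeps its owner.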
Describe the indexes that can be used with DynamoDB
Global Secondary Index - has a partition key (and an optional sort key). The partition key is usually different from the partition key of the main table. This is analogous to a secondary index in MySQL.
Local Secondary Index - has same partition key as main table, but a different sort key.
LSIs can only query a single partition. LSIs also support strongly consistent reads, whereas GSIs only support eventually consistent reads. LSIs share capacity with the main table, so they might experience throttling.
GSIs are more flexible. Any LSI can be modeled as a GSI.
A table can have up to 20 GSIs (the default quota), but only 5 LSIs.
Give example usage of GSI for a chat app. Main table has (partition key, sort key) =
(chat_id, timestamp)
You could have the following GSI if you want to view all chat messages for a specific user.
(user_id, timestamp)
This way you can easily query for all chat messages for user_id = 123, and the results come back sorted by timestamp.
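A sketch of how that table and index might be declared and queried. The table name `chat_messages`, the index name `user_id-timestamp-index`, and the attribute types are assumptions for the example; the objects follow the CreateTable and Query parameter shapes but are only built here, not sent to AWS.

```javascript
// Hypothetical chat table: partition key chat_id, sort key timestamp,
// plus a GSI keyed on (user_id, timestamp).
const chatTable = {
  TableName: 'chat_messages',
  AttributeDefinitions: [
    { AttributeName: 'chat_id', AttributeType: 'S' },
    { AttributeName: 'user_id', AttributeType: 'N' },
    { AttributeName: 'timestamp', AttributeType: 'N' },
  ],
  KeySchema: [
    { AttributeName: 'chat_id', KeyType: 'HASH' },
    { AttributeName: 'timestamp', KeyType: 'RANGE' },
  ],
  GlobalSecondaryIndexes: [
    {
      IndexName: 'user_id-timestamp-index',
      KeySchema: [
        { AttributeName: 'user_id', KeyType: 'HASH' },
        { AttributeName: 'timestamp', KeyType: 'RANGE' },
      ],
      Projection: { ProjectionType: 'ALL' }, // copy all attributes into the index
    },
  ],
  BillingMode: 'PAY_PER_REQUEST',
};

// Query the index instead of the base table (DocumentClient-style plain values).
const messagesForUser = {
  TableName: 'chat_messages',
  IndexName: 'user_id-timestamp-index',
  KeyConditionExpression: 'user_id = :uid',
  ExpressionAttributeValues: { ':uid': 123 },
  ScanIndexForward: false, // newest messages first
};
```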
What are the primary ways of accessing data in DynamoDB?
- Scan operation
Reads every item in a table or index and returns the results in a paginated response.
- Query operation
Retrieves items based on the primary key or secondary index key attributes. Queries are more efficient than scans, as they only read items that match the specified key conditions. Queries can also be used to perform range queries on the sort key.
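To make the cost difference concrete, here is a toy in-memory model. It does not use the AWS SDK; it only counts how many items each access pattern has to look at. The data and function names are invented for the illustration.

```javascript
// Toy table: items grouped by partition key, each group sorted by sort key (ts).
const table = new Map([
  ['chat-1', [{ ts: 1, text: 'hi' }, { ts: 2, text: 'hello' }]],
  ['chat-2', [{ ts: 1, text: 'yo' }]],
  ['chat-3', [{ ts: 5, text: 'bye' }]],
]);

// Scan: touches every item in the table, then filters.
function scan(predicate) {
  let itemsRead = 0;
  const items = [];
  for (const [pk, group] of table) {
    for (const item of group) {
      itemsRead++;
      if (predicate(pk, item)) items.push(item);
    }
  }
  return { items, itemsRead };
}

// Query: goes straight to one partition and returns the sort key range.
// A real sorted index would seek to the range; filter() stands in for that,
// so itemsRead is modeled as the number of matching items.
function query(pk, minTs = -Infinity, maxTs = Infinity) {
  const group = table.get(pk) || [];
  const items = group.filter((item) => item.ts >= minTs && item.ts <= maxTs);
  return { items, itemsRead: items.length };
}

const viaScan = scan((pk) => pk === 'chat-1');
const viaQuery = query('chat-1');
```

Both calls return the same two items, but the scan looks at all four items in the table while the query looks at only the two in `chat-1`.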
What does this MySQL query look like in DynamoDB?
SELECT * FROM users WHERE user_id = 101
Node.js (yuck)
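// Assumes the AWS SDK for JavaScript v2; DocumentClient accepts plain JS values
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

const params = {
  TableName: 'users',
  KeyConditionExpression: 'user_id = :id', // user_id must be the partition key
  ExpressionAttributeValues: {
    ':id': 101
  }
};
// Matching items come back in data.Items
dynamodb.query(params, (err, data) => {
  if (err) console.error(err);
  else console.log(data);
});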
What is the CAP theorem?
In a distributed datastore, you can only choose 2 of the following:
1. Consistency - every read receives data from the most recent write.
2. Availability - every request receives a response, even if some servers are down. The system continues to operate.
3. Partition Tolerance - the system still works even if network communication fails. This is a requirement for all good distributed systems, so really you have to choose between Consistency and Availability.
What is strongly consistent versus eventually consistent for a distributed DB? By default, is DynamoDB strongly or eventually consistent?
Strongly Consistent - every read operation always returns the most recent write.
In a distributed DB, this means all nodes must sync immediately for each write, which takes time. It may also involve a consensus algorithm or locking. If there is a network partition, the system might sacrifice availability.
Eventual Consistency - all replicas of that data will eventually converge to the same value. Reads might be stale or return inconsistent data.
When a write occurs, data is propagated to other nodes asynchronously. This means lower latency for reads/writes. It also means higher availability.
DynamoDB reads are eventually consistent by default. Strongly consistent reads can be requested per read operation.
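In request terms, the choice is a single flag on the read. A small sketch, reusing a hypothetical `users` table keyed on `user_id`; the objects are GetItem-style parameters in DocumentClient form and are only built here, not sent.

```javascript
// Default read: ConsistentRead is omitted, so the read is eventually consistent.
const eventualRead = {
  TableName: 'users',
  Key: { user_id: 101 },
};

// Same read, asking for strong consistency.
const strongRead = {
  ...eventualRead,
  ConsistentRead: true,
};
```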
What is a network partition (in the context of CAP theorem)?
A network partition is a temporary interruption in a distributed system’s network that prevents nodes from communicating with each other. This can happen due to network failures or disruptions, and it can divide the network into separate subnetworks, or partitions, that are unaware of each other’s existence.
What kinds of applications need strong consistency?
What kinds can tolerate eventual consistency?
Banking, finance, and inventory systems should have strong consistency.
Social media, content delivery, and gaming can usually tolerate eventual consistency.
How does DynamoDB maintain consistency under the hood?
- Replication - DynamoDB replicates data to multiple nodes that are in different availability zones. Typically, each partition has 3 replicas.
- Eventual consistency - writes are considered successful when they have been ACK’d by a majority of replicas (e.g. 2 out of 3). Reads can be served by any replica, which might return old data. Background processes sync the data.
- Strong consistency - when strong consistency is used, reads are routed to the “leader” node. The leader ensures it has the most up-to-date data before replying with a result. The leader might wait for replication to complete before responding.
- Quorum-based approach - a write must be acknowledged by a quorum of replicas before it is reported as successful. A quorum is basically a majority.
- Conflict resolution - if 2 clients write to different replicas at the same time, the later write (by timestamp) is the winner (last writer wins).
- Transaction support - uses a two-phase commit protocol to coordinate changes across multiple partitions.
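The quorum, leader-read, and last-writer-wins ideas above can be sketched with a toy replica group. This is a simplified model for study purposes, not DynamoDB's actual implementation; the class, its methods, and the rule that replica 0 is the leader are all assumptions of the example.

```javascript
// Toy model of a replica group. replicas[0] plays the leader.
class ReplicaGroup {
  constructor(size = 3) {
    // Each replica maps key -> { value, ts }.
    this.replicas = Array.from({ length: size }, () => new Map());
    this.quorum = Math.floor(size / 2) + 1; // majority, e.g. 2 of 3
  }

  // Apply a write to the replicas listed in `reachable` (array of indexes).
  // Returns true only if a majority acknowledged. Simplification: the write is
  // not rolled back on failure, and the leader is assumed to be reachable.
  write(key, value, ts, reachable) {
    let acks = 0;
    for (const i of reachable) {
      const current = this.replicas[i].get(key);
      // Last writer wins: keep the version with the larger timestamp.
      if (!current || ts > current.ts) this.replicas[i].set(key, { value, ts });
      acks++;
    }
    return acks >= this.quorum;
  }

  // Eventually consistent read: any replica may answer, possibly with old data.
  readFrom(i, key) {
    const version = this.replicas[i].get(key);
    return version ? version.value : undefined;
  }

  // Strongly consistent read: always served by the leader.
  readStrong(key) {
    return this.readFrom(0, key);
  }

  // Background sync: copy the newest version of every key to all replicas.
  sync() {
    for (const src of this.replicas) {
      for (const [key, version] of src) {
        for (const dst of this.replicas) {
          const current = dst.get(key);
          if (!current || version.ts > current.ts) dst.set(key, version);
        }
      }
    }
  }
}

const group = new ReplicaGroup(3);
group.write('balance', 100, 1, [0, 1, 2]);        // all three replicas ack
const ok = group.write('balance', 80, 2, [0, 1]); // replica 2 unreachable, 2 of 3 ack
const stale = group.readFrom(2, 'balance');       // replica 2 still has the old value
const fresh = group.readStrong('balance');        // leader has the new value
group.sync();
const healed = group.readFrom(2, 'balance');      // replica 2 has caught up
```

Here the second write succeeds with 2 of 3 acks, the read from replica 2 returns the old value 100 until `sync()` runs, and the leader read returns 80 throughout.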