Week 6, Data Storage Part 2: Cloud-Based Databases and Data Warehouses Flashcards

1
Q

What is an RDBMS?

A

Definition: A Relational Database Management System (RDBMS) stores data in structured tables with predefined schemas and relationships. It guarantees strong consistency and data integrity using ACID properties.
Core Components: Uses techniques such as B-tree indexing and buffer caches, and supports transactional (OLTP) workloads.

2
Q

How do you extend a single-node RDBMS to multiple nodes?

A

Replication: Creating copies of the database for high availability.
Sharding: Splitting a large database into smaller, independent “shards” (using strategies such as hash-based or range-based partitioning) so that each shard contains a subset of the data. This “shared-nothing” architecture reduces bottlenecks but adds complexity in coordinating queries and maintaining consistency.
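As a rough illustration of hash-based partitioning, here is a minimal sketch in plain Python (the shard count and key format are illustrative assumptions, not part of any real system):

```python
# Hash-based sharding sketch: hash the key, take it modulo the shard count.
# NUM_SHARDS and the key format are illustrative assumptions.
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Deterministically map a key to one of NUM_SHARDS shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All rows for the same user land on the same shard.
print(shard_for("user:1001"), shard_for("user:1002"))
```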

3
Q

What is ACID?

A

Atomicity: Ensures all parts of a transaction succeed or none do.
Consistency: Guarantees that transactions only bring the database from one valid state to another.
Isolation: Prevents concurrent transactions from interfering with each other.
Durability: Once a transaction is committed, it remains so even if there is a system failure.
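As a small illustration of atomicity and durability, the sketch below uses Python's built-in sqlite3 module (the accounts table is invented for the example):

```python
# Atomicity demo with sqlite3: both updates commit together,
# or the rollback undoes both.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")
    conn.commit()        # durability: once committed, the transfer persists
except sqlite3.Error:
    conn.rollback()      # atomicity: on failure, neither update is applied

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 50), ('bob', 50)]
```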

4
Q

What is the Purpose of the Managed Relational Database?

A

Operational Simplicity: Offloads tasks such as provisioning, patching, backups, scaling, and failover.
High Availability: Offers built-in features like multi-AZ deployments, automatic failover, and replication.
Focus on Applications: Developers can concentrate on application logic rather than database maintenance.
Examples: AWS RDS automates these tasks for various engines (MySQL, PostgreSQL, etc.), reducing administrative overhead.

5
Q

How Does Replication of RDBMS in the Cloud Work?

A

Two mechanisms:
Synchronous Replication: Used in multi-AZ deployments, where a standby replica is kept up-to-date with the primary.
Asynchronous Replication: Used in read replicas to distribute read load, though these do not provide automatic failover.

6
Q

What is AWS RDS (Definition, Automation, Deployment, Scaling, Performance)?

A
  • Definition: A fully managed relational database service supporting multiple engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server).
  • Automation: Automates provisioning, patching, backups, and recovery.
  • Deployment Options: Offers both single-AZ (cost-effective) and multi-AZ (high availability) configurations.
  • Scaling:
    • Vertical Scaling: Resizing instances for more CPU/memory.
    • Horizontal Scaling: Using read replicas to distribute query load.
  • Performance: Uses a variety of storage types (General Purpose SSDs, Provisioned IOPS SSDs) and integrates performance monitoring tools like Performance Insights.
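A hedged sketch of both scaling paths using Boto3 (the instance identifiers and instance class are placeholders; the calls require real AWS credentials and an existing instance):

```python
# Vertical and horizontal scaling of RDS via Boto3.
# Identifiers below are placeholders, not real resources.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Vertical scaling: resize the instance class for more CPU/memory.
rds.modify_db_instance(
    DBInstanceIdentifier="my-primary-db",
    DBInstanceClass="db.r6g.xlarge",
    ApplyImmediately=True,
)

# Horizontal scaling: add an asynchronous read replica for read traffic.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="my-primary-db-replica-1",
    SourceDBInstanceIdentifier="my-primary-db",
)
```
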
7
Q

How do RDS Multi-AZ Deployments and Read Replicas Improve Availability and Performance?

A

Multi-AZ Deployments:
- High Availability: Maintain a standby replica in a different availability zone. In case of failure, automatic promotion minimizes downtime.
- Focus: Primarily designed for fault tolerance rather than scaling reads.

Read Replicas:
- Performance Scaling: Distribute read queries across multiple replicas, reducing load on the primary instance.
- Limitations: Do not provide automatic failover.

8
Q

Differences Between AWS RDS and Amazon Aurora, and Use Cases for Each

A

AWS RDS:
- Traditional Approach: Uses conventional database architectures with tightly coupled compute and storage.
- Best For: Legacy applications or cases where fine-tuned control is needed.
- Deployment: Provides options for multi-AZ and read replicas, but scaling may require manual intervention.

Amazon Aurora:
- Cloud-Native Design: Separates compute from storage, enabling automatic and near-instant scaling.
- Performance: Offers up to 5× the throughput of MySQL and 3× that of PostgreSQL.
- High Availability: Uses a distributed, self-healing storage system with quorum-based replication.
- Use Cases: Mission-critical applications that require high performance, rapid failover, and scalability.

9
Q

What is Amazon Aurora and How Does it Maintain High Availability and Scalability?

A
  • Architecture: Fully managed, MySQL and PostgreSQL compatible database designed from the ground up for the cloud.
  • High Availability:
    • Data is stored in a distributed storage system replicated across multiple availability zones.
    • Uses quorum-based writes to maintain data integrity even if some nodes fail.
  • Scalability:
    • Separates compute from storage so that each can scale independently.
    • Supports up to 128 TB of storage and multiple read replicas.
  • Failover: Near-instantaneous failover thanks to continuous background page generation and replication.
10
Q

What are the Benefits of Aurora Serverless, and How Does it Handle Automatic Scaling for Variable Workloads?

A

On-Demand Scaling: Automatically adjusts capacity based on workload demands without manual intervention.
Cost Efficiency: Scales down (even pausing when idle) to reduce costs during low usage periods.
Fine-Grained Increments: Version 2 supports scaling in increments as small as half an ACU, offering near-instant response to workload changes.
Ideal For: Unpredictable or intermittent workloads such as development, testing, or applications with seasonal traffic.

11
Q

What is the Purpose of the Redo Log in Aurora, and How Does it Improve Write Performance and Crash Recovery?

A

Aurora’s Innovation: The database instance writes only redo log records rather than full data pages.
Benefits:
Reduced Write Amplification: By avoiding full-page writes, Aurora minimizes the number of physical disk writes.
Faster Crash Recovery: The storage layer can reconstruct data pages on-demand from the redo logs, resulting in near-instant recovery.
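A toy model of on-demand page materialization from redo records (a simplified sketch, not Aurora's actual log format):

```python
# Toy page reconstruction: replay redo records against a base page image.
from dataclasses import dataclass

@dataclass
class RedoRecord:
    offset: int   # byte position within the page
    value: bytes  # new bytes at that position

def materialize_page(base: bytes, log: list[RedoRecord]) -> bytes:
    """Apply redo records in order to rebuild the current page on demand."""
    page = bytearray(base)
    for rec in log:
        page[rec.offset:rec.offset + len(rec.value)] = rec.value
    return bytes(page)

base_page = b"\x00" * 16
redo_log = [RedoRecord(0, b"AB"), RedoRecord(4, b"CD")]
print(materialize_page(base_page, redo_log))
```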

12
Q

How Does Amazon Aurora Use Quorums and Replication to Ensure High Availability and Fault Tolerance?

A

Writes: Aurora writes data to six copies across three availability zones but requires acknowledgment from only four copies (a quorum) to commit a transaction. This design allows the system to tolerate failures.
Reads: Under normal conditions, Aurora reads from the fastest node without needing a full quorum, but during recovery or consistency checks, a quorum read (from at least three nodes) is used.
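The quorum arithmetic behind this design can be checked in a few lines (the invariants, not the specific values, are the point):

```python
# Aurora-style quorum check: V copies, write quorum W, read quorum R.
V, W, R = 6, 4, 3

assert R + W > V   # every read quorum overlaps every write quorum
assert 2 * W > V   # two write quorums always overlap (no split-brain writes)

print("copies that can fail for writes:", V - W)  # 2, e.g. a whole AZ
print("copies that can fail for reads:", V - R)   # 3, e.g. an AZ plus one node
```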

13
Q

What is Aurora Fast Repair?

A

Mechanism: Aurora’s storage is divided into protection groups (logical 10-GB units). If a failure occurs, only the affected group is repaired rather than the entire volume.
Outcome: This targeted, massively parallel repair process (often termed “Aurora Fast Repair”) leads to extremely quick recovery with minimal performance impact.

14
Q

When are Quorum Reads Necessary?

A

During Recovery: When verifying the most recent write state after a restart or during failure recovery, quorum reads ensure data consistency.
For Critical Consistency Checks: When an application cannot tolerate any stale data.

15
Q

Why Not Use Quorum Reads Exclusively?

A

Performance Cost: Quorum reads incur additional latency and resource overhead, so for routine operations the system reads from the fastest available node.
Efficiency: Using quorum reads exclusively would slow down normal operations without providing significant benefits under stable conditions.

16
Q

Key Differences Between Aurora and Traditional Relational Databases

A
  • Write Efficiency:
    • Traditional RDBMS: Use full-page writes plus multiple logging steps (redo logs, binary logs, double writes) that lead to write amplification.
    • Aurora: Writes only redo log records; the distributed storage layer reconstructs pages on demand, dramatically reducing write amplification and increasing throughput.
  • Failure Recovery:
    • Traditional RDBMS: Often experience prolonged recovery times as logs and full pages are replayed.
    • Aurora: Leverages its self-healing, distributed storage system to recover almost instantly after a failure.
  • Overall Architecture:
    • Traditional Systems: Tightly couple compute and storage.
    • Aurora: Decouples compute from storage, enabling independent scaling and rapid failover through quorum-based replication.
17
Q

How Does Aurora Global Database Enable Low-Latency Replication and Cross-Region Failover?

A

Replication Across Regions: Uses a dedicated replication infrastructure where redo logs are transferred directly from the primary region to secondary regions.
Low Latency: Replication lag is typically kept under one second during normal operations.
Cross-Region Failover: In the event of a regional outage, a secondary region can be quickly promoted to handle write operations, ensuring business continuity.

18
Q

What is a NoSQL Database? What Does it stand for?

A

NoSQL stands for “Not Only SQL.” NoSQL databases are systems that may still support some SQL-like features, but they also offer alternative data models.

19
Q

How do NoSQL databases use key-value pairs? What are the advantages? What are the use cases?

A

Structure: Data is stored as key-value pairs. The key is used to store and retrieve the data, and the system does not look into the contents of the value.
Performance: Because retrieval is essentially a quick hash lookup, key-value databases are highly scalable and deliver very fast response times even with massive datasets.
Use Cases: Ideal for applications requiring simple get/put operations, such as caching and session management.
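A minimal in-process sketch of the model (illustrative only; real systems such as DynamoDB or Redis distribute this across many nodes):

```python
# The key-value contract: get/put by key, opaque values, O(1) hash lookups.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: object) -> None:
        self._data[key] = value      # the store never inspects the value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

sessions = KVStore()
sessions.put("session:42", {"user": "alice", "cart": ["book"]})
print(sessions.get("session:42"))
```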

20
Q

What is a Wide Column Database? What are some examples of Wide Column Databases?

A
  • A wide column database organizes data into rows and dynamic columns, allowing each row to have a variable number of columns.
  • Examples:
    • Google’s BigTable: A fully managed wide-column NoSQL database.
    • HBase: An open-source system modeled after BigTable.
    • Cassandra: Inspired by the Dynamo paper and similar in approach to wide-column key-value systems.
21
Q

What are In Memory (Cache) Databases? (Common uses, Examples, pros and cons)

A
  • Store data in RAM, enabling microsecond-level latency.
  • Common Uses: Real-time applications, gaming, session storage, and real-time analytics.
  • Examples:
    • AWS ElastiCache (supports Redis and Memcached)
    • Azure Cache for Redis
    • Google Cloud Memorystore (supports Redis and Memcached)
  • Pros: Very fast access times; excellent for reducing load on primary databases.
  • Cons: Memory is a limited resource; not ideal for persistent storage of very large datasets.
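A hedged sketch of session caching with the redis-py client (assumes a reachable Redis endpoint, local or ElastiCache; install with pip install redis):

```python
# Session caching in Redis: sub-millisecond reads, TTL-based expiry.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", "user=alice", ex=300)  # expire after 300 seconds
print(r.get("session:42"))                 # "user=alice"
print(r.ttl("session:42"))                 # ~300
```
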
22
Q

What are Document Databases (Examples, Pros and Cons)? How does it make querying efficient?

A

Document databases: Designed to store semi-structured data (e.g., JSON or XML documents).

Querying: Indexes paths in the document tree to allow for efficient queries on the document’s contents.

Examples:
- Amazon DocumentDB (offers MongoDB compatibility)
- Azure Cosmos DB (originally known as DocumentDB)
- Google Firestore (targeted at mobile app development)
- IBM Cloudant and native MongoDB services

Pros: Flexible schema; efficient for querying nested data.
Cons: May require more complex query designs compared to traditional relational databases.

23
Q

What is DynamoDB?

A
  • A fully managed, serverless NoSQL database service from AWS.
  • Implements a simple key-value store based on a distributed B‑tree data structure.
  • Uses a consistent hashing algorithm to distribute data across partitions, ensuring scalability.
  • Ideal for applications that fit a simple key-value model, such as storing application state, session tokens, etc.
  • Integrated with AWS services and can be used in both serverless and traditional application architectures.
24
Q

What is the difference between a Query and a Scan?

A

Query:
Purpose: Retrieves items based on the primary key (and optionally sort key conditions).
Efficiency: Fast and efficient because it leverages the underlying index (B‑tree structure) built on the primary key.
Usage: Use when you know the primary key value and want to retrieve related items.

Scan:
Purpose: Reads every item in the table (or secondary index) and filters the results afterward.
Efficiency: More resource-intensive and slower, especially for large datasets.
Usage: Use sparingly, generally for infrequent or one-off queries where you need to filter on non-key attributes.
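A hedged Boto3 sketch contrasting the two (the Books table and attribute names are placeholders):

```python
# Query vs. Scan in DynamoDB via Boto3.
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("Books")

# Query: index-backed; touches only items under one partition key.
by_author = table.query(KeyConditionExpression=Key("author").eq("Jane Austen"))

# Scan: reads every item, then filters; expensive on large tables.
in_genre = table.scan(FilterExpression=Attr("genre").eq("satire"))

print(by_author["Count"], in_genre["Count"])
```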

25
Q

DynamoDB Usage Model (Put and Get)

A

Put Operation:
- Function: Inserts an item into a table.
- Process: You create a table (via the AWS Console, CLI, or API) and then add items using a put command (for example, with the Python Boto3 API).

Get Operation:
- Function: Retrieves an item from the table using its primary key.
- Process: The get command uses the key (or composite key) to locate and return the item.
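A minimal sketch using the Boto3 API mentioned above (the Sessions table and its pk key are placeholders and must already exist):

```python
# Put and Get against an existing DynamoDB table via Boto3.
import boto3

table = boto3.resource("dynamodb").Table("Sessions")

# Put: insert (or overwrite) an item identified by its primary key.
table.put_item(Item={"pk": "session:42", "user": "alice"})

# Get: a fast single-item lookup by primary key.
resp = table.get_item(Key={"pk": "session:42"})
print(resp.get("Item"))
```
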
26
Q

DynamoDB Primary Key (Simple vs. Composite), Sort Key, Secondary Index (Local vs. Global)

A

Primary Key:
- Simple Primary Key: Consists of a single partition key.
- Composite Primary Key: Consists of a partition key and a sort key, allowing multiple items to share the same partition key but be uniquely identified by the sort key.

Sort Key: Orders items within the same partition, making it easier to query ranges of data.

Secondary Indexes:
- Local Secondary Index (LSI): Uses the same partition key as the table but a different sort key; provides strong consistency but is limited in partition size.
- Global Secondary Index (GSI): An index whose partition key and sort key can differ from those on the table. Offers more flexibility and is not limited to a single partition, but reads are eventually consistent.
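A hedged Boto3 sketch that creates a table with a composite primary key and a GSI (all names are illustrative):

```python
# Composite primary key (author + title) plus a GSI keyed on genre.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="Books",
    AttributeDefinitions=[
        {"AttributeName": "author", "AttributeType": "S"},
        {"AttributeName": "title", "AttributeType": "S"},
        {"AttributeName": "genre", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "author", "KeyType": "HASH"},   # partition key
        {"AttributeName": "title", "KeyType": "RANGE"},   # sort key
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "genre-title-index",
        "KeySchema": [
            {"AttributeName": "genre", "KeyType": "HASH"},
            {"AttributeName": "title", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```
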
27
Q

How does DynamoDB structure data using tables, items, attributes, and primary keys?

A

Tables: Collections of items (analogous to relational tables) but without a fixed schema.
Items: Individual records within a table (similar to rows). Items can have varying attributes.
Attributes: Data elements within an item (similar to columns).
Primary Keys: Unique identifiers for items. They are critical for data partitioning and retrieval.
28
Q

What is the difference between a partition key and a composite primary key in DynamoDB?

A

Partition Key: A single attribute that is hashed to determine the partition where the item is stored. Used in simple key-value operations.
Composite Primary Key: Consists of a partition key plus a sort key. Allows multiple items with the same partition key but different sort keys, providing additional flexibility in how data is organized and queried.
29
Q

How does DynamoDB handle relational data without using traditional SQL joins?

A

Approach: Instead of using SQL joins, DynamoDB encourages denormalizing your data. Multiple types of entities (e.g., authors and books) can be stored in the same table by designing composite keys that “pre-join” related data at write time. Use the Query API on the composite key to retrieve all related items (for example, all books by a specific author) without needing runtime join operations.
Benefits: Eliminates the expensive join operation, making data retrieval fast and scalable.
Considerations: Data redundancy means that maintaining consistency becomes the responsibility of the application.
30
Q

What are the key strategies for modeling one-to-many relationships in DynamoDB?

A

Using Complex Attributes: Embed a list or map (e.g., a list of book titles in an author record). Works well only when the list is small and you don’t need to query individual elements.
Composite Primary Key with Query API: Store the parent and child entities in the same table, using the parent’s identifier as the partition key and the sort key to differentiate each child (e.g., each book title). This lets you retrieve all child records (or a subset) efficiently via a single query (see the sketch below).
Secondary Indexes & Hierarchical Sort Keys: Create secondary indexes to support alternate query patterns, or use hierarchical sort keys (e.g., for drilling down on dates or locations).
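A hedged sketch of the composite-key strategy (the Library table and the AUTHOR#/BOOK# key format are illustrative conventions, not required by DynamoDB):

```python
# One-to-many via composite keys: parent and children share a partition key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Library")

table.put_item(Item={"pk": "AUTHOR#austen", "sk": "PROFILE", "name": "Jane Austen"})
table.put_item(Item={"pk": "AUTHOR#austen", "sk": "BOOK#Emma", "year": 1815})
table.put_item(Item={"pk": "AUTHOR#austen", "sk": "BOOK#Persuasion", "year": 1817})

# One Query returns the author profile and all of their books: no join needed.
everything = table.query(KeyConditionExpression=Key("pk").eq("AUTHOR#austen"))

# Narrow to just the books with a sort-key condition.
books = table.query(
    KeyConditionExpression=Key("pk").eq("AUTHOR#austen") & Key("sk").begins_with("BOOK#")
)
print(everything["Count"], books["Count"])  # 3 2
```
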
31
Q

How does denormalization improve performance in DynamoDB, and what are the trade-offs?

A

Performance Benefits:
- Data is combined (denormalized) at write time, so queries can retrieve all the necessary information in one go.
- Eliminates the need for complex, runtime join operations, which are costly at scale.

Trade-offs:
- Information is duplicated, so updates might need to be propagated to multiple records.
- Ensuring data consistency becomes the responsibility of your application logic.
32
Q

What is the Query API, and what are its key features?

A

Targeted Retrieval: Retrieves items based on specific primary key values.
Sort Key Conditions: Supports conditions such as equality, less than, greater than, between, and “begins with” on the sort key.
Advantages: Uses the underlying indexed structure (B‑tree) to quickly return relevant items, and can return an entire collection of items that share the same partition key, making it ideal for retrieving related entities.
Usage Scenario: Best used when your data model is designed around your application’s access patterns, minimizing the need for full table scans.
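For example, a range condition on the sort key expressed with Boto3's condition builders (table and attribute names are placeholders):

```python
# Sort-key range query: one customer's orders within a date range.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Orders")

resp = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("cust-1001")
        & Key("order_date").between("2024-01-01", "2024-12-31")
    )
)
print(resp["Count"])
```
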
33
Q

What is Spanner?

A

Google Cloud Spanner is a globally distributed, strongly consistent relational database designed for high performance and horizontal scalability across multiple data centers. It was originally an internal system at Google and was later made publicly available.
34
Q

What is TrueTime, why was it created, what does it enable, and how does it work?

A

What it is: TrueTime is Google’s globally synchronized clock system that provides timestamps as intervals with bounded uncertainty (epsilon).
Why it was created: It addresses the problem of ordering transactions in a distributed system, where conventional clocks can never be perfectly synchronized.
What it enables: It lets Spanner assign globally consistent timestamps to transactions, ensuring that the real-time order of transactions is maintained. This guarantees external consistency, a guarantee even stronger than traditional serializability.
How it works: TrueTime combines signals from GPS satellites and atomic clocks in each Google data center. Instead of returning a single time value, it returns an interval (e.g., “the current time is between x and y”), and Spanner uses the upper bound of this range when committing a transaction, waiting until that bound has passed. This waiting period ensures that no transaction can later be assigned an earlier timestamp, preventing anomalies.
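A toy illustration of the commit-wait idea (tt_now() and the epsilon value are hypothetical stand-ins for the real TrueTime API):

```python
# Commit wait: choose the interval's upper bound as the commit timestamp,
# then wait until that instant is definitely in the past on every clock.
import time
from collections import namedtuple

Interval = namedtuple("Interval", ["earliest", "latest"])
EPSILON = 0.004  # assumed clock uncertainty (a few milliseconds)

def tt_now() -> Interval:
    """Hypothetical TrueTime: the true time lies somewhere in this interval."""
    now = time.time()
    return Interval(now - EPSILON, now + EPSILON)

def commit_timestamp() -> float:
    ts = tt_now().latest            # upper bound of the uncertainty interval
    while tt_now().earliest <= ts:  # commit wait, roughly 2 * EPSILON
        time.sleep(EPSILON / 4)
    return ts  # no later transaction can be assigned an earlier timestamp
```
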
35
Q

External vs. Strong vs. Weak Consistency

A

External Consistency (as used in Spanner): Guarantees that if one transaction commits before another begins, all nodes in the system see the transactions in that same order. It is even stronger than typical serializability.
Strong Consistency: All reads reflect the most recent write. In strongly consistent systems, every node returns the same data after a successful write.
Weak Consistency: Allows temporary discrepancies between replicas. Reads might not always reflect the most recent write but will eventually converge.
36
Q

Which of C, A, and P does Spanner sacrifice?

A

Spanner opts for Consistency and Partition tolerance. In scenarios like network failures, Spanner may sacrifice Availability by refusing to respond rather than risking inconsistency.
37
Q

What is Paxos, and what role does it play in Spanner?

A

Definition: Paxos is a consensus algorithm that helps distributed systems agree on a single value (or transaction order) even in the presence of failures.
Role in Spanner:
- Each data shard is managed by a Paxos group to ensure fault tolerance.
- Spanner integrates Paxos with a two-phase commit protocol. Instead of relying solely on a traditional two-phase commit (which would require all participants to be online), each participant is a Paxos group. This design lets the system reach consensus even if some nodes fail.
38
Q

Do any other services besides TrueTime use hardware-based time synchronization?

A

Microsoft Azure: Uses GPS clock synchronization via its VMICTimeSync provider.
Amazon AWS: Provides the Amazon Time Sync Service, offering highly accurate time synchronization across regions.
39
Q

What is Cosmos DB? Is it write-optimized or read-optimized? How is indexing achieved?

A

Definition: Azure Cosmos DB is a globally distributed, multi-model database service designed to address the challenges of global distribution, low latency, and flexible data models.
Write-Optimized: Cosmos DB is built as a write-optimized system. It is designed for low-latency transactions and high throughput, meaning that it efficiently handles a high volume of write operations.
Indexing: Every document (typically JSON) is automatically indexed. The system indexes every path within a document, which eliminates the need for manual index management. Indexing is built on the Bw-tree, a latch-free, log-structured record store optimized for modern multi-core processors. This design ensures high-speed updates and efficient query performance.
40
Q

What Consistency Models does Cosmos DB support? How do they differ from one another?

A

Eventual Consistency: Lowest latency; updates propagate asynchronously. Temporary inconsistencies may occur, but all changes will eventually be seen.
Consistent Prefix: Guarantees that writes are never returned out of order, though some updates might be missing if they haven’t been fully replicated yet.
Session Consistency: Ensures that a client (or session) sees its own updates in order. Widely used for personalized user experiences like shopping carts.
Bounded Staleness: Guarantees that reads lag behind writes by at most a fixed time interval or number of operations, ensuring a global ordering of writes with controlled staleness.
Strong Consistency: Guarantees that every read returns the most recent committed write. For performance reasons, this mode is typically limited to a single region.