Week 6, Data Storage Part 2: Cloud-Based Databases and Data Warehouses Flashcards

1
Q

What is an RDBMS?

A

Definition: A Relational Database Management System (RDBMS) stores data in structured tables with predefined schemas and relationships. It guarantees strong consistency and data integrity using ACID properties.
Core Components: Uses techniques such as B-tree indexing and buffer caches, and supports transactional (OLTP) workloads.

2
Q

How do you extend a single-node RDBMS to multiple nodes?

A

Replication: Creating copies of the database for high availability.
Sharding: Splitting a large database into smaller, independent “shards” (using strategies such as hash-based or range-based partitioning) so that each shard contains a subset of the data. This “shared-nothing” architecture reduces bottlenecks but adds complexity in coordinating queries and maintaining consistency.
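As a rough illustration of hash-based partitioning, here is a minimal sketch in plain Python (the shard count and key format are illustrative assumptions, not part of any real system):

```python
# Hash-based sharding sketch: hash the key, take it modulo the shard count.
# NUM_SHARDS and the key format are illustrative assumptions.
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Deterministically map a key to one of NUM_SHARDS shards."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All rows for the same user land on the same shard.
print(shard_for("user:1001"), shard_for("user:1002"))
```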

3
Q

What is ACID?

A

Atomicity: Ensures all parts of a transaction succeed or none do.
Consistency: Guarantees that transactions only bring the database from one valid state to another.
Isolation: Prevents concurrent transactions from interfering with each other.
Durability: Once a transaction is committed, it remains so even if there is a system failure.
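As a small illustration of atomicity and durability, the sketch below uses Python's built-in sqlite3 module (the accounts table is invented for the example):

```python
# Atomicity demo with sqlite3: both updates commit together,
# or the rollback undoes both.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")
    conn.commit()        # durability: once committed, the transfer persists
except sqlite3.Error:
    conn.rollback()      # atomicity: on failure, neither update is applied

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 50), ('bob', 50)]
```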

4
Q

What is the Purpose of the Managed Relational Database?

A

Operational Simplicity: Offloads tasks such as provisioning, patching, backups, scaling, and failover.
High Availability: Offers built-in features like multi-AZ deployments, automatic failover, and replication.
Focus on Applications: Developers can concentrate on application logic rather than database maintenance.
Examples: AWS RDS automates these tasks for various engines (MySQL, PostgreSQL, etc.), reducing administrative overhead.

5
Q

How Does Replication of RDBMS in the Cloud Work?

A

Two mechanisms:
Synchronous Replication: Used in multi-AZ deployments, where a standby replica is kept up-to-date with the primary.
Asynchronous Replication: Used in read replicas to distribute read load, though these do not provide automatic failover.

6
Q

What is AWS RDS (Definition, Automation, Deployment, Scaling, Performance)?

A
  • Definition: A fully managed relational database service supporting multiple engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server).
  • Automation: Automates provisioning, patching, backups, and recovery.
  • Deployment Options: Offers both single-AZ (cost-effective) and multi-AZ (high availability) configurations.
  • Scaling:
    • Vertical Scaling: Resizing instances for more CPU/memory.
    • Horizontal Scaling: Using read replicas to distribute query load.
  • Performance: Uses a variety of storage types (General Purpose SSDs, Provisioned IOPS SSDs) and integrates performance monitoring tools like Performance Insights.
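A hedged sketch of both scaling paths using Boto3 (the instance identifiers and instance class are placeholders; the calls require real AWS credentials and an existing instance):

```python
# Vertical and horizontal scaling of RDS via Boto3.
# Identifiers below are placeholders, not real resources.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Vertical scaling: resize the instance class for more CPU/memory.
rds.modify_db_instance(
    DBInstanceIdentifier="my-primary-db",
    DBInstanceClass="db.r6g.xlarge",
    ApplyImmediately=True,
)

# Horizontal scaling: add an asynchronous read replica for read traffic.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="my-primary-db-replica-1",
    SourceDBInstanceIdentifier="my-primary-db",
)
```
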
7
Q

How do RDS Multi-AZ Deployments and Read Replicas Improve Availability and Performance?

A

Multi-AZ Deployments:
- High Availability: Maintain a standby replica in a different availability zone. In case of failure, automatic promotion minimizes downtime.
- Focus: Primarily designed for fault tolerance rather than scaling reads.

Read Replicas:
- Performance Scaling: Distribute read queries across multiple replicas, reducing load on the primary instance.
- Limitations: Do not provide automatic failover.

8
Q

Differences Between AWS RDS and Amazon Aurora, and Use Cases for Each

A

AWS RDS:
- Traditional Approach: Uses conventional database architectures with tightly coupled compute and storage.
- Best For: Legacy applications or cases where fine-tuned control is needed.
- Deployment: Provides options for multi-AZ and read replicas, but scaling may require manual intervention.

Amazon Aurora:
- Cloud-Native Design: Separates compute from storage, enabling automatic and near-instant scaling.
- Performance: Offers up to 5× the throughput of MySQL and 3× that of PostgreSQL.
- High Availability: Uses a distributed, self-healing storage system with quorum-based replication.
- Use Cases: Mission-critical applications that require high performance, rapid failover, and scalability.

9
Q

What is Amazon Aurora and How Does it Maintain High Availability and Scalability?

A
  • Architecture: Fully managed, MySQL and PostgreSQL compatible database designed from the ground up for the cloud.
  • High Availability:
    • Data is stored in a distributed storage system replicated across multiple availability zones.
    • Uses quorum-based writes to maintain data integrity even if some nodes fail.
  • Scalability:
    • Separates compute from storage so that each can scale independently.
    • Supports up to 128 TB of storage and multiple read replicas.
  • Failover: Near-instantaneous failover thanks to continuous background page generation and replication.
10
Q

What are the Benefits of Aurora Serverless, and How Does it Handle Automatic Scaling for Variable Workloads?

A

On-Demand Scaling: Automatically adjusts capacity based on workload demands without manual intervention.
Cost Efficiency: Scales down (even pausing when idle) to reduce costs during low usage periods.
Fine-Grained Increments: Version 2 supports scaling in increments as small as half an ACU, offering near-instant response to workload changes.
Ideal For: Unpredictable or intermittent workloads such as development, testing, or applications with seasonal traffic.

11
Q

What is the Purpose of the Redo Log in Aurora, and How Does it Improve Write Performance and Crash Recovery?

A

Aurora’s Innovation: The database instance writes only redo log records rather than full data pages.
Benefits:
Reduced Write Amplification: By avoiding full-page writes, Aurora minimizes the number of physical disk writes.
Faster Crash Recovery: The storage layer can reconstruct data pages on-demand from the redo logs, resulting in near-instant recovery.
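A toy model of on-demand page materialization from redo records (a simplified sketch, not Aurora's actual log format):

```python
# Toy page reconstruction: replay redo records against a base page image.
from dataclasses import dataclass

@dataclass
class RedoRecord:
    offset: int   # byte position within the page
    value: bytes  # new bytes at that position

def materialize_page(base: bytes, log: list[RedoRecord]) -> bytes:
    """Apply redo records in order to rebuild the current page on demand."""
    page = bytearray(base)
    for rec in log:
        page[rec.offset:rec.offset + len(rec.value)] = rec.value
    return bytes(page)

base_page = b"\x00" * 16
redo_log = [RedoRecord(0, b"AB"), RedoRecord(4, b"CD")]
print(materialize_page(base_page, redo_log))
```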

12
Q

How Does Amazon Aurora Use Quorums and Replication to Ensure High Availability and Fault Tolerance?

A

Writes: Aurora writes data to six copies across three availability zones but requires acknowledgment from only four copies (a quorum) to commit a transaction. This design allows the system to tolerate failures.
Reads: Under normal conditions, Aurora reads from the fastest node without needing a full quorum, but during recovery or consistency checks, a quorum read (from at least three nodes) is used.
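The quorum arithmetic behind this design can be checked in a few lines (the invariants, not the specific values, are the point):

```python
# Aurora-style quorum check: V copies, write quorum W, read quorum R.
V, W, R = 6, 4, 3

assert R + W > V   # every read quorum overlaps every write quorum
assert 2 * W > V   # two write quorums always overlap (no split-brain writes)

print("copies that can fail for writes:", V - W)  # 2, e.g. a whole AZ
print("copies that can fail for reads:", V - R)   # 3, e.g. an AZ plus one node
```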

13
Q

What is Aurora Fast Repair?

A

Mechanism: Aurora’s storage is divided into protection groups (logical 10-GB units). If a failure occurs, only the affected group is repaired rather than the entire volume.
Outcome: This targeted, massively parallel repair process (often termed “Aurora Fast Repair”) leads to extremely quick recovery with minimal performance impact.

14
Q

When are Quorum Reads Necessary?

A

During Recovery: When verifying the most recent write state after a restart or during failure recovery, quorum reads ensure data consistency.
For Critical Consistency Checks: When an application cannot tolerate any stale data.

15
Q

Why Not Use Quorum Reads Exclusively?

A

Performance Cost: Quorum reads incur additional latency and resource overhead, so for routine operations the system reads from the fastest available node.
Efficiency: Using quorum reads exclusively would slow down normal operations without providing significant benefits under stable conditions.

16
Q

Key Differences Between Aurora and Traditional Relational Databases

A
  • Write Efficiency:
    • Traditional RDBMS: Use full-page writes plus multiple logging steps (redo logs, binary logs, double writes) that lead to write amplification.
    • Aurora: Writes only redo log records; the distributed storage layer reconstructs pages on demand, dramatically reducing write amplification and increasing throughput.
  • Failure Recovery:
    • Traditional RDBMS: Often experience prolonged recovery times as logs and full pages are replayed.
    • Aurora: Leverages its self-healing, distributed storage system to recover almost instantly after a failure.
  • Overall Architecture:
    • Traditional Systems: Tightly couple compute and storage.
    • Aurora: Decouples compute from storage, enabling independent scaling and rapid failover through quorum-based replication.
17
Q

How Does Aurora Global Database Enable Low-Latency Replication and Cross-Region Failover?

A

Replication Across Regions: Uses a dedicated replication infrastructure where redo logs are transferred directly from the primary region to secondary regions.
Low Latency: Replication lag is typically kept under one second during normal operations.
Cross-Region Failover: In the event of a regional outage, a secondary region can be quickly promoted to handle write operations, ensuring business continuity.

18
Q

What is a NoSQL Database? What Does it stand for?

A

NoSQL stands for “Not Only SQL.” NoSQL databases are systems that may still support some SQL-like features, but they also offer alternative data models.

19
Q

How do NoSQL databases use key-value pairs? What are the advantages? What are the use cases?

A

Structure: Data is stored as key-value pairs. The key is used to store and retrieve the data, and the system does not look into the contents of the value.
Performance: Because retrieval is essentially a quick hash lookup, key-value databases are highly scalable and deliver very fast response times even with massive datasets.
Use Cases: Ideal for applications requiring simple get/put operations, such as caching and session management.
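A minimal in-process sketch of the model (illustrative only; real systems such as DynamoDB or Redis distribute this across many nodes):

```python
# The key-value contract: get/put by key, opaque values, O(1) hash lookups.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, value: object) -> None:
        self._data[key] = value      # the store never inspects the value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

sessions = KVStore()
sessions.put("session:42", {"user": "alice", "cart": ["book"]})
print(sessions.get("session:42"))
```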

20
Q

What is a Wide Column Database? What are some examples of Wide Column Databases?

A
  • A wide column database organizes data into rows and dynamic columns, allowing each row to have a variable number of columns.
  • Examples:
    • Google’s BigTable: A fully managed wide-column NoSQL database.
    • HBase: An open-source system modeled after BigTable.
    • Cassandra: Inspired by the Dynamo paper and similar in approach to wide-column key-value systems.
21
Q

What are In Memory (Cache) Databases? (Common uses, Examples, pros and cons)

A
  • Store data in RAM, enabling microsecond-level latency.
  • Common Uses: Real-time applications, gaming, session storage, and real-time analytics.
  • Examples:
    • AWS ElastiCache (supports Redis and Memcached)
    • Azure Cache for Redis
    • Google Cloud Memorystore (supports Redis and Memcached)
  • Pros: Very fast access times; excellent for reducing load on primary databases.
  • Cons: Memory is a limited resource; not ideal for persistent storage of very large datasets.
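A hedged sketch of session caching with the redis-py client (assumes a reachable Redis endpoint, local or ElastiCache; install with pip install redis):

```python
# Session caching in Redis: sub-millisecond reads, TTL-based expiry.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:42", "user=alice", ex=300)  # expire after 300 seconds
print(r.get("session:42"))                 # "user=alice"
print(r.ttl("session:42"))                 # ~300
```
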
22
Q

What are Document Databases (Examples, Pros and Cons)? How does it make querying efficient?

A

Document databases: Designed to store semi-structured data (e.g., JSON or XML documents).

Querying: Indexes paths in the document tree to allow for efficient queries on the document’s contents.

Examples:
- Amazon DocumentDB (offers MongoDB compatibility)
- Azure Cosmos DB (originally known as DocumentDB)
- Google Firestore (targeted at mobile app development)
- IBM Cloudant and native MongoDB services

Pros: Flexible schema; efficient for querying nested data.
Cons: May require more complex query designs compared to traditional relational databases.

23
Q

What is DynamoDB?

A
  • A fully managed, serverless NoSQL database service from AWS.
  • Implements a simple key-value store based on a distributed B‑tree data structure.
  • Uses a consistent hashing algorithm to distribute data across partitions, ensuring scalability.
  • Ideal for applications that fit a simple key-value model, such as storing application state, session tokens, etc.
  • Integrated with AWS services and can be used in both serverless and traditional application architectures.
24
Q

What is the difference between a Query and a Scan?

A

Query:
Purpose: Retrieves items based on the primary key (and optionally sort key conditions).
Efficiency: Fast and efficient because it leverages the underlying index (B‑tree structure) built on the primary key.
Usage: Use when you know the primary key value and want to retrieve related items.

Scan:
Purpose: Reads every item in the table (or secondary index) and filters the results afterward.
Efficiency: More resource-intensive and slower, especially for large datasets.
Usage: Use sparingly, generally for infrequent or one-off queries where you need to filter on non-key attributes.
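A hedged Boto3 sketch contrasting the two (the Books table and attribute names are placeholders):

```python
# Query vs. Scan in DynamoDB via Boto3.
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("Books")

# Query: index-backed; touches only items under one partition key.
by_author = table.query(KeyConditionExpression=Key("author").eq("Jane Austen"))

# Scan: reads every item, then filters; expensive on large tables.
in_genre = table.scan(FilterExpression=Attr("genre").eq("satire"))

print(by_author["Count"], in_genre["Count"])
```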

25
Q

DynamoDB Usage Model (Put and Get)

A

Put Operation:
- Function: Inserts an item into a table.
- Process: You create a table (via the AWS Console, CLI, or API) and then add items using a put command (for example, with the Python Boto3 API).

Get Operation:
- Function: Retrieves an item from the table using its primary key.
- Process: The get command uses the key (or composite key) to locate and return the item.
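A minimal sketch using the Boto3 API mentioned above (the Sessions table and its pk key are placeholders and must already exist):

```python
# Put and Get against an existing DynamoDB table via Boto3.
import boto3

table = boto3.resource("dynamodb").Table("Sessions")

# Put: insert (or overwrite) an item identified by its primary key.
table.put_item(Item={"pk": "session:42", "user": "alice"})

# Get: a fast single-item lookup by primary key.
resp = table.get_item(Key={"pk": "session:42"})
print(resp.get("Item"))
```
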
26
Q

DynamoDB Primary Key (Simple vs. Composite), Sort Key, Secondary Index (Local vs. Global)

A

Primary Key:
- Simple Primary Key: Consists of a single partition key.
- Composite Primary Key: Consists of a partition key and a sort key, allowing multiple items to share the same partition key but be uniquely identified by the sort key.

Sort Key: Orders items within the same partition, making it easier to query ranges of data.

Secondary Indexes:
- Local Secondary Index (LSI): Uses the same partition key as the table but a different sort key; provides strong consistency but is limited in partition size.
- Global Secondary Index (GSI): An index whose partition key and sort key can differ from those on the table. Offers more flexibility and is not limited to a single partition, but reads are eventually consistent.
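A hedged Boto3 sketch that creates a table with a composite primary key and a GSI (all names are illustrative):

```python
# Composite primary key (author + title) plus a GSI keyed on genre.
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="Books",
    AttributeDefinitions=[
        {"AttributeName": "author", "AttributeType": "S"},
        {"AttributeName": "title", "AttributeType": "S"},
        {"AttributeName": "genre", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "author", "KeyType": "HASH"},   # partition key
        {"AttributeName": "title", "KeyType": "RANGE"},   # sort key
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "genre-title-index",
        "KeySchema": [
            {"AttributeName": "genre", "KeyType": "HASH"},
            {"AttributeName": "title", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)
```
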
27
Q

How does DynamoDB structure data using tables, items, attributes, and primary keys?

A

Tables: Collections of items (analogous to relational tables) but without a fixed schema.
Items: Individual records within a table (similar to rows). Items can have varying attributes.
Attributes: Data elements within an item (similar to columns).
Primary Keys: Unique identifiers for items. They are critical for data partitioning and retrieval.
28
Q

What is the difference between a partition key and a composite primary key in DynamoDB?

A

Partition Key: A single attribute that is hashed to determine the partition where the item is stored. Used in simple key-value operations.
Composite Primary Key: Consists of a partition key plus a sort key. Allows multiple items with the same partition key but different sort keys, providing additional flexibility in how data is organized and queried.
29
Q

How does DynamoDB handle relational data without using traditional SQL joins?

A

Approach: Instead of using SQL joins, DynamoDB encourages denormalizing your data. Multiple types of entities (e.g., authors and books) can be stored in the same table by designing composite keys that “pre-join” related data at write time. Use the Query API on the composite key to retrieve all related items (for example, all books by a specific author) without needing runtime join operations.
Benefits: Eliminates the expensive join operation, making data retrieval fast and scalable.
Considerations: Data redundancy means that maintaining consistency becomes the responsibility of the application.
30
Q

What are the key strategies for modeling one-to-many relationships in DynamoDB?

A

Using Complex Attributes: Embed a list or map (e.g., a list of book titles in an author record). Works well only when the list is small and you don’t need to query individual elements.
Composite Primary Key with Query API: Store the parent and child entities in the same table, using the parent’s identifier as the partition key and the sort key to differentiate each child (e.g., each book title). This lets you retrieve all child records (or a subset) efficiently via a single query (see the sketch below).
Secondary Indexes & Hierarchical Sort Keys: Create secondary indexes to support alternate query patterns, or use hierarchical sort keys (e.g., for drilling down on dates or locations).
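A hedged sketch of the composite-key strategy (the Library table and the AUTHOR#/BOOK# key format are illustrative conventions, not required by DynamoDB):

```python
# One-to-many via composite keys: parent and children share a partition key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Library")

table.put_item(Item={"pk": "AUTHOR#austen", "sk": "PROFILE", "name": "Jane Austen"})
table.put_item(Item={"pk": "AUTHOR#austen", "sk": "BOOK#Emma", "year": 1815})
table.put_item(Item={"pk": "AUTHOR#austen", "sk": "BOOK#Persuasion", "year": 1817})

# One Query returns the author profile and all of their books: no join needed.
everything = table.query(KeyConditionExpression=Key("pk").eq("AUTHOR#austen"))

# Narrow to just the books with a sort-key condition.
books = table.query(
    KeyConditionExpression=Key("pk").eq("AUTHOR#austen") & Key("sk").begins_with("BOOK#")
)
print(everything["Count"], books["Count"])  # 3 2
```
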
31
Q

How does denormalization improve performance in DynamoDB, and what are the trade-offs?

A

Performance Benefits:
- Data is combined (denormalized) at write time, so queries can retrieve all the necessary information in one go.
- Eliminates the need for complex, runtime join operations, which are costly at scale.

Trade-offs:
- Information is duplicated, so updates might need to be propagated to multiple records.
- Ensuring data consistency becomes the responsibility of your application logic.
32
Q

What is the Query API, and what are its key features?

A

Targeted Retrieval: Retrieves items based on specific primary key values.
Sort Key Conditions: Supports conditions such as equality, less than, greater than, between, and “begins with” on the sort key.
Advantages: Uses the underlying indexed structure (B‑tree) to quickly return relevant items, and can return an entire collection of items that share the same partition key, making it ideal for retrieving related entities.
Usage Scenario: Best used when your data model is designed around your application’s access patterns, minimizing the need for full table scans.
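For example, a range condition on the sort key expressed with Boto3's condition builders (table and attribute names are placeholders):

```python
# Sort-key range query: one customer's orders within a date range.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Orders")

resp = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("cust-1001")
        & Key("order_date").between("2024-01-01", "2024-12-31")
    )
)
print(resp["Count"])
```
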
33
Q

What is Spanner?

A

Google Cloud Spanner is a globally distributed, strongly consistent relational database designed for high performance and horizontal scalability across multiple data centers. It was originally an internal system at Google and was later made publicly available.
34
Q

What is TrueTime, why was it created, what does it enable, and how does it work?

A

What it is: TrueTime is Google’s globally synchronized clock system that provides timestamps as intervals with bounded uncertainty (epsilon).
Why it was created: It addresses the problem of ordering transactions in a distributed system, where conventional clocks can never be perfectly synchronized.
What it enables: It lets Spanner assign globally consistent timestamps to transactions, ensuring that the real-time order of transactions is maintained. This guarantees external consistency, a guarantee even stronger than traditional serializability.
How it works: TrueTime combines signals from GPS satellites and atomic clocks in each Google data center. Instead of returning a single time value, it returns an interval (e.g., “the current time is between x and y”), and Spanner uses the upper bound of this range when committing a transaction, waiting until that bound has passed. This waiting period ensures that no transaction can later be assigned an earlier timestamp, preventing anomalies.
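A toy illustration of the commit-wait idea (tt_now() and the epsilon value are hypothetical stand-ins for the real TrueTime API):

```python
# Commit wait: choose the interval's upper bound as the commit timestamp,
# then wait until that instant is definitely in the past on every clock.
import time
from collections import namedtuple

Interval = namedtuple("Interval", ["earliest", "latest"])
EPSILON = 0.004  # assumed clock uncertainty (a few milliseconds)

def tt_now() -> Interval:
    """Hypothetical TrueTime: the true time lies somewhere in this interval."""
    now = time.time()
    return Interval(now - EPSILON, now + EPSILON)

def commit_timestamp() -> float:
    ts = tt_now().latest            # upper bound of the uncertainty interval
    while tt_now().earliest <= ts:  # commit wait, roughly 2 * EPSILON
        time.sleep(EPSILON / 4)
    return ts  # no later transaction can be assigned an earlier timestamp
```
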
35
Q

External vs. Strong vs. Weak Consistency

A

External Consistency (as used in Spanner): Guarantees that if one transaction commits before another begins, all nodes in the system see the transactions in that same order. It is even stronger than typical serializability.
Strong Consistency: All reads reflect the most recent write. In strongly consistent systems, every node returns the same data after a successful write.
Weak Consistency: Allows temporary discrepancies between replicas. Reads might not always reflect the most recent write but will eventually converge.
36
Q

Which of C, A, and P does Spanner sacrifice?

A

Spanner opts for Consistency and Partition tolerance. In scenarios like network failures, Spanner may sacrifice Availability by refusing to respond rather than risking inconsistency.
37
Q

What is Paxos, and what role does it play in Spanner?

A

Definition: Paxos is a consensus algorithm that helps distributed systems agree on a single value (or transaction order) even in the presence of failures.
Role in Spanner:
- Each data shard is managed by a Paxos group to ensure fault tolerance.
- Spanner integrates Paxos with a two-phase commit protocol. Instead of relying solely on a traditional two-phase commit (which would require all participants to be online), each participant is a Paxos group. This design lets the system reach consensus even if some nodes fail.
38
Q

Do any other services besides TrueTime use hardware-based time synchronization?

A

Microsoft Azure: Uses GPS clock synchronization via its VMICTimeSync provider.
Amazon AWS: Provides the Amazon Time Sync Service, offering highly accurate time synchronization across regions.
39
Q

What is Cosmos DB? Is it write-optimized or read-optimized? How is indexing achieved?

A

Definition: Azure Cosmos DB is a globally distributed, multi-model database service designed to address the challenges of global distribution, low latency, and flexible data models.
Write-Optimized: Cosmos DB is built as a write-optimized system. It is designed for low-latency transactions and high throughput, meaning that it efficiently handles a high volume of write operations.
Indexing: Every document (typically JSON) is automatically indexed. The system indexes every path within a document, which eliminates the need for manual index management. Indexing is built on the Bw-tree, a latch-free, log-structured record store optimized for modern multi-core processors. This design ensures high-speed updates and efficient query performance.
40
Q

What Consistency Models does Cosmos DB support? How do they differ from one another?

A

Eventual Consistency: Lowest latency; updates propagate asynchronously. Temporary inconsistencies may occur, but all changes will eventually be seen.
Consistent Prefix: Guarantees that writes are never returned out of order, though some updates might be missing if they haven’t been fully replicated yet.
Session Consistency: Ensures that a client (or session) sees its own updates in order. Widely used for personalized user experiences like shopping carts.
Bounded Staleness: Guarantees that reads lag behind writes by at most a fixed time interval or number of operations, ensuring a global ordering of writes with controlled staleness.
Strong Consistency: Guarantees that every read returns the most recent committed write. For performance reasons, this mode is typically limited to a single region.