Week 6: Data Storage Part 2: Cloud-based Databases and Data Warehouses (Flashcards)
What is an RDBMS?
Definition: A Relational Database Management System (RDBMS) stores data in structured tables with predefined schemas and relationships. It guarantees strong consistency and data integrity using ACID properties.
Core Components: Uses techniques such as B-tree indexes and buffer caches, and supports transactional (OLTP) workloads.
How do you extend a single-node RDBMS to multiple nodes?
Replication: Creating copies of the database for high availability.
Sharding: Splitting a large database into smaller, independent “shards” (using strategies like hash-based or range-based partitioning) so that each shard contains a subset of the data. This “shared-nothing” architecture reduces bottlenecks but adds complexity in coordinating cross-shard queries and maintaining consistency.
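A minimal sketch of hash-based sharding, to make the "each shard contains a subset of the data" idea concrete. The shard count and the `shard_for`, `put`, and `get` helpers are assumptions for illustration, not part of any particular database.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (Python's built-in hash() is randomized per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Each shard is an independent store that holds only its own subset of rows.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(key: str, row: dict) -> None:
    shards[shard_for(key)][key] = row

def get(key: str):
    # Only one shard is consulted; no other shard is involved in the lookup.
    return shards[shard_for(key)].get(key)

put("user:1", {"name": "Ada"})
put("user:2", {"name": "Grace"})
```

A query that filters on something other than the shard key would have to visit every shard, which is the coordination cost the card mentions.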
What is ACID?
Atomicity: Ensures all parts of a transaction succeed or none do.
Consistency: Guarantees that transactions only bring the database from one valid state to another.
Isolation: Prevents concurrent transactions from interfering with each other.
Durability: Once a transaction is committed, it remains so even if there is a system failure.
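Atomicity and consistency can be seen with a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the `transfer` helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction; any failure rolls back both updates."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates commit together
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, so the credit is undone too
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

In the failed transfer the credit to `bob` is applied first and then rolled back, so the database never exposes a state where money was created.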
What is the Purpose of a Managed Relational Database?
Operational Simplicity: Offloads tasks such as provisioning, patching, backups, scaling, and failover.
High Availability: Offers built-in features like multi-AZ deployments, automatic failover, and replication.
Focus on Applications: Developers can concentrate on application logic rather than database maintenance.
Examples: AWS RDS automates these tasks for various engines (MySQL, PostgreSQL, etc.), reducing administrative overhead.
How Does Replication of RDBMS in the Cloud Work?
Two mechanisms:
Synchronous Replication: Used in multi-AZ deployments, where a standby replica is kept up-to-date with the primary.
Asynchronous Replication: Used in read replicas to distribute read load, though these do not provide automatic failover.
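The difference between the two mechanisms can be sketched with a toy in-memory model. The `Primary` class and its method names are hypothetical; a real service ships changes over the network rather than through a local queue.

```python
from collections import deque

class Primary:
    def __init__(self):
        self.data = {}
        self.standby = {}        # synchronous standby (Multi-AZ style)
        self.read_replica = {}   # asynchronous read replica
        self._pending = deque()  # changes not yet applied to the read replica

    def write(self, key, value):
        self.data[key] = value
        self.standby[key] = value            # applied before the write is acknowledged
        self._pending.append((key, value))   # shipped to the read replica later
        return "ack"

    def ship_to_replica(self):
        """Apply queued changes; a real system does this continuously in the background."""
        while self._pending:
            key, value = self._pending.popleft()
            self.read_replica[key] = value

db = Primary()
db.write("order:1", "paid")
lagging = db.read_replica.get("order:1")    # replica has not caught up yet
db.ship_to_replica()
caught_up = db.read_replica.get("order:1")  # replica now reflects the write
```

The standby is always current, which is why it is safe to fail over to it, while the read replica can briefly serve stale data.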
What is AWS RDS (Definition, Automation, Deployment, Scaling, Performance)?
- Definition: A fully managed relational database service supporting multiple engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server).
- Automation: Automates provisioning, patching, backups, and recovery.
- Deployment Options: Offers both single-AZ (cost-effective) and multi-AZ (high availability) configurations.
- Scaling:
  - Vertical Scaling: Resizing instances for more CPU/memory.
  - Horizontal Scaling: Using read replicas to distribute query load.
- Performance: Uses a variety of storage types (General Purpose SSDs, Provisioned IOPS SSDs) and integrates performance monitoring tools like Performance Insights.
How do RDS Multi-AZ Deployments and Read Replicas Improve Availability and Performance?
Multi-AZ Deployments:
- High Availability: Maintain a standby replica in a different availability zone. In case of failure, automatic promotion minimizes downtime.
- Focus: Primarily designed for fault tolerance rather than scaling reads.
Read Replicas:
- Performance Scaling: Distribute read queries across multiple replicas, reducing load on the primary instance.
- Limitations: Do not provide automatic failover.
Differences Between AWS RDS and Amazon Aurora, and Use Cases for Each
AWS RDS:
- Traditional Approach: Uses conventional database architectures with tightly coupled compute and storage.
- Best For: Legacy applications or cases where fine-tuned control is needed.
- Deployment: Provides options for multi-AZ and read replicas, but scaling may require manual intervention.
Amazon Aurora:
- Cloud-Native Design: Separates compute from storage, enabling automatic and near-instant scaling.
- Performance: AWS cites up to 5× the throughput of standard MySQL and up to 3× that of standard PostgreSQL.
- High Availability: Uses a distributed, self-healing storage system with quorum-based replication.
- Use Cases: Mission-critical applications that require high performance, rapid failover, and scalability.
What is Amazon Aurora and How Does it Maintain High Availability and Scalability?
- Architecture: Fully managed, MySQL and PostgreSQL compatible database designed from the ground up for the cloud.
- High Availability:
  - Data is stored in a distributed storage system replicated across multiple availability zones.
  - Uses quorum-based writes to maintain data integrity even if some nodes fail.
- Scalability:
  - Separates compute from storage so that each can scale independently.
  - Supports up to 128 TB of storage and multiple read replicas.
- Failover: Near-instantaneous failover thanks to continuous background page generation and replication.
What are the Benefits of Aurora Serverless, and How Does it Handle Automatic Scaling for Variable Workloads?
On-Demand Scaling: Automatically adjusts capacity based on workload demands without manual intervention.
Cost Efficiency: Scales down (even pausing when idle) to reduce costs during low usage periods.
Fine-Grained Increments: Version 2 supports scaling in increments as small as half an ACU, offering near-instant response to workload changes.
Ideal For: Unpredictable or intermittent workloads such as development, testing, or applications with seasonal traffic.
What is the Purpose of the Redo Log in Aurora, and How Does it Improve Write Performance and Crash Recovery?
Aurora’s Innovation: The database instance writes only redo log records rather than full data pages.
Benefits:
Reduced Write Amplification: By avoiding full-page writes, Aurora minimizes the number of physical disk writes.
Faster Crash Recovery: The storage layer can reconstruct data pages on-demand from the redo logs, resulting in near-instant recovery.
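A toy model of "the log is the database": the instance sends only redo records, and the storage side builds a page when it is asked for one. The `RedoRecord` and `LogStructuredStorage` names are hypothetical, and real storage nodes also coalesce records into pages in the background.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RedoRecord:
    lsn: int       # log sequence number, increasing with each change
    page_id: int
    key: str
    value: str

class LogStructuredStorage:
    """Toy storage node: keeps only redo records and builds pages when asked."""

    def __init__(self):
        self.log = []

    def append(self, record: RedoRecord) -> None:
        self.log.append(record)  # the only write the database instance sends

    def materialize_page(self, page_id: int) -> dict:
        page = {}
        for rec in sorted(self.log, key=lambda r: r.lsn):
            if rec.page_id == page_id:
                page[rec.key] = rec.value  # later LSNs overwrite earlier ones
        return page

storage = LogStructuredStorage()
storage.append(RedoRecord(1, 7, "row:a", "v1"))
storage.append(RedoRecord(2, 7, "row:b", "v1"))
storage.append(RedoRecord(3, 7, "row:a", "v2"))
page = storage.materialize_page(7)
```

Because any page can be rebuilt from the log at read time, a restarted instance does not need to replay the whole log before accepting traffic.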
How Does Amazon Aurora Use Quorums and Replication to Ensure High Availability and Fault Tolerance?
Writes: Aurora writes data to six copies across three availability zones but requires acknowledgment from only four copies (a quorum) to commit a transaction. This design allows the system to tolerate failures.
Reads: Under normal conditions, Aurora reads from the fastest node without needing a full quorum, but during recovery or consistency checks, a quorum read (from at least three nodes) is used.
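The numbers above (six copies, write quorum of four, read quorum of three) can be checked with a little arithmetic. The helper names are assumptions for illustration, and the failure scenarios assume two copies per availability zone.

```python
V, V_W, V_R = 6, 4, 3  # copies, write quorum, read quorum (from the card above)

def quorum_is_valid(v: int, v_w: int, v_r: int) -> bool:
    # Every read quorum must overlap every write quorum (v_r + v_w > v),
    # and any two write quorums must overlap (2 * v_w > v).
    return v_r + v_w > v and 2 * v_w > v

def can_write(failed_copies: int) -> bool:
    return V - failed_copies >= V_W

def can_read(failed_copies: int) -> bool:
    return V - failed_copies >= V_R

valid = quorum_is_valid(V, V_W, V_R)
write_after_az_loss = can_write(2)        # one whole AZ down (2 copies lost)
write_after_az_plus_one = can_write(3)    # one AZ plus one more copy lost
read_after_az_plus_one = can_read(3)
```

Losing a whole zone still leaves four copies, so writes continue; losing one more copy blocks writes but still leaves a read quorum from which the data can be repaired.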
What is Aurora Fast Repair?
Mechanism: Aurora’s storage is divided into protection groups (logical 10-GB units). If a failure occurs, only the affected group is repaired rather than the entire volume.
Outcome: This targeted, massively parallel repair process (often termed “Aurora Fast Repair”) leads to extremely quick recovery with minimal performance impact.
When are Quorum Reads Necessary?
During Recovery: When verifying the most recent write state after a restart or during failure recovery, quorum reads ensure data consistency.
For Critical Consistency Checks: When an application cannot tolerate any stale data.
Why Not Use Quorum Reads Exclusively?
Performance Cost: Quorum reads incur additional latency and resource overhead, so for routine operations the system reads from the fastest available node.
Efficiency: Using quorum reads exclusively would slow down normal operations without providing significant benefits under stable conditions.
Key Differences Between Aurora and Traditional Relational Databases
- Write Efficiency:
  - Traditional RDBMS: Use full-page writes plus multiple logging steps (redo logs, binary logs, double writes) that lead to write amplification.
  - Aurora: Writes only redo log records; the distributed storage layer reconstructs pages on demand, dramatically reducing write amplification and increasing throughput.
- Failure Recovery:
  - Traditional RDBMS: Often experience prolonged recovery times as logs and full pages are replayed.
  - Aurora: Leverages its self-healing, distributed storage system to recover almost instantly after a failure.
- Overall Architecture:
  - Traditional Systems: Tightly couple compute and storage.
  - Aurora: Decouples compute from storage, enabling independent scaling and rapid failover through quorum-based replication.
How Does Aurora Global Database Enable Low-Latency Replication and Cross-Region Failover?
Replication Across Regions: Uses a dedicated replication infrastructure where redo logs are transferred directly from the primary region to secondary regions.
Low Latency: Replication lag is typically kept under one second during normal operations.
Cross-Region Failover: In the event of a regional outage, a secondary region can be quickly promoted to handle write operations, ensuring business continuity.
What is a NoSQL Database? What does it stand for?
NoSQL stands for “Not Only SQL.” Many NoSQL databases still support some SQL-like features, but they also offer alternative (non-relational) data models.
How do NoSQL Databases use key-value pairs? What are the advantages? What are the use cases?
Structure: Data is stored as key-value pairs. The key is used to store and retrieve the data, and the system does not look into the contents of the value.
Performance: Because retrieval is essentially a quick hash lookup, key-value databases are highly scalable and deliver very fast response times even with massive datasets.
Use Cases: Ideal for applications requiring simple get/put operations, such as caching and session management.
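A minimal sketch of the key-value model, using a Python dict as the hash table. The `KeyValueStore` class and the session example are hypothetical; the point is that the store treats the value as opaque bytes and only the application knows how to decode it.

```python
import json

class KeyValueStore:
    """Toy key-value store: the value is an opaque blob the store never inspects."""

    def __init__(self):
        self._table = {}  # a Python dict is a hash table

    def put(self, key: str, value: bytes) -> None:
        self._table[key] = value

    def get(self, key: str, default=None):
        return self._table.get(key, default)

    def delete(self, key: str) -> None:
        self._table.pop(key, None)

store = KeyValueStore()

# Session management: the application serializes the session, the store just keeps bytes.
session = {"user_id": 42, "cart": ["book", "pen"]}
store.put("session:abc123", json.dumps(session).encode("utf-8"))

raw = store.get("session:abc123")
restored = json.loads(raw.decode("utf-8"))
```

There is no way to ask this store for "all sessions with a pen in the cart" without reading every value, which is the trade-off for the simple and fast get/put interface.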
What is a Wide Column Database? What are some examples of Wide Column Databases?
- A wide column database organizes data into rows and dynamic columns, allowing each row to have a variable number of columns.
- Examples:
  - Google Bigtable: A fully managed wide-column NoSQL database.
  - HBase: Modeled after Bigtable.
  - Cassandra: Inspired by the Dynamo paper and similar in approach to wide column key-value systems.
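The "variable number of columns per row" idea can be sketched as a mapping from row key to a per-row dictionary of columns. The `put_cell` and `get_row` helpers and the `family:qualifier` column naming are assumptions for illustration, not the API of any of the systems listed.

```python
from collections import defaultdict

# row key -> {column name: value}; each row carries only the columns it actually has
table = defaultdict(dict)

def put_cell(row_key: str, column: str, value: str) -> None:
    table[row_key][column] = value

def get_row(row_key: str) -> dict:
    return dict(table.get(row_key, {}))

put_cell("user#1", "profile:name", "Ada")
put_cell("user#1", "profile:email", "ada@example.com")
put_cell("user#2", "profile:name", "Grace")
put_cell("user#2", "visits:2024-01-01", "3")  # a column that only this row has

row1 = get_row("user#1")
row2 = get_row("user#2")
```

Neither row needs a placeholder for the other's columns, unlike a relational table where every row shares one fixed schema.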
What are In Memory (Cache) Databases? (Common uses, Examples, pros and cons)
- Store data in RAM, enabling microsecond-level latency.
- Common Uses: Real-time applications, gaming, session storage, and real-time analytics.
- Examples:
  - AWS ElastiCache (supports Redis and Memcached)
  - Azure Cache for Redis
  - Google’s offerings with Redis and Memcached
- Pros: Very fast access times; excellent for reducing load on primary databases.
- Cons: Memory is a limited resource; not ideal for persistent storage of very large datasets.
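One common way a cache reduces load on a primary database is the cache-aside pattern, sketched below with a small in-process TTL cache. `TTLCache`, `load_from_database`, and `get_with_cache` are hypothetical stand-ins; a real deployment would call a cache server such as Redis or Memcached instead of a local dict.

```python
import time

class TTLCache:
    """Toy in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)

db_reads = 0

def load_from_database(key):
    """Stand-in for a slow query against the primary database."""
    global db_reads
    db_reads += 1
    return f"value-for-{key}"

cache = TTLCache(ttl_seconds=60)

def get_with_cache(key):
    value = cache.get(key)
    if value is None:                  # miss: fall back to the database
        value = load_from_database(key)
        cache.set(key, value)          # populate the cache for later reads
    return value

first = get_with_cache("user:1")   # miss, reads the database
second = get_with_cache("user:1")  # hit, served from memory
```

The TTL bounds how stale a cached value can become, and it also keeps memory use in check, which matters given the limited-RAM drawback noted above.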
What are Document Databases (Examples, Pros and Cons)? How does it make querying efficient?
Definition: Designed to store semi-structured data (e.g., JSON or XML documents).
Querying: Indexes paths in the document tree to allow for efficient queries on the document’s contents.
Examples:
- Amazon DocumentDB (offers MongoDB compatibility)
- Azure Cosmos DB (originally known as DocumentDB)
- Google Firestore (targeted at mobile app development)
- IBM Cloudant and native MongoDB services
Pros: Flexible schema; efficient for querying nested data.
Cons: May require more complex query designs compared to traditional relational databases.
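The "indexes paths in the document tree" idea from the Querying line can be sketched as an inverted index keyed by (path, value). The sample documents and the `iter_paths` and `find` helpers are made up for illustration; real document databases use more elaborate index structures.

```python
from collections import defaultdict

def iter_paths(doc: dict, prefix: str = ""):
    """Yield (dotted path, value) for every leaf in a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from iter_paths(value, path)
        else:
            yield path, value

documents = {
    "d1": {"name": "Ada", "address": {"city": "London"}},
    "d2": {"name": "Grace", "address": {"city": "New York"}},
    "d3": {"name": "Alan", "address": {"city": "London"}, "tags": "math"},
}

# (path, value) -> ids of the documents that contain that value at that path
path_index = defaultdict(set)
for doc_id, doc in documents.items():
    for path, value in iter_paths(doc):
        path_index[(path, value)].add(doc_id)

def find(path: str, value):
    """Answer an equality query from the index without opening every document."""
    return sorted(path_index.get((path, value), set()))

londoners = find("address.city", "London")
```

Documents with different shapes (only `d3` has `tags`) are indexed the same way, which is how a flexible schema and efficient nested queries can coexist.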
What is DynamoDB?
- A fully managed, serverless NoSQL database service from AWS.
- Implements a simple key-value store based on a distributed B‑tree data structure.
- Uses a consistent hashing algorithm to distribute data across partitions, ensuring scalability.
- Ideal for applications that fit a simple key-value model, such as storing application state, session tokens, etc.
- Integrated with AWS services and can be used in both serverless and traditional application architectures.
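A generic sketch of consistent hashing, the partitioning idea mentioned above. This is not DynamoDB's internal implementation; the `HashRing` class, the node names, and the use of 64 virtual nodes per node are all assumptions for illustration.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        self._vnodes = vnodes
        self._ring = []       # sorted list of (position, node)
        self._positions = []  # positions only, kept parallel to _ring for bisect
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first virtual node clockwise from the key's position owns the key.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

keys = [f"item-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

When `node-d` joins, the only keys that change owner are the ones it takes over; everything else stays where it was, which is what lets a partitioned store grow without reshuffling the whole dataset.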
What is the difference between a Query and a Scan?
Query:
Purpose: Retrieves items that share a given partition key value (optionally narrowed by sort key conditions).
Efficiency: Fast and efficient because it leverages the underlying index (B‑tree structure) built on the primary key.
Usage: Use when you know the partition key value and want to retrieve related items.
Scan:
Purpose: Reads every item in the table (or secondary index) and filters the results afterward.
Efficiency: More resource-intensive and slower, especially for large datasets.
Usage: Use sparingly, generally for infrequent or one-off queries where you need to filter on non-key attributes.
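The cost difference can be illustrated with a toy in-memory table. This is a simulation rather than the DynamoDB API: the sample items and the `query` and `scan` helpers are assumptions, and a sorted list searched with `bisect` stands in for the key index.

```python
import bisect

# Items kept sorted by (partition key, sort key), standing in for the key index.
items = sorted(
    [
        ("user#1", "order#001", {"total": 20}),
        ("user#1", "order#002", {"total": 75}),
        ("user#2", "order#001", {"total": 75}),
        ("user#3", "order#001", {"total": 10}),
    ],
    key=lambda item: (item[0], item[1]),
)
keys = [(pk, sk) for pk, sk, _ in items]

def query(pk: str):
    """Jump straight to one partition key's range; examines only matching items."""
    lo = bisect.bisect_left(keys, (pk, ""))
    hi = bisect.bisect_left(keys, (pk, "\uffff"))
    return items[lo:hi], hi - lo  # (results, number of items examined)

def scan(predicate):
    """Read every item, then filter; the cost grows with the size of the table."""
    results = [item for item in items if predicate(item[2])]
    return results, len(items)    # (results, number of items examined)

user1_orders, examined_by_query = query("user#1")
big_orders, examined_by_scan = scan(lambda attrs: attrs["total"] >= 75)
```

Both calls return two items here, but the query examined only those two while the scan examined the whole table, and that gap widens as the table grows.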