L9/12 - OLAP & OLTP in the cloud (DBaaS) Flashcards
What is the dominant architecture for high-performance data warehousing?
Shared-nothing architectures, where each node has its own storage, memory, and processing power.
What are the three main architectural dimensions of a data warehouse?
Storage – Data layout, format, partitioning, and distribution.
Query Engine – Optimization, execution models, parallelism, and multi-tenancy.
Cluster – Metadata sharing, resource management, and allocation.
What are the advantages of a column-store over a row-store?
Only reads relevant columns → better CPU cache usage.
More efficient compression (e.g., RLE, gzip).
Faster analytical queries, as fewer disk reads are required.
Examples: Apache Parquet, ORC.
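A minimal sketch of run-length encoding (RLE), assuming a plain Python list stands in for the column; real formats such as Parquet combine RLE with dictionary encoding and bit-packing:

```python
def rle_encode(column):
    """Collapse consecutive repeated values into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

# Sorted, low-cardinality columns compress especially well:
print(rle_encode(["DE", "DE", "DE", "FR", "FR", "US"]))
# [['DE', 3], ['FR', 2], ['US', 1]]
```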
What is pruning, and how does it help query performance?
Skipping irrelevant data blocks by using MinMax indexes (zone maps).
Allows horizontal filtering of rows, reducing I/O.
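A toy sketch of MinMax pruning, assuming each block keeps per-column min/max statistics (block contents elided):

```python
# Each block stores (min, max) for a column, so a filter like
# `col BETWEEN 50 AND 60` can skip non-overlapping blocks entirely.
blocks = [
    {"min": 0,  "max": 40},    # skipped
    {"min": 35, "max": 70},    # scanned (overlaps [50, 60])
    {"min": 80, "max": 120},   # skipped
]

def prune(blocks, lo, hi):
    """Keep only blocks whose [min, max] range overlaps [lo, hi]."""
    return [b for b in blocks if b["max"] >= lo and b["min"] <= hi]

print(len(prune(blocks, 50, 60)), "of", len(blocks), "blocks scanned")  # 1 of 3
```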
How is data distributed across a data warehouse?
Step 1 – System-driven distribution → ensures parallelism (each node processes a portion of the data).
Step 2 – User-specified partitioning (e.g., by range) → improves query performance by enabling partition pruning.
What is the difference between partitioning and distribution?
Partitioning is user-specified, used for data lifecycle management & pruning.
Distribution is system-driven to optimize parallel execution.
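A toy illustration of the two mechanisms, assuming hash distribution on the key and monthly range partitioning on a date column (all names are illustrative):

```python
import hashlib

NODES = 4

def node_for(key):
    """System-driven: hash the distribution key to spread rows across nodes."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % NODES

def partition_for(order_date):
    """User-specified: range-partition by month, enabling partition pruning."""
    return order_date[:7]   # e.g. '2024-03'

row = {"order_id": 42, "order_date": "2024-03-15"}
print("node:", node_for(row["order_id"]),
      "partition:", partition_for(row["order_date"]))
```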
How does vectorized execution improve performance?
Processes data in batches (thousands of rows at once).
Reduces I/O overhead and improves cache efficiency.
Examples: Actian Vortex, Hive, Drill, Snowflake, MySQL HeatWave.
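A rough sketch of batch-at-a-time processing, assuming NumPy stands in for a vectorized engine's columnar batches:

```python
import numpy as np

BATCH = 4096                                # thousands of rows per call
prices = np.random.rand(1_000_000) * 100

total = 0.0
for start in range(0, len(prices), BATCH):
    batch = prices[start:start + BATCH]
    total += batch[batch > 50].sum()        # one predicate over a whole batch,
                                            # not one function call per row
print(round(total, 2))
```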
What is Just-in-Time (JIT) compilation in query execution?
The query engine compiles optimized code specific to the query before execution.
Improves performance by avoiding interpretation overhead.
Used in AWS Redshift, HyPer (Tableau), SingleStore.
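A toy JIT in Python: generate source code for the exact predicate in the query, compile it once, then run the compiled function; real engines emit LLVM IR or C++ rather than Python:

```python
def jit_compile_filter(column, op, constant):
    """Build and compile a predicate specialized to one query's filter."""
    src = f"def pred(row):\n    return row[{column!r}] {op} {constant!r}\n"
    namespace = {}
    exec(compile(src, "<jit>", "exec"), namespace)
    return namespace["pred"]

pred = jit_compile_filter("price", ">", 50)   # compiled once per query
rows = [{"price": 30}, {"price": 70}]
print([r for r in rows if pred(r)])           # [{'price': 70}]
```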
What are key challenges when moving OLAP to the cloud?
Storage abstraction → Handling different formats efficiently.
Data distribution & partitioning → Becomes even more critical at scale.
Query optimization → Must be global and resource-aware.
Elasticity → Scaling compute and storage independently.
What are the different service models for cloud OLAP systems?
Reserved-capacity services → Fixed resources, predictable pricing.
Serverless instances → Pay only for executed queries.
Auto-scaling → Adjusts compute resources dynamically.
How does Shared-Nothing Architecture work?
Each query processor node has its own local storage.
Data is horizontally partitioned across nodes.
Minimal communication between nodes.
What are the drawbacks of Shared-Nothing Architecture in the cloud?
Compute & storage tightly coupled → Limits flexibility.
Heterogeneous workloads → Different workloads require different hardware setups.
Membership changes → Rebalancing data when nodes are added/removed is costly.
Online upgrades are difficult due to system coupling.
What is the benefit of separating compute and storage in modern cloud OLAP systems?
Elasticity → Compute and storage scale independently.
Cost-efficiency → Only pay for compute when needed.
Better failure handling → Storage remains available even if compute nodes fail.
Examples: Snowflake, AWS Redshift (new version), BigQuery.
How did the older version of AWS Redshift work?
Shared-nothing architecture.
Leader node parses queries and distributes workloads to compute nodes.
Uses JIT compilation (C++ MPP engine) for execution.
Coupled compute and storage → scaling was difficult.
How does Snowflake separate compute and storage?
Storage: Uses Amazon S3 / Azure Blob Storage / Google Cloud Storage.
Compute: Uses a shared-nothing execution engine (independent from storage).
Metadata & transactions handled separately.
How does Snowflake optimize table storage?
Stores data in compressed, immutable blocks.
Uses MinMax indexes for faster pruning.
Dynamically caches hot data in compute nodes.
How does Google BigQuery optimize large queries?
Stores data on the Colossus DFS and uses a disaggregated in-memory shuffle tier for fast joins & aggregation.
Dremel Query Engine → Optimized for SQL analytics.
Reduces shuffle latency ~10x compared to traditional disk-based shuffles.
How do modern OLAP systems handle node failures?
If a node fails, its workload is reassigned to another node.
Storage is usually replicated (S3, Azure Blob Storage, etc.) to avoid data loss.
Compute nodes can be reinstantiated dynamically.
How does serverless OLAP work?
Queries run on-demand, without fixed resource allocation.
Examples: AWS Athena, BigQuery, Azure SQL Serverless.
What is Disaggregated Compute-Storage Architecture?
Compute and storage are fully independent.
Enables multi-tenancy, elasticity, and better fault tolerance.
Examples: Snowflake, modern AWS Redshift.
What is Disaggregated Compute-Memory-Storage Architecture?
Adds a shared-memory layer to reduce shuffle cost.
Used for complex joins & aggregations.
Example: Google BigQuery (in-memory shuffle tier, with data on the Colossus DFS).
What is a Stateless Shared-Storage Architecture?
Compute nodes do not hold state (metadata & logs stored externally).
Enables dynamic scaling & high availability.
Examples: Azure Synapse (POLARIS engine), BigQuery.
What is Function-as-a-Service (FaaS)?
A serverless compute model where functions are triggered on demand.
Functions execute stateless, event-driven tasks.
Examples: AWS Lambda, Google Cloud Functions.
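A minimal AWS Lambda-style handler sketch; the function is stateless, so anything it must remember has to live in external storage:

```python
import json

def handler(event, context):
    """Invoked on demand by the platform; no local state survives the call."""
    query = event.get("query", "")
    # ...run the query against external storage (S3, a database, etc.)...
    return {"statusCode": 200, "body": json.dumps({"query": query})}

# Local invocation for illustration; in production the platform calls this:
print(handler({"query": "SELECT 1"}, None))
```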
What are the challenges of serverless OLAP?
Cold starts – Functions take time to initialize.
State management – Functions are stateless and must store data externally.
Stragglers – Some function instances take longer, affecting query latency.
What is OLTP, and how has the traditional database stack evolved over time?
OLTP (Online Transaction Processing) deals with high-throughput, low-latency transactions.
The monolithic database stack (SQL, transactions, caching, logging, storage) has remained largely unchanged for 30-40 years.
What are the three conventional approaches for scaling databases?
Sharding – Splitting data at the application layer.
Shared-Nothing – Each node has its own storage and transactions.
Shared-Disk – Nodes share storage but handle transactions separately.
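A sketch of application-layer sharding, with hypothetical connection strings; the application, not the database, routes each key to an independent instance:

```python
import zlib

SHARDS = [
    "postgres://db-shard-0/orders",
    "postgres://db-shard-1/orders",
    "postgres://db-shard-2/orders",
]

def shard_for(user_id):
    """Stable hash so a given user always lands on the same shard."""
    return SHARDS[zlib.crc32(user_id.encode()) % len(SHARDS)]

print(shard_for("alice"))   # every query for 'alice' hits one shard
```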
What makes OLTP challenging in cloud environments?
Failures happen frequently (permanent vs. transient failures).
Geo-distributed data centers introduce latency.
Storage types vary (fast/expensive vs. slow/cheap).
Ephemeral vs. persistent storage requires different strategies.
What are the key expectations for cloud-based OLTP (DBaaS)?
✅ Durability – Ensuring no data loss.
✅ High scalability – Handling up to 100TB+ databases.
✅ Fault tolerance – Fast recovery from failures.
✅ High availability – 99.999% uptime.
✅ Low tail latency – Minimize slow queries.
✅ Elasticity – Scaling resources dynamically.
✅ Cost efficiency – Balance performance and cost.
✅ Data locality guarantees – Compliance with laws like GDPR.
What are the four main cloud OLTP architectures?
Non-partitioned shared-nothing log-replicated state machine – Microsoft SQL Server HADR, PostgreSQL on AWS.
Non-partitioned shared-data – AWS Aurora, Google AlloyDB, Azure Socrates.
Partitioned shared-nothing log-replicated state machine – Google Spanner, CockroachDB, MongoDB.
Disaggregated architectures (separate compute, storage, and logging layers) – AWS Athena, MongoDB on Atlas.
How does Microsoft SQL Server’s HADR system work?
Primary node processes all updates.
Follower replicas receive update logs but handle only read queries.
Backups to Azure Storage every 5 minutes.
Requires 4 nodes (1 primary, 3 secondary) for durability and high availability.
What are the advantages of HADR?
✅ Mature and widely used in Azure.
✅ High performance since each compute node has a full local copy of the database.
What are the limitations of HADR?
❌ Database size is limited to a single node’s storage (~4TB).
❌ Replication is tightly coupled – scaling is expensive.
❌ Long-running transactions can block log truncation, causing unbounded log growth and slow recovery.
How does AWS Aurora decouple compute and storage?
Compute nodes only write redo logs to the distributed storage layer.
Storage nodes reconstruct pages on demand from the logs.
Materialized pages act as a cache of the log – pages are built only when needed.
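"The log is the database": a toy storage node that materializes a page on demand by replaying redo records instead of receiving full pages (the record layout is invented for illustration):

```python
redo_log = [
    {"page": 7, "lsn": 1, "change": ("balance", 100)},
    {"page": 7, "lsn": 2, "change": ("balance", 80)},
]

def materialize(page_id, up_to_lsn):
    """Rebuild a page by applying its redo records up to a given LSN."""
    page = {}
    for rec in redo_log:
        if rec["page"] == page_id and rec["lsn"] <= up_to_lsn:
            key, value = rec["change"]
            page[key] = value
    return page

print(materialize(7, up_to_lsn=2))   # {'balance': 80}
```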
What are the advantages of AWS Aurora design?
✅ Lower network I/O – Only log records are sent over the network.
✅ Faster recovery – No need to replay large amounts of redo logs.
✅ High availability – Aurora replicates data 6 times across 3 availability zones (AZs).
✅ Improved durability – Can survive entire data center failures.
How does Aurora handle fast repairs and catch-up?
Shards databases into 10GB chunks.
Each chunk is replicated 6 times across a distributed storage fleet.
Quorum-based replication (4/6 write quorum, 3/6 read quorum) allows quick failover.
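The quorum arithmetic behind this, sketched for Aurora's six-copy configuration:

```python
V, VW, VR = 6, 4, 3          # copies, write quorum, read quorum

assert VW + VR > V           # reads overlap the latest acknowledged write
assert 2 * VW > V            # two concurrent writes overlap (no split brain)

def write_ok(acks):    return acks >= VW   # tolerates an AZ outage (2 copies lost)
def read_ok(replies):  return replies >= VR

print(write_ok(4), read_ok(3))   # True True
```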
What is the key insight behind Socrates’ architecture?
Separates durability and availability:
Durability → Handled by a high-speed log service.
Availability → Handled by storage tier with replicas.
Why does Socrates treat the log separately?
Log is the bottleneck – every update must be logged before committing.
Solution: Store logs in fast storage and replicate them for fault tolerance.
How does Socrates ensure low commit latency?
Logs are written to a low-latency, durable landing zone (LZ).
XLOG process asynchronously disseminates logs to replicas.
Page servers store checkpoints and backups in Azure Storage.
What are the advantages of this Socrates architecture?
✅ Low write latency – Log service optimizes transaction commits.
✅ Better elasticity – Compute, log, and storage can scale independently.
What are the Socrates limitations?
❌ Higher read latency – If cache misses occur, logs must be replayed.
❌ More complex recovery mechanisms – Logs must be carefully managed.
What is the main motivation for PolarDB Serverless?
Reduce read latency using shared remote memory.
Reduce duplicate data loading across compute nodes.
Elasticity – Allocate memory on demand.
How does the data read/write path work in PolarDB Serverless?
Writes: Compute node updates cache → generates redo log → logs stored durably → updates propagate.
Reads: Compute node checks cache → If missing, fetches from shared buffer → If missing, fetches from storage.
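A tiered read-path sketch, with plain dicts standing in for the three tiers; in PolarDB Serverless the middle tier is a disaggregated memory pool reached over the network:

```python
local_cache = {}
shared_buffer = {"page:7": "hot page"}
storage = {"page:7": "hot page", "page:9": "cold page"}

def read_page(pid):
    if pid in local_cache:                 # fastest: node-local memory
        return local_cache[pid]
    if pid in shared_buffer:               # next: shared remote memory
        local_cache[pid] = shared_buffer[pid]
    else:                                  # slowest: disaggregated storage
        shared_buffer[pid] = local_cache[pid] = storage[pid]
    return local_cache[pid]

print(read_page("page:9"))   # cold read: storage -> shared buffer -> local
```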
What are the benefits of read/write in PolarDB Serverless?
✅ Low read latency – Reads from remote memory instead of disk.
✅ High throughput – Different compute nodes share the same buffer.
✅ Elastic memory allocation – Independent of compute/storage scaling.
What are the PolarDB downsides?
❌ Network bottleneck – Remote memory access requires high-speed networking (e.g., RDMA).
What is the Two-Phase Commit (2PC) Protocol?
Prepare phase – Nodes agree to commit or abort.
Commit phase – Coordinator sends final decision.
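A minimal 2PC sketch; the coordinator commits only if every participant votes yes in the prepare phase:

```python
class Participant:
    def __init__(self, ok): self.ok = ok
    def prepare(self):      return self.ok   # vote yes/no
    def finish(self, d):    self.state = d   # apply the final decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]    # phase 1: prepare
    decision = "commit" if all(votes) else "abort"
    for p in participants:                         # phase 2: broadcast
        p.finish(decision)
    return decision

print(two_phase_commit([Participant(True), Participant(True)]))    # commit
print(two_phase_commit([Participant(True), Participant(False)]))   # abort
```

If the coordinator crashes between the two phases, prepared participants block while holding locks – the problem the next card addresses.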
Why is 2PC problematic in distributed OLTP?
❌ Coordinator crashes cause blocking issues.
❌ Slow commit times if multiple nodes are involved.
✅ Solution: Use Paxos Commit to tolerate failures.
What makes Google Spanner unique?
Global-scale distributed transactions with serializability.
Uses TrueTime – timestamps with bounded uncertainty, synchronized via atomic clocks + GPS.
Partitions data into “splits” – Replicated using Paxos for consistency.
How does Spanner ensure consistent snapshots?
Uses Multi-Version Concurrency Control (MVCC).
Read-only transactions do not need locks.
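A snapshot-read sketch: each key keeps multiple timestamped versions, and a read-only transaction at timestamp T sees the newest version committed at or before T, without taking locks:

```python
versions = {
    "x": [(10, "a"), (20, "b"), (30, "c")],   # (commit_ts, value) pairs
}

def snapshot_read(key, read_ts):
    """Return the newest version visible at read_ts, never blocking writers."""
    visible = [(ts, v) for ts, v in versions[key] if ts <= read_ts]
    return max(visible)[1] if visible else None

print(snapshot_read("x", 25))   # 'b' – committed at ts=20, visible at ts=25
```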