L9/12 - OLAP & OLTP in the cloud (DBaaS) Flashcards
What is the dominant architecture for high-performance data warehousing?
Shared-nothing architectures, where each node has its own storage, memory, and processing power.
What are the three main architectural dimensions of a data warehouse?
Storage – Data layout, format, partitioning, and distribution.
Query Engine – Optimization, execution models, parallelism, and multi-tenancy.
Cluster – Metadata sharing, resource management, and allocation.
What are the advantages of a column-store over a row-store?
Reads only the relevant columns → better CPU cache usage.
More efficient compression (e.g., RLE, gzip).
Faster analytical queries, as fewer disk reads are required.
Examples: Apache Parquet, ORC.
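A minimal sketch in plain Python (not any specific engine's format) of why the column layout helps: an analytic query touching one column scans a single contiguous array, and low-cardinality columns compress well with run-length encoding (RLE).
```python
# Illustrative only: the same table stored row-wise vs. column-wise,
# plus RLE compression on one column.

rows = [  # row-store: each record kept together
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 12.5},
    {"id": 3, "country": "FR", "amount": 7.0},
]

columns = {  # column-store: each attribute is a contiguous array
    "id": [1, 2, 3],
    "country": ["DE", "DE", "FR"],
    "amount": [10.0, 12.5, 7.0],
}

def rle_encode(values):
    """Run-length encoding: consecutive duplicates become [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

# SELECT SUM(amount) only touches one array in the column layout...
print(sum(columns["amount"]))            # 29.5
# ...and repetitive columns shrink under RLE:
print(rle_encode(columns["country"]))    # [['DE', 2], ['FR', 1]]
```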
What is pruning, and how does it help query performance?
Skipping irrelevant data blocks using MinMax indexes (per-block minimum/maximum values).
Allows horizontal filtering of rows, reducing I/O.
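A hedged sketch of MinMax pruning in plain Python: each block records its column min/max, and blocks whose range cannot match the predicate are skipped without being read. Block sizes and values are made up for illustration.
```python
# Illustrative MinMax (zone-map) pruning: skip blocks whose min/max range
# cannot possibly satisfy the predicate, so they are never read at all.

blocks = [
    {"min": 1,   "max": 100, "rows": list(range(1, 101))},
    {"min": 101, "max": 200, "rows": list(range(101, 201))},
    {"min": 201, "max": 300, "rows": list(range(201, 301))},
]

def scan_where_greater_than(blocks, threshold):
    result, blocks_read = [], 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # pruned: no row in this block can qualify
        blocks_read += 1
        result.extend(v for v in block["rows"] if v > threshold)
    return result, blocks_read

matches, read = scan_where_greater_than(blocks, 250)
print(len(matches), "matching rows,", read, "of", len(blocks), "blocks read")
```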
How is data distributed across a data warehouse?
Step 1: System-driven distribution → ensures parallelism (each node processes a portion of the data).
Step 2: User-specified partitioning (e.g., by range) → improves query performance by enabling partition pruning.
What is the difference between partitioning and distribution?
Partitioning is user-specified, used for data lifecycle management & pruning.
Distribution is system-driven to optimize parallel execution.
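A small sketch (hypothetical table and keys) contrasting the two mechanisms: hash-based distribution spreads rows across nodes for parallelism, while user-specified range partitioning groups rows so whole partitions can be pruned.
```python
# Illustrative contrast: system-driven distribution (hash) vs.
# user-specified partitioning (range). Table and keys are made up.

orders = [
    {"order_id": i, "order_date": f"2024-{(i % 12) + 1:02d}-01"}
    for i in range(1, 25)
]

NUM_NODES = 4

def distribute(rows):
    """System-driven: hash the key so every node gets a share -> parallelism."""
    nodes = {n: [] for n in range(NUM_NODES)}
    for r in rows:
        nodes[hash(r["order_id"]) % NUM_NODES].append(r)
    return nodes

def partition_by_month(rows):
    """User-specified: partition by month -> enables partition pruning."""
    parts = {}
    for r in rows:
        parts.setdefault(r["order_date"][:7], []).append(r)
    return parts

print({n: len(v) for n, v in distribute(orders).items()})  # parallel shares
partitions = partition_by_month(orders)
# A query filtering on order_date only touches the matching partition:
march = partitions.get("2024-03", [])
print(len(march), "rows scanned instead of", len(orders))
```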
How does vectorized execution improve performance?
Processes data in batches (thousands of rows at once).
Reduces I/O overhead and improves cache efficiency.
Examples: Actian Vortex, Apache Hive, Apache Drill, Snowflake, MySQL HeatWave.
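A rough illustration of batch-at-a-time processing versus row-at-a-time interpretation; NumPy stands in for a vectorized engine here and is not how any of the listed systems are implemented.
```python
# Rough illustration of "vectorized" batch processing vs. row-at-a-time work.
import numpy as np

amounts = np.random.rand(1_000_000) * 100.0

def total_over_threshold_rows(values, threshold):
    """Row-at-a-time: per-value interpretation overhead on every row."""
    total = 0.0
    for v in values:
        if v > threshold:
            total += v
    return total

def total_over_threshold_batch(values, threshold):
    """Batch-at-a-time: predicate and sum run over whole arrays, which keeps
    data in CPU caches and amortizes interpretation overhead."""
    return values[values > threshold].sum()

print(total_over_threshold_rows(amounts, 50.0))
print(total_over_threshold_batch(amounts, 50.0))  # same result, far fewer passes through the interpreter
```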
What is Just-in-Time (JIT) compilation in query execution?
The query engine compiles optimized code specific to the query before execution.
Improves performance by avoiding interpretation overhead.
Used in AWS Redshift, Tableau HyPer, SingleStore.
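A toy sketch of the idea (not how Redshift, HyPer, or SingleStore generate code): the engine emits source code specialized to one query's predicate and compiles it once before execution, instead of interpreting a generic expression tree for every row.
```python
# Toy sketch of query-specific code generation: build source specialized to one
# query's filter, compile it once, then run the compiled function.
# Real systems generate and compile C++/machine code instead of Python.

def compile_filter(column, op, constant):
    src = (
        "def compiled_predicate(row):\n"
        f"    return row[{column!r}] {op} {constant!r}\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)  # the "JIT" step
    return namespace["compiled_predicate"]

rows = [{"price": 5}, {"price": 42}, {"price": 17}]
pred = compile_filter("price", ">", 10)   # specialized to this query
print([r for r in rows if pred(r)])       # [{'price': 42}, {'price': 17}]
```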
What are key challenges when moving OLAP to the cloud?
Storage abstraction → Handling different formats efficiently.
Data distribution & partitioning → Becomes even more critical at scale.
Query optimization → Must be global and resource-aware.
Elasticity → Scaling compute and storage independently.
What are the different service models for cloud OLAP systems?
Reserved-capacity services → Fixed resources, predictable pricing.
Serverless instances → Pay only for executed queries.
Auto-scaling → Adjusts compute resources dynamically.
How does Shared-Nothing Architecture work?
Each query processor node has its own local storage.
Data is horizontally partitioned across nodes.
Minimal communication between nodes.
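A minimal sketch (nodes simulated as plain Python objects) of the shared-nothing pattern: each node scans only its own local partition, and a coordinator merely merges the partial results.
```python
# Minimal shared-nothing sketch: each "node" owns its local partition and
# computes a partial aggregate; the coordinator only merges partials.

class Node:
    def __init__(self, local_rows):
        self.local_rows = local_rows  # node-private storage

    def partial_sum(self, key):
        return sum(r[key] for r in self.local_rows)

# Data is horizontally partitioned across nodes up front.
all_rows = [{"amount": float(i)} for i in range(1, 101)]
nodes = [Node(all_rows[i::4]) for i in range(4)]

# Each node works independently (minimal inter-node communication),
# then the coordinator merges the partial aggregates.
total = sum(node.partial_sum("amount") for node in nodes)
print(total)  # 5050.0
```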
What are the drawbacks of Shared-Nothing Architecture in the cloud?
Compute & storage tightly coupled → Limits flexibility.
Heterogeneous workloads → Different workloads require different hardware setups.
Membership changes → Rebalancing data when nodes are added/removed is costly.
Online upgrades are difficult due to system coupling.
What is the benefit of separating compute and storage in modern cloud OLAP systems?
Elasticity → Compute and storage scale independently.
Cost-efficiency → Only pay for compute when needed.
Better failure handling → Storage remains available even if compute nodes fail.
Examples: Snowflake, AWS Redshift (new version), BigQuery.
How did the older version of AWS Redshift work?
Shared-nothing architecture.
Leader node parses queries and distributes workloads to compute nodes.
Uses JIT compilation (C++ MPP engine) for execution.
Coupled compute and storage → scaling was difficult.
How does Snowflake separate compute and storage?
Storage: Uses Amazon S3 / Azure Blob Storage / Google Cloud Storage.
Compute: Uses a shared-nothing execution engine (independent of storage).
Metadata & transactions handled separately.
How does Snowflake optimize table storage?
Stores data in compressed, immutable blocks.
Uses MinMax indexes for faster pruning.
Dynamically caches hot data in compute nodes.
How does Google BigQuery optimize large queries?
Uses a distributed shuffle tier backed by Colossus DFS for fast joins & aggregations.
Dremel Query Engine → Optimized for SQL analytics.
Reduces shuffle latency 10x compared to traditional systems.
How do modern OLAP systems handle node failures?
If a node fails, its workload is reassigned to another node.
Storage is usually replicated (S3, Azure Blob Storage, etc.) to avoid data loss.
Compute nodes can be reinstantiated dynamically.
How does serverless OLAP work?
Queries run on-demand, without fixed resource allocation.
Examples: AWS Athena, BigQuery, Azure SQL Serverless.
What is Disaggregated Compute-Storage Architecture?
Compute and storage are fully independent.
Enables multi-tenancy, elasticity, and better fault tolerance.
Examples: Snowflake, modern AWS Redshift.
What is Disaggregated Compute-Memory-Storage Architecture?
Adds a shared-memory layer to reduce shuffle cost.
Used for complex joins & aggregations.
Example: Google BigQuery (Colossus DFS shuffle tier).
What is a Stateless Shared-Storage Architecture?
Compute nodes do not hold state (metadata & logs stored externally).
Enables dynamic scaling & high availability.
Examples: Azure Synapse (POLARIS engine), BigQuery.
What is Function-as-a-Service (FaaS)?
A serverless compute model where functions are triggered on demand.
Functions execute stateless, event-driven tasks.
Examples: AWS Lambda, Google Cloud Functions.
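A minimal sketch of a stateless, event-driven function in the handler shape AWS Lambda's Python runtime uses; the event payload and the local invocation at the end are hypothetical illustrations.
```python
# Minimal stateless function sketch; event fields below are made up.
import json

def lambda_handler(event, context):
    # No local state survives between invocations: everything the function
    # needs arrives in the event or is fetched from external storage.
    records = event.get("records", [])
    total = sum(r.get("amount", 0) for r in records)
    return {
        "statusCode": 200,
        "body": json.dumps({"processed": len(records), "total": total}),
    }

# Local invocation for illustration (in the cloud, an event source triggers it):
print(lambda_handler({"records": [{"amount": 3}, {"amount": 4}]}, None))
```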
What are the challenges of serverless OLAP?
Cold starts – Functions take time to initialize.
State management – Functions are stateless and must store data externally.
Stragglers – Some function instances take longer, affecting query latency.