Week 7: Scalable Data Storage Flashcards

1
Q

What is Google Bigtable?

A
  • A fully managed, highly scalable NoSQL database service offered on Google Cloud (based on Google’s internal Bigtable system). It’s designed to handle massive amounts of structured data with low latency for real-time analytics and operational workloads.
  • Bigtable automatically partitions data into tablets (the basic units of data distribution) and assigns them to nodes in a cluster. This architecture enables horizontal scalability.
  • Data is organized into rows and column families. Each row is uniquely identified by a row key, and rows are stored in lexicographic (sorted) order by these keys.
  • Data is first written to an in-memory structure (memtable) and then periodically flushed to disk as immutable SSTables. This write-optimized design uses a log-structured merge (LSM) storage model.
  • In the cloud, persistent storage (such as Google Cloud Storage) holds the data, ensuring durability and efficient retrieval.
2
Q

How to query efficiently with Bigtable?

A

- The most efficient queries are those that use the row key.
- Use prefix-based scans for range queries.
- Apply server-side filters to minimize data transfer.
- Avoid full-table scans, as they are costly and slow.
Row Key Design:
- Keys should be designed to distribute the load evenly (avoid sequential keys that can create hotspots).
- Use hashing or salting techniques to mitigate potential bottlenecks (see the sketch below).
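A minimal sketch of a prefix scan and a salted row key using the google-cloud-bigtable Python client; the project, instance, table, and key names are placeholders, and the exact client API may differ by version.

import zlib
from google.cloud import bigtable

# Placeholder project/instance/table names for illustration only.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("sensor-data")

# Prefix-based range scan: all rows whose key starts with "device#17#".
for row in table.read_rows(start_key=b"device#17#", end_key=b"device#17#\xff"):
    print(row.row_key, row.cells)

# Salting a sequential key (e.g., a timestamp) to spread writes across tablets.
def salted_key(device_id: str, ts: int, buckets: int = 10) -> bytes:
    salt = zlib.crc32(device_id.encode()) % buckets   # stable bucket per device
    return f"{salt}#{device_id}#{ts}".encode()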

3
Q

Advantages and disadvantages of Bigtable

A

Advantages:
- Designed for millions of writes per second with sub-10ms latency.
- Single-row operations are strongly consistent.
- Column-family based design allows efficient storage even when many columns are empty.
- Simplifies horizontal scaling.

Disadvantages:
- No support for SQL joins or secondary indexes.
- Poor row key design can lead to hotspots and performance degradation.
- Only supports atomicity at the single-row level.

When to Use:
High-ingestion workloads, real-time analytics, IoT data, time-series data, and applications that require fast row lookups.

When Not to Use:
Workloads requiring complex multi-row transactions, multi-table joins, or rich SQL querying capabilities.

4
Q

What is HBase?

A

Apache HBase is an open-source, distributed, column-oriented data store built on top of the Hadoop Distributed File System (HDFS). Modeled after Google’s Bigtable, it provides real-time read/write access to large, sparse datasets.

5
Q

HBase vs HDFS

A

HDFS (Hadoop Distributed File System):
- Designed for batch processing of very large files.
- Optimized for high throughput but not for random or record-level access.
- Not suitable for small, incremental updates or random lookups.
HBase:
- Builds on HDFS by adding a database abstraction (tables, rows, and columns).
- Supports random, real-time read and write operations.
- Ideal for applications that require quick record lookups, updates, and small batch insertions.
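As an illustration of HBase’s random, record-level access (in contrast to HDFS’s file-oriented batch model), here is a small sketch using the happybase Python client; it assumes a running HBase Thrift gateway, and the host, table, and column-family names are made up.

import happybase

connection = happybase.Connection("hbase-host")   # hypothetical Thrift gateway host
table = connection.table("users")                 # assumes a 'users' table with family 'info'

# Random, real-time write of a single record.
table.put(b"user#42", {b"info:name": b"alice", b"info:email": b"alice@example.com"})

# Random, real-time read of that record by row key.
print(table.row(b"user#42"))

# Small range scan over a key prefix.
for key, data in table.scan(row_prefix=b"user#"):
    print(key, data)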

6
Q

What are HBase building blocks?

A

HFile: The basic on-disk storage format. Represents an immutable, sorted map from keys to values. Stored in HDFS and optimized for efficient lookups via block indexing.

MemStore: An in-memory write buffer that temporarily stores updates. Flushes its contents to disk as HFiles once certain thresholds are reached.

HRegion: A contiguous range of rows in an HBase table. Acts as the fundamental unit of data distribution and load balancing.

HRegionServer: The server process responsible for managing one or more HRegions. Handles all read and write requests for its assigned regions.

HBase Master: Oversees the cluster by assigning HRegions to region servers. Monitors region server health, manages schema changes, and performs load balancing.

ZooKeeper: Provides coordination and synchronization between HBase nodes. Maintains metadata (e.g., region assignments) and ensures that each HRegion is exclusively locked to one region server.

7
Q

What is HRegion assignment?

A

Dynamic Assignment: The HBase Master continuously monitors live HRegionServers (using ZooKeeper) and assigns HRegions to them. Each HRegion is exclusively managed by one HRegionServer at a time.

Load Balancing and Failover: Regions may be re-assigned if a server fails or if new servers are added. HRegions that grow too large can be split by the HRegionServer; conversely, small regions can be merged.

Coordination with ZooKeeper: ZooKeeper is used to maintain exclusive locks on regions, ensuring that no two servers serve the same region simultaneously.

8
Q

Why is caching necessary?

A

Caching is essential to reduce latency. Cached data (often stored in memory) can be delivered in sub-millisecond times compared to slower database queries. Improved responsiveness can significantly affect user experience and even sales, as small delays (e.g., 100–250 milliseconds) can lead to noticeable decreases in engagement and revenue.

9
Q

What is the principle of locality?

A

Temporal Locality:
The tendency of a processor or application to repeatedly access the same memory locations over a short time period.
Spatial Locality:
If a particular data item is accessed, nearby data items are likely to be accessed soon after.
Application Across Systems:
The idea of locality is universal, whether in CPU caches, operating system memory management (like paging and TLBs), or distributed systems such as web caching and CDNs.

10
Q

Caching in processors, OS, and distributed systems

A

Processors: Multi-level caches (L1, L2, L3) for instructions and data optimize the speed of CPU operations.

Operating Systems: Virtual memory systems use caches (e.g., TLBs, page caches) to speed up address translation and file access.

Distributed Systems: Web servers, CDNs, and reverse proxies (like Nginx or Varnish) use caching to serve content faster by reducing distance and load on the origin servers.

11
Q

Cache hit vs miss

A

Cache Hit: Occurs when the requested data is found in the cache.

Cache Miss: Happens when the data is not in the cache, forcing a retrieval from a slower source (e.g., a database) and subsequently updating the cache.

12
Q

Cache replacement policies

A

Least Recently Used (LRU): Evicts the entry that has gone unused the longest (the least recently used) to make room for new data.
FIFO (First-In, First-Out), Least Frequently Used (LFU), and time-aware LRU (which factors in expiration times) are also common strategies.
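A toy LRU cache to make the eviction rule concrete, using only the Python standard library; the capacity and keys are arbitrary.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()          # keeps keys in access order

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                             # "a" becomes most recently used
cache.put("c", 3)                          # evicts "b"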

13
Q

Cache writing/updating policies

A

Cache-Aside (Lazy Loading): The application checks the cache first; on a miss, it fetches data from the source, returns it, and then updates the cache.

Write-Through: Data is written to both the cache and the backing store simultaneously, ensuring consistency but adding latency.

Write-Back (Write-Behind): Data is written only to the cache initially and asynchronously flushed to the backing store later; this improves write performance but risks data loss if the cache fails before the flush.

Write-Around: Data is written directly to the backing store, bypassing the cache, which can be beneficial when data is not expected to be re-read soon.
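A hedged sketch contrasting write-through and write-back against a Redis cache; db_save is a hypothetical stand-in for the backing store, and the flush loop is greatly simplified.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def db_save(key, value):
    """Hypothetical write to the backing database."""
    ...

def write_through(key, value):
    r.set(key, value)        # update the cache
    db_save(key, value)      # and the backing store in the same request

dirty = set()                # keys written to the cache but not yet persisted

def write_back(key, value):
    r.set(key, value)        # fast path: cache only
    dirty.add(key)           # remember to persist later

def flush_dirty():
    for key in list(dirty):  # asynchronous/batched flush (e.g., on a timer)
        db_save(key, r.get(key))
        dirty.discard(key)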

14
Q

Cache coherency mechanisms

A

Maintaining Consistency: In systems with multiple caches, mechanisms such as snooping (caches listening to each other) and directory-based protocols (a central manager for cache state) help keep data consistent.

Example Protocol: The MESI (Modified, Exclusive, Shared, Invalid) protocol is a widely used scheme in CPU caches to manage coherence.

15
Q

Basic properties of Memcached

A

Basic Properties: A simple, in-memory key-value store designed for speed. It does not support persistence; if a node fails, the cached data is lost.
Data Distribution: Memcached servers are unaware of each other; client libraries use consistent hashing to distribute keys across the cluster.
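A short sketch of that client-side key distribution with pymemcache’s HashClient; the node addresses are placeholders, and the client (not the servers) decides which node owns each key.

from pymemcache.client.hash import HashClient

# Hypothetical cluster nodes; the client hashes each key onto one of them.
client = HashClient([("10.0.0.1", 11211), ("10.0.0.2", 11211), ("10.0.0.3", 11211)])

client.set("user:42:name", "alice", expire=300)
print(client.get("user:42:name"))   # routed to the same node by the key hash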

16
Q

How does the Auto Discovery feature assist clients?

A

Provides a configuration endpoint that dynamically lists all active nodes in the cluster, allowing client libraries to adjust automatically to changes such as node additions or failures.

17
Q

Control flow for reading a key-value pair from Memcached

A

Check cache → On miss, fetch from the database → Update the cache with the new data.
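The same control flow as a small sketch using pymemcache; load_from_db is a hypothetical slow database lookup, and the host and TTL are arbitrary.

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_from_db(key):
    """Hypothetical database lookup."""
    ...

def read(key):
    value = cache.get(key)                   # 1. check the cache
    if value is None:                        # 2. miss: fall back to the database
        value = load_from_db(key)
        cache.set(key, value, expire=600)    # 3. populate the cache for next time
    return value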

18
Q

Replication groups in Redis

A

Redis clusters use node groups (or shards), where each shard has one primary node for reads/writes and multiple replica nodes for read scaling and high availability. Managed deployments support large numbers of shards per cluster (for example, 250 or more, with limits raised in newer versions).

19
Q

Redis vs Memcached

A

Redis: Supports a variety of data structures (strings, hashes, lists, sets, sorted sets, etc.), offers persistence, and has built-in replication and advanced features (pub/sub, atomic operations).
Memcached: Simpler and optimized solely for quick key-value retrieval without persistence.

20
Q

Cache writing/updating strategies ↔ scenarios

A

When to Use Cache-Aside: Best for read-heavy applications where caching only data that is actually read reduces unnecessary cache writes.

When to Use Write-Through: Ideal for write-heavy applications or where data consistency is critical, as every update is immediately reflected in both the cache and the source.

Trade-Offs: Lazy loading (cache-aside) can lead to stale data if the underlying data changes, while write-through can incur higher latency per write.

21
Q

Cache Sharding

A

Data Distribution: Sharding splits the data across multiple nodes (using techniques like consistent hashing) to handle large datasets and balance load.

Challenges: Re-sharding when nodes are added or removed can be complex, and uneven access patterns (e.g., “celebrity” data) can overload certain shards.
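A minimal consistent-hashing ring to illustrate how sharding maps keys to nodes and why only a fraction of keys move when a node is added or removed; real systems also add virtual nodes for smoother balance. The node names are made up.

import bisect, hashlib

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, h) % len(self.ring)  # first node clockwise from the key
        return self.ring[i][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))   # the shard responsible for this key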

22
Q

How does TTL help improve the write-back strategy?

A

Purpose: TTL ensures that cached data expires after a set period, which helps prevent the cache from becoming stale.

Improving Write-Back Strategy: By expiring old data automatically, TTL minimizes the risk of serving outdated information and helps manage cache churn effectively.
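A minimal illustration of attaching a TTL to cached entries (Redis shown; the key name and 300-second lifetime are arbitrary), so that old entries expire automatically.

import redis

r = redis.Redis(decode_responses=True)
r.set("user:42:profile", '{"name": "alice"}', ex=300)  # entry expires after 5 minutes
print(r.ttl("user:42:profile"))                        # seconds remaining before eviction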

23
Q

What is Redis and what are its main properties?

A

Core Characteristics: Redis is an in-memory key-value store designed for speed, supporting various data types including strings, hashes, lists, sets, and sorted sets. It is written in C and utilizes a single-threaded, non-blocking I/O model to achieve high performance.

Persistence Options: Although primarily in-memory, Redis can periodically write its state to disk (checkpointing) to enable recovery with minimal data loss.

Advanced Features: Atomic operations (e.g., increment/decrement), pub/sub messaging, and support for complex data structures make Redis versatile for many applications.

24
Q

Redis use cases?

A

Session Storage: Ideal for maintaining user sessions in web applications due to its speed and in-memory data model.

Caching Dynamic Content: Frequently used to cache API responses, HTML pages, or any data that benefits from rapid access.

Real-Time Applications: Suitable for leaderboards, real-time analytics, and logging, where rapid updates and reads are crucial.

Temporary Data Storage: Effective for storing ephemeral data, temporary records, or any dataset where persistence is not the highest priority.
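Two of these use cases sketched with redis-py; the key names, session payload, and scores are made up.

import redis

r = redis.Redis(decode_responses=True)

# Session storage: the session expires automatically after 30 minutes.
r.setex("session:abc123", 1800, '{"user_id": 42, "cart": []}')

# Real-time leaderboard built on a sorted set.
r.zadd("leaderboard", {"alice": 120, "bob": 95, "carol": 150})
print(r.zrevrange("leaderboard", 0, 9, withscores=True))  # top 10 players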

25
Decoupling in cloud architecture
Definition & Importance: Decoupling means breaking down an application into individual components that communicate asynchronously. This separation allows each part to scale, update, or fail independently without affecting the entire system.
Benefits:
- Systems can easily expand or shrink in response to demand.
- Component failures are isolated, improving overall system resiliency.
- Different teams can work on separate components without tight integration constraints.
26
Message Queues and Decoupling
Role & Function: Message queues act as intermediaries between producers (which send messages) and consumers (which receive them). This buffering mechanism enables asynchronous communication and helps prevent system overload during traffic spikes.
27
Publish Subscribe model
Overview: Unlike the typical queue model (where each message is consumed by a single consumer), the publish–subscribe (pub–sub) model allows messages to be delivered to multiple subscribers.
Key Characteristics:
- Fan-out Architecture: A single published message can trigger notifications to several endpoints (such as SQS queues, Lambda functions, HTTP endpoints, email, SMS, or mobile push notifications).
- Push vs. Pull: In pub–sub systems like AWS SNS, messages are pushed to subscribers rather than being pulled.
28
How do queues achieve decoupling?
- Loose Coupling: Producers and consumers do not interact directly; instead, they exchange messages via the queue.
- Reliability: Even if one component is slow or fails, messages are safely stored until they can be processed.
29
What is the put-get-delete paradigm?
Put: Producers send (or “put”) messages into the SQS queue.
Get: Consumers pull (or “get”) messages from the queue. When a message is retrieved, it becomes temporarily invisible to other consumers.
Delete: Once a consumer has successfully processed the message, it sends a delete command to permanently remove the message from the queue.
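The paradigm sketched with boto3; the region, queue URL, and message body are placeholders. The WaitTimeSeconds argument also demonstrates long polling, covered in a later card.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Put: the producer sends a message.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Get: the consumer receives it; WaitTimeSeconds=20 makes this a long poll.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])   # process the message here
    # Delete: remove the message only after successful processing.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])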
30
At-least-Once and Exactly-Once guarantee (in SQS)
At-Least-Once Delivery (Standard Queues): A message is delivered at least once. If a consumer fails to delete it (for example, due to processing errors), the message becomes visible again for reprocessing.
Exactly-Once Processing (FIFO Queues): FIFO (First-In, First-Out) queues aim to deliver each message exactly once with guaranteed order. This is achieved using content-based deduplication (using an SHA-256 hash of the message body). Note that exactly-once semantics are only achievable under limited conditions and come with trade-offs such as reduced throughput.
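For FIFO queues specifically, a send looks like this sketch (the .fifo queue URL and IDs are placeholders); the deduplication ID can be omitted when content-based deduplication is enabled on the queue.

import boto3

sqs = boto3.client("sqs")
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",  # placeholder
    MessageBody='{"order_id": 42}',
    MessageGroupId="customer-7",           # ordering is preserved within a group
    MessageDeduplicationId="order-42-v1",  # repeats of the same ID are dropped
)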
31
Short vs Long Pull
Short Polling: Consumers quickly check for new messages. If none are available, the call returns empty immediately, which can lead to frequent polling and increased costs.
Long Polling: Consumers can set a waiting period (up to 20 seconds) during which the request waits for a message to become available. This reduces the number of requests and helps control costs.
32
How does SQS deal with failures?
Visibility Timeout: When a consumer retrieves a message, it becomes invisible for a set period. If the consumer fails to process and delete the message within this timeout, it becomes visible again for other consumers.
Dead Letter Queue (DLQ): After a message fails processing a predetermined number of times, it can be moved to a DLQ. This allows for isolation and troubleshooting of problematic messages without blocking the main queue.
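A sketch of attaching a DLQ via a redrive policy with boto3; the queue URL and DLQ ARN are placeholders, and maxReceiveCount is the number of failed receives before a message is moved to the DLQ.

import boto3, json

sqs = boto3.client("sqs")
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
    Attributes={
        "VisibilityTimeout": "60",   # seconds a received message stays hidden
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-queue-dlq",
            "maxReceiveCount": "5",  # move to the DLQ after 5 failed attempts
        }),
    },
)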
33
Topics (SNS)
A topic is a communication channel to which publishers send messages. Subscribers register with topics to receive notifications. This setup supports a one-to-many delivery model where a single published message is pushed to all subscribed endpoints.
Push Model: Unlike SQS, SNS actively pushes messages to subscribers instead of waiting for consumers to poll.
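A minimal fan-out sketch with boto3; the topic name, email address, and SQS ARN are placeholders. One publish reaches every subscriber of the topic.

import boto3

sns = boto3.client("sns", region_name="us-east-1")
topic_arn = sns.create_topic(Name="order-events")["TopicArn"]

# Multiple subscribers; each receives every message pushed to the topic.
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="ops@example.com")
sns.subscribe(TopicArn=topic_arn, Protocol="sqs",
              Endpoint="arn:aws:sqs:us-east-1:123456789012:order-queue")  # placeholder ARN

sns.publish(TopicArn=topic_arn, Subject="Order placed", Message='{"order_id": 42}')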
34
General contents of SNS messages
Subject: A brief title or summary of the message.
Time to Live (TTL): Specifies how long a message is valid. If the message is not delivered within this time, it is dropped.
Payload: The main content of the message, which can be customized per endpoint (allowing for different formats depending on whether the subscriber is an email, HTTP endpoint, etc.).
35
What is Kafka (main components and characteristics) and why is it important?
Definition: Apache Kafka is a distributed event streaming platform designed for high-throughput, low-latency data pipelines and real-time processing. Its ability to decouple data producers from consumers and handle massive message volumes makes it ideal for modern microservices and big data architectures.
Characteristics:
- Horizontally scales by distributing topics across multiple brokers and partitions.
- Messages are stored in an immutable, append-only log with configurable replication to protect against failures.
- Supports both publish–subscribe (pub/sub) and queue-based messaging patterns, enabling asynchronous communication across systems.
- Optimized for handling millions of messages per second, making it suitable for real-time analytics, ETL pipelines, and data lake ingestion.
36
What are Kafka topics, partitions, and offsets, and how do they enable scalable message storage?
Topics: Logical channels that group related messages (e.g., “user_actions” or “inventory_updates”).
Partitions: Each topic is split into one or more partitions—each an ordered, immutable, append-only log. This design enables parallelism since different partitions can be hosted on different brokers.
Offsets: Each message within a partition is assigned a unique, sequential offset. Consumers use these offsets to track their reading progress. Offsets are critical for replayability: if a consumer fails or needs to reprocess data, it can restart from a known offset without data loss.
Scalability: This architecture supports scalable message storage by allowing multiple consumers to process data in parallel and by letting topics grow independently of consumer speed.
37
How do producers and consumers interact with Kafka?
Producers:
- Role: Applications that send (publish) messages to Kafka topics.
- Mechanism: Use the Producer API to send records and can control aspects like batching, compression, and acknowledgment settings.
- Partitioning: Producers can specify which partition a message should go to or let Kafka’s default strategies (e.g., round-robin or key-based hashing) determine the partition.
Consumers:
- Role: Applications that subscribe to topics and process incoming messages.
- Consumer Groups: Consumers typically work as part of a group so that each partition is processed by only one consumer at a time. This ensures load balancing and parallel processing.
- Offset Management: Kafka tracks consumer offsets (often stored in an internal topic) to ensure that each consumer resumes reading from the correct position.
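A hedged producer/consumer sketch using the kafka-python client; the broker address, topic name, and group ID are placeholders. Key-based sends keep all events for one user in the same partition.

import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("user_actions", key="user-42", value={"event": "click"})
producer.flush()

consumer = KafkaConsumer(
    "user_actions",
    bootstrap_servers="localhost:9092",
    group_id="analytics",                 # partitions are shared across this group
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode()),
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
    consumer.commit()                     # record progress via the consumer offset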
38
Partitioning strategies for producers
Round-Robin: When no key is provided, messages are evenly distributed across partitions.
Hash-Based: When a key is provided, Kafka hashes the key to determine the partition—ensuring that messages with the same key (e.g., user ID) always go to the same partition, which helps preserve ordering.
Custom Partitioners: Developers can implement custom partitioning logic to meet specific requirements (e.g., grouping related events together or achieving better load balance).
Pros: Ensures parallelism, load distribution, and ordering when needed.
Cons: Using key-based partitioning may lead to “hot partitions” if a particular key generates a large volume of messages. Increasing partitions later may also break ordering guarantees for keys.
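The three strategies sketched with kafka-python’s send call; the broker, topic, partition count, and key are assumptions, and a stable hash stands in for custom partitioner logic.

import zlib
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker
NUM_PARTITIONS = 6  # assumed partition count for the (hypothetical) topic

# No key: the client spreads messages across partitions.
producer.send("user_actions", value=b'{"event": "view"}')

# Key-based: the same key always hashes to the same partition, preserving order.
producer.send("user_actions", key=b"user-42", value=b'{"event": "click"}')

# Custom logic: compute a partition yourself and pass it explicitly.
partition = zlib.crc32(b"user-42") % NUM_PARTITIONS
producer.send("user_actions", value=b'{"event": "purchase"}', partition=partition)
producer.flush()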
39
How does Kafka handle message retention? Different types (time-based, size-based) and the trade-offs
Time-Based Retention: Messages can be configured to be retained for a set period (e.g., 7 days).
Size-Based Retention: Retention can also be based on log size (e.g., 10 GB per partition).
Trade-Offs:
- Longer Retention: Enables reprocessing and serves use cases like auditing and analytics but requires more storage.
- Shorter Retention: Conserves storage but may limit the ability to replay events in case of consumer failure or when historical analysis is needed.
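A sketch of setting both retention types when creating a topic with kafka-python’s admin client; the broker, topic name, and counts are placeholders, and the exact admin API may vary by client version.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="user_actions",
        num_partitions=6,
        replication_factor=3,
        topic_configs={
            "retention.ms": "604800000",       # time-based: 7 days
            "retention.bytes": "10737418240",  # size-based: ~10 GB per partition
        },
    )
])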
40
Difference between at-least-once and exactly-once processing and its pros, cons
At-Least-Once Processing:
- Default Behavior: Guarantees that messages are never lost, but duplicates may occur in case of consumer reprocessing or network issues.
- Handling Duplicates: Requires idempotent consumer logic or external deduplication strategies (e.g., using an idempotency key stored in a cache or database).
Exactly-Once Processing:
- Mechanism: Utilizes Kafka transactions, idempotent producers, and Kafka Streams to ensure each message is processed only once—even in failure scenarios.
- Trade-Off: Provides stronger consistency and eliminates duplicates, but introduces additional coordination overhead, potentially reducing throughput.
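One common way to make at-least-once delivery safe is idempotent consumption. This sketch records an idempotency key in Redis so that redelivered messages are skipped; the key prefix, TTL, and handler are hypothetical.

import redis

r = redis.Redis(decode_responses=True)

def apply_business_logic(payload: dict):
    """Hypothetical effectful processing of the message."""
    ...

def handle(message_id: str, payload: dict):
    # SET with nx=True succeeds only the first time this message ID is seen.
    first_time = r.set(f"processed:{message_id}", 1, nx=True, ex=86400)
    if not first_time:
        return                    # duplicate delivery: already processed, skip
    apply_business_logic(payload)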
41
How do Kafka Streams enable real-time data processing?
Built on Kafka’s producer and consumer APIs, Kafka Streams allows you to process data in real time directly from Kafka. It is ideal for building applications that require real-time analytics, event-driven processing, or stateful transformations without needing a separate processing cluster.
Operations Supported:
- Stateless: Simple mappings, filtering, etc.
- Stateful: Windowing (tumbling, hopping, sliding), aggregations, and joins.
Fault Tolerance: Maintains local state with changelog topics for recovery.
42
Kafka vs RabbitMQ vs Amazon SQS
Kafka:
- Architecture: Log-based storage with partitioning and replication.
- Strengths: High throughput, low latency, built-in fault tolerance, and replay capabilities.
- Best For: Streaming data pipelines, event-driven architectures, and large-scale log aggregation.
RabbitMQ:
- Model: Traditional queue model with complex routing capabilities.
- Strengths: Excellent for flexible routing and scenarios requiring complex message patterns.
- Best For: Asynchronous workloads that need sophisticated routing.
Amazon SQS:
- Model: Fully managed, serverless message queue service with simplicity in mind.
- Strengths: Removes infrastructure management and is easy to integrate with AWS services.
- Best For: Smaller-scale or simpler use cases where managed service benefits are desired.
43
Common use cases for Kafka
Event-Driven Architectures: Decouple microservices by streaming events between them.
Real-Time Analytics: Build dashboards and monitoring systems that process and analyze data on the fly.
ETL Pipelines: Extract, transform, and load data into data lakes or warehouses continuously.
Data Lake Ingestion: Stream data from various sources (applications, sensors, IoT devices) into a centralized repository.
Log Aggregation & Monitoring: Collect and process log data for system monitoring and troubleshooting.
44
What is managed Kafka in the cloud (AWS MSK, Azure Event Hubs) and why is it important?
Managed Kafka services (e.g., AWS MSK) and Kafka-compatible or analogous managed streaming services (e.g., Azure Event Hubs, Google Cloud Pub/Sub) provide a cloud-based deployment where infrastructure management, scaling, and maintenance are handled by the provider.
Key Benefits:
- Simplified Deployment: No need to set up and manage Kafka clusters manually.
- Automated Scaling & Monitoring: Cloud services automatically adjust resources based on workload.
- High Availability & Security: Built-in replication, multi-AZ deployments, and robust security features (e.g., TLS, IAM integration).
- Cost Efficiency: Pay-as-you-go pricing and reduced operational overhead.
45
How to optimize Kafka deployments (partitioning strategies, monitoring consumer lag, and integrating with cloud services)?
Partitioning Strategy:
- Right Sizing: Plan the number of partitions based on expected load and consumer counts.
- Balancing Act: More partitions increase parallelism but add overhead; too few may limit throughput.
Monitoring Consumer Lag:
- Why It Matters: High consumer lag can indicate bottlenecks or processing delays.
- Tools: Use built-in metrics and external monitoring (e.g., CloudWatch, Prometheus) to keep track of performance.
Integration with Cloud Services:
- Managed Services: Leverage managed Kafka offerings to reduce operational complexity.
- Configuration Tuning: Adjust producer batch sizes, compression, consumer fetch sizes, and JVM settings to maximize performance.
Best Practices:
- Regularly test and adjust your partitioning strategy in staging environments before rolling changes to production.
- Balance throughput with latency requirements, especially when using key-based partitioning which can affect ordering.
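A hedged sketch of measuring consumer lag with kafka-python; the broker, topic, and group ID are placeholders. Lag is the gap between the latest offset in each partition and the group's committed offset.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="analytics")
partitions = [TopicPartition("user_actions", p)
              for p in consumer.partitions_for_topic("user_actions")]

end_offsets = consumer.end_offsets(partitions)   # latest offset per partition
for tp in partitions:
    committed = consumer.committed(tp) or 0      # group's last committed offset
    print(tp.partition, "lag =", end_offsets[tp] - committed)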