Week 7: Scalable Data Storage Flashcards
What is Google Bigtable?
- A fully managed, highly scalable NoSQL database service offered on Google Cloud. It is designed to handle massive amounts of structured data with low latency for real-time analytics and operational workloads.
- Bigtable automatically partitions data into tablets (the basic units of data distribution) and assigns them to nodes in a cluster. This architecture enables horizontal scalability.
- Data is organized into rows and column families. Each row is uniquely identified by a row key, and rows are stored in lexicographic order by those keys.
- Data is first written to an in-memory structure (memtable) and then periodically flushed to disk as immutable SSTables. This write-optimized design follows a log-structured merge (LSM) storage model, sketched after this card.
- In the cloud, persistent storage (such as Google Cloud Storage) holds the data, ensuring durability and efficient retrieval.
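A toy Python sketch of this LSM write path (a simplified illustration, not Bigtable's actual implementation; the memtable threshold and the linear SSTable scan are for clarity only):

```python
class TinyLSM:
    """Toy LSM store: writes land in an in-memory memtable, which is
    flushed as immutable, key-sorted runs (stand-ins for SSTables)."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.sstables = []  # key-sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable as an immutable, sorted run on "disk".
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:  # newest data wins
            return self.memtable[key]
        for run in reversed(self.sstables):
            for k, v in run:  # linear scan for clarity; real SSTables use block indexes
                if k == key:
                    return v
        return None
```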
How to query efficiently with Bigtable?
- The most efficient queries are those that use the row key.
- Use prefix-based scans for range queries.
- Apply server-side filters to minimize data transfer.
- Avoid full-table scans, which are costly and slow. (A query sketch follows this list.)
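A hedged sketch of these patterns with the google-cloud-bigtable Python client (project, instance, table, and key names are assumptions; iteration details can vary by client version):

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project="my-project")          # hypothetical project
table = client.instance("my-instance").table("events")  # hypothetical IDs

# Point lookup by row key, with a server-side filter that keeps only the
# most recent cell per column to minimize data transfer.
row = table.read_row(b"user42#2024", filter_=row_filters.CellsColumnLimitFilter(1))

# Prefix scan: read every row whose key starts with "user42#". The end key
# is the prefix with its last byte incremented (simplified; ignores 0xff).
prefix = b"user42#"
end_key = prefix[:-1] + bytes([prefix[-1] + 1])
for r in table.read_rows(start_key=prefix, end_key=end_key):
    print(r.row_key)
```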
Row Key Design:
- Keys should be designed to distribute the load evenly (avoid sequential keys that can create hotspots).
- Use hashing or salting techniques to mitigate potential bottlenecks, as sketched below.
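A minimal sketch of the salting technique, assuming a time-series write pattern (the bucket count and key layout are illustrative):

```python
import hashlib

def salted_row_key(device_id: str, timestamp: int, buckets: int = 8) -> str:
    # A stable salt derived from the natural key spreads monotonically
    # increasing timestamps across tablets instead of hotspotting one.
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % buckets
    return f"{salt:02d}#{device_id}#{timestamp}"
```

The trade-off: reads must fan out across all salt buckets to reassemble a full range.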
Advantages and disadvantages of Bigtable
Advantages:
- Designed for millions of writes per second with sub-10ms latency.
- Single-row operations are strongly consistent.
- Column-family based design allows efficient storage even when many columns are empty.
- Simplifies horizontal scaling.
Disadvantages:
- No support for SQL joins or secondary indexes.
- Poor row key design can lead to hotspots and performance degradation.
- Only supports atomicity at the single-row level.
When to Use:
High-ingestion workloads, real-time analytics, IoT data, time-series data, and applications that require fast row lookups.
When Not to Use:
Workloads requiring complex multi-row transactions, multi-table joins, or rich SQL querying capabilities.
What is HBase?
Apache HBase is an open-source, distributed, column-oriented data store built on top of the Hadoop Distributed File System (HDFS). Modeled after Google’s Bigtable, it provides real-time read/write access to large, sparse datasets.
HBase vs HDFS
HDFS (Hadoop Distributed File System):
- Designed for batch processing of very large files.
- Optimized for high throughput but not for random or record-level access.
- Not suitable for small, incremental updates or random lookups.
HBase:
- Builds on HDFS by adding a database abstraction (tables, rows, and columns).
- Supports random, real-time read and write operations.
- Ideal for applications that require quick record lookups, updates, and small batch insertions (see the sketch after this list).
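A small sketch of that random read/write access via the happybase Python library (assumes a running HBase Thrift server and a pre-created table; all names are illustrative):

```python
import happybase

connection = happybase.Connection("localhost")  # HBase Thrift server
table = connection.table("users")               # hypothetical table

# Random, real-time write: a single row keyed by user ID.
table.put(b"user42", {b"info:name": b"Ada", b"info:email": b"ada@example.com"})

# Random, real-time read of that same row.
print(table.row(b"user42"))

# Small range scan over a row-key prefix.
for key, data in table.scan(row_prefix=b"user"):
    print(key, data)
```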
What are HBase building blocks?
HFile: The basic on-disk storage format. Represents an immutable, sorted map from keys to values. Stored in HDFS and optimized for efficient lookups via block indexing.
MemStore: An in-memory write buffer that temporarily stores updates. Flushes its contents to disk as HFiles once certain thresholds are reached.
HRegion: A contiguous range of rows in an HBase table. Acts as the fundamental unit of data distribution and load balancing.
HRegionServer: The server process responsible for managing one or more HRegions. Handles all read and write requests for its assigned regions.
HBase Master: Oversees the cluster by assigning HRegions to region servers. Monitors region server health, manages schema changes, and performs load balancing.
ZooKeeper: Provides coordination and synchronization between HBase nodes. Maintains metadata (e.g., region assignments) and ensures that each HRegion is exclusively locked to one region server.
What is HRegion assignment?
Dynamic Assignment: The HBase Master continuously monitors live HRegionServers (using ZooKeeper) and assigns HRegions to them. Each HRegion is exclusively managed by one HRegionServer at a time.
Load Balancing and Failover: Regions may be re-assigned if a server fails or if new servers are added. HRegions that grow too large can be split by the HRegionServer; conversely, small regions can be merged.
Coordination with ZooKeeper: ZooKeeper is used to maintain exclusive locks on regions, ensuring that no two servers serve the same region simultaneously.
Why is caching necessary?
Caching is essential to reduce latency. Cached data (often stored in memory) can be delivered in sub-millisecond times compared to slower database queries. Improved responsiveness can significantly affect user experience and even sales, as small delays (e.g., 100–250 milliseconds) can lead to noticeable decreases in engagement and revenue.
What is the principle of locality?
Temporal Locality:
The tendency of a processor or application to repeatedly access the same memory locations over a short time period.
Spatial Locality:
If a particular data item is accessed, nearby data items are likely to be accessed soon after.
Application Across Systems:
The idea of locality is universal, whether in CPU caches, operating system memory management (like paging and TLBs), or distributed systems such as web caching and CDNs. A small demonstration of spatial locality follows this card.
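A small demonstration of spatial locality (the gap is modest in pure Python, since lists store pointers; in C or NumPy it is far larger):

```python
import time

N = 2000
matrix = [[1] * N for _ in range(N)]

def traverse(row_major: bool) -> float:
    start = time.perf_counter()
    total = 0
    for a in range(N):
        for b in range(N):
            # Row-major order visits adjacent elements; column-major
            # jumps between rows, defeating spatial locality.
            total += matrix[a][b] if row_major else matrix[b][a]
    return time.perf_counter() - start

print(f"row-major:    {traverse(True):.3f}s")
print(f"column-major: {traverse(False):.3f}s")
```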
Caching in processors, OS, and distributed systems
Processors: Multi-level caches (L1, L2, L3) for instructions and data optimize the speed of CPU operations.
Operating Systems: Virtual memory systems use caches (e.g., TLBs, page caches) to speed up address translation and file access.
Distributed Systems: Web servers, CDNs, and reverse proxies (like Nginx or Varnish) use caching to serve content faster by reducing distance and load on the origin servers.
Cache hit vs miss
Cache Hit: Occurs when the requested data is found in the cache.
Cache Miss: Happens when the data is not in the cache, forcing a retrieval from a slower source (e.g., a database) and subsequently updating the cache.
Cache replacement policies
Least Recently Used (LRU): Evicts the entry that has gone the longest without being accessed, making room for new data.
FIFO (First-In, First-Out), Least Frequently Used (LFU), and time-aware LRU (which factors in expiration times) are also common strategies; an LRU sketch follows.
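A minimal LRU sketch built on Python's OrderedDict (capacity handling and the miss convention are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None                       # cache miss
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
```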
Cache writing/updating policies
Cache-Aside (Lazy Loading): The application checks the cache first; on a miss, it fetches data from the source, returns it, and then updates the cache.
Write-Through: Data is written to both the cache and the backing store simultaneously, ensuring consistency but adding latency.
Write-Back (Write-Behind): Data is written only to the cache initially and asynchronously flushed to the backing store later; this improves write performance but risks losing writes if the cache fails before the flush, and the backing store lags behind in the meantime.
Write-Around: Data is written directly to the backing store, bypassing the cache, which can be beneficial when data is not expected to be re-read soon. (A sketch contrasting these policies follows.)
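A compact sketch contrasting these policies, using plain dicts as stand-ins for the cache and backing store (all names are illustrative):

```python
cache, db = {}, {}
dirty = set()  # keys written to the cache but not yet flushed (write-back)

def write_through(key, value):
    cache[key] = value
    db[key] = value              # synchronous write to the backing store

def write_back(key, value):
    cache[key] = value
    dirty.add(key)               # flushed later, e.g. on eviction or a timer

def flush_dirty():
    for key in dirty:
        db[key] = cache[key]
    dirty.clear()

def write_around(key, value):
    db[key] = value              # bypasses the cache entirely

def cache_aside_read(key):
    if key in cache:             # hit
        return cache[key]
    value = db.get(key)          # miss: fetch from the source...
    cache[key] = value           # ...then populate the cache
    return value
```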
Cache coherency mechanisms
Maintaining Consistency: In systems with multiple caches, mechanisms such as snooping (caches listening to each other) and directory-based protocols (a central manager for cache state) help keep data consistent.
Example Protocol: The MESI (Modified, Exclusive, Shared, Invalid) protocol is a widely used scheme in CPU caches to manage coherence.
Basic properties of Memcached
Basic Properties: A simple, in-memory key-value store designed for speed. It does not support persistence; if a node fails, the cached data is lost.
Data Distribution: Uses consistent hashing to distribute keys across a cluster.
How does the Auto Discovery feature assist clients?
Provides a configuration endpoint that dynamically lists all active nodes in the cluster, allowing client libraries to adjust automatically to changes such as node additions or failures.
Control flow for reading a key value pair from Memcached
Check cache → On miss, fetch from the database → Update the cache with the new data.
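The same control flow as a hedged sketch with the pymemcache client (assumes a local memcached instance; fetch_user_from_db is a hypothetical database helper):

```python
from pymemcache.client.base import Client

mc = Client(("localhost", 11211))  # assumes memcached on the default port

def get_user(user_id: str):
    key = f"user:{user_id}"
    value = mc.get(key)                       # 1. check the cache
    if value is None:                         # 2. miss: go to the database
        value = fetch_user_from_db(user_id)   # hypothetical helper; returns bytes/str
        mc.set(key, value, expire=300)        # 3. repopulate with a 5-minute TTL
    return value
```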
Replication groups in Redis
Redis clusters use node groups (shards), where each shard has one primary node for reads and writes plus multiple replica nodes for read scaling and high availability. Large shard counts are supported (historically up to 250 per cluster, with higher limits in newer versions).
Redis vs Memcached
Redis: Supports a variety of data structures (strings, hashes, lists, sets, sorted sets, etc.), offers persistence, and has built-in replication and advanced features (pub/sub, atomic operations).
Memcached: Simpler and optimized solely for quick key-value retrieval without persistence.
Matching cache writing/updating strategies to scenarios
When to Use Cache-Aside: Best for read-heavy applications where caching only data that is actually read reduces unnecessary cache writes.
When to Use Write-Through: Ideal for write-heavy applications or where data consistency is critical, as every update is immediately reflected in both the cache and the source.
Trade-Offs: Lazy loading (cache-aside) can lead to stale data if the underlying data changes, while write-through can incur higher latency per write.
Cache Sharding
Data Distribution: Sharding splits the data across multiple nodes (using techniques like consistent hashing, sketched after this card) to handle large datasets and balance load.
Challenges: Re-sharding when nodes are added or removed can be complex, and uneven access patterns (e.g., “celebrity” data) can overload certain shards.
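A minimal consistent-hash ring with virtual nodes (the vnode count and MD5 are illustrative choices); adding or removing a node remaps only the keys on its arc of the ring:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        # Each physical node gets many virtual points, smoothing the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect_right(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]
```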
How does TTL help improve Write Back Strategy
Purpose: TTL ensures that cached data expires after a set period, which helps prevent the cache from becoming stale.
Improving Write-Back Strategy: By expiring old data automatically, TTL minimizes the risk of serving outdated information and helps manage cache churn effectively.
What is Redis and what are its main properties?
Core Characteristics: Redis is an in-memory key-value store designed for speed, supporting various data types including strings, hashes, lists, sets, and sorted sets. It is written in C and utilizes a single-threaded, non-blocking I/O model to achieve high performance.
Persistence Options: Although primarily in-memory, Redis can persist its state to disk (snapshot checkpointing and/or an append-only file) to enable recovery with minimal data loss.
Advanced Features: Atomic operations (e.g., increment/decrement), pub/sub messaging, and support for complex data structures make Redis versatile for many applications; a short sketch follows.
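A brief sketch of these features with the redis-py client (assumes a local Redis server; keys and values are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# String with a TTL: a session token that expires in 30 minutes.
r.set("session:abc123", "user42", ex=1800)

# Atomic increment: concurrency-safe counters without read-modify-write races.
r.incr("page:views")

# Sorted set as a leaderboard; ZREVRANGE returns the highest scorers first.
r.zadd("leaderboard", {"alice": 120, "bob": 95})
print(r.zrevrange("leaderboard", 0, 2, withscores=True))
```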
Redis use cases?
Session Storage: Ideal for maintaining user sessions in web applications due to its speed and in-memory data model.
Caching Dynamic Content: Frequently used to cache API responses, HTML pages, or any data that benefits from rapid access.
Real-Time Applications: Suitable for leaderboards, real-time analytics, and logging, where rapid updates and reads are crucial.
Temporary Data Storage: Effective for storing ephemeral data, temporary records, or any dataset where persistence is not the highest priority.