Week 7: Scalable Data Storage Flashcards

1
Q

What is Google Bigtable?

A
  • A fully managed, highly scalable NoSQL database service offered on Google Cloud (based on Google’s internal Bigtable system). It’s designed to handle massive amounts of structured data with low latency for real-time analytics and operational workloads.
  • Bigtable automatically partitions data into tablets (the basic units of data distribution) and assigns them to nodes in a cluster. This architecture enables horizontal scalability.
  • Data is organized into rows and column families. Each row is uniquely identified by a row key, and rows are stored in lexicographic (sorted) order by these keys.
  • Data is first written to an in-memory structure (memtable) and then periodically flushed to disk as immutable SSTables. This write-optimized design uses a log-structured merge (LSM) storage model.
  • In the cloud, persistent storage (such as Google Cloud Storage) holds the data, ensuring durability and efficient retrieval.
2
Q

How to query efficiently with Bigtable?

A

- The most efficient queries are those that use the row key.
- Use prefix-based scans for range queries.
- Apply server-side filters to minimize data transfer.
- Avoid full-table scans, as they are costly and slow.
Row Key Design:
- Keys should be designed to distribute the load evenly (avoid sequential keys that can create hotspots).
- Use hashing or salting techniques to mitigate potential bottlenecks (see the sketch below).
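A minimal sketch of a prefix scan and a salted row key using the google-cloud-bigtable Python client; the project, instance, table, and key names are placeholders, and the exact client API may differ by version.

import zlib
from google.cloud import bigtable

# Placeholder project/instance/table names for illustration only.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("sensor-data")

# Prefix-based range scan: all rows whose key starts with "device#17#".
for row in table.read_rows(start_key=b"device#17#", end_key=b"device#17#\xff"):
    print(row.row_key, row.cells)

# Salting a sequential key (e.g., a timestamp) to spread writes across tablets.
def salted_key(device_id: str, ts: int, buckets: int = 10) -> bytes:
    salt = zlib.crc32(device_id.encode()) % buckets   # stable bucket per device
    return f"{salt}#{device_id}#{ts}".encode()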

3
Q

Advantages and disadvantages of Bigtable

A

Advantages:
- Designed for millions of writes per second with sub-10ms latency.
- Single-row operations are strongly consistent.
- Column-family based design allows efficient storage even when many columns are empty.
- Simplifies horizontal scaling.

Disadvantages:
- No support for SQL joins or secondary indexes.
- Poor row key design can lead to hotspots and performance degradation.
- Only supports atomicity at the single-row level.

When to Use:
High-ingestion workloads, real-time analytics, IoT data, time-series data, and applications that require fast row lookups.

When Not to Use:
Workloads requiring complex multi-row transactions, multi-table joins, or rich SQL querying capabilities.

4
Q

What is HBase?

A

Apache HBase is an open-source, distributed, column-oriented data store built on top of the Hadoop Distributed File System (HDFS). Modeled after Google’s Bigtable, it provides real-time read/write access to large, sparse datasets.

5
Q

HBase vs HDFS

A

HDFS (Hadoop Distributed File System):
- Designed for batch processing of very large files.
- Optimized for high throughput but not for random or record-level access.
- Not suitable for small, incremental updates or random lookups.
HBase:
- Builds on HDFS by adding a database abstraction (tables, rows, and columns).
- Supports random, real-time read and write operations.
- Ideal for applications that require quick record lookups, updates, and small batch insertions.
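As an illustration of HBase’s random, record-level access (in contrast to HDFS’s file-oriented batch model), here is a small sketch using the happybase Python client; it assumes a running HBase Thrift gateway, and the host, table, and column-family names are made up.

import happybase

connection = happybase.Connection("hbase-host")   # hypothetical Thrift gateway host
table = connection.table("users")                 # assumes a 'users' table with family 'info'

# Random, real-time write of a single record.
table.put(b"user#42", {b"info:name": b"alice", b"info:email": b"alice@example.com"})

# Random, real-time read of that record by row key.
print(table.row(b"user#42"))

# Small range scan over a key prefix.
for key, data in table.scan(row_prefix=b"user#"):
    print(key, data)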

6
Q

What are HBase building blocks?

A

HFile: The basic on-disk storage format. Represents an immutable, sorted map from keys to values. Stored in HDFS and optimized for efficient lookups via block indexing.

MemStore: An in-memory write buffer that temporarily stores updates. Flushes its contents to disk as HFiles once certain thresholds are reached.

HRegion: A contiguous range of rows in an HBase table. Acts as the fundamental unit of data distribution and load balancing.

HRegionServer: The server process responsible for managing one or more HRegions. Handles all read and write requests for its assigned regions.

HBase Master: Oversees the cluster by assigning HRegions to region servers. Monitors region server health, manages schema changes, and performs load balancing.

ZooKeeper: Provides coordination and synchronization between HBase nodes. Maintains metadata (e.g., region assignments) and ensures that each HRegion is exclusively locked to one region server.

7
Q

What is HRegion assignment?

A

Dynamic Assignment: The HBase Master continuously monitors live HRegionServers (using ZooKeeper) and assigns HRegions to them. Each HRegion is exclusively managed by one HRegionServer at a time.

Load Balancing and Failover: Regions may be re-assigned if a server fails or if new servers are added. HRegions that grow too large can be split by the HRegionServer; conversely, small regions can be merged.

Coordination with ZooKeeper: ZooKeeper is used to maintain exclusive locks on regions, ensuring that no two servers serve the same region simultaneously.

8
Q

Why is caching necessary?

A

Caching is essential to reduce latency. Cached data (often stored in memory) can be delivered in sub-millisecond times compared to slower database queries. Improved responsiveness can significantly affect user experience and even sales, as small delays (e.g., 100–250 milliseconds) can lead to noticeable decreases in engagement and revenue.

9
Q

What is the principle of locality?

A

Temporal Locality:
The tendency of a processor or application to repeatedly access the same memory locations over a short time period.
Spatial Locality:
If a particular data item is accessed, nearby data items are likely to be accessed soon after.
Application Across Systems:
The idea of locality is universal, whether in CPU caches, operating system memory management (like paging and TLBs), or distributed systems such as web caching and CDNs.

10
Q

Caching in processors, OS, and distributed systems

A

Processors: Multi-level caches (L1, L2, L3) for instructions and data optimize the speed of CPU operations.

Operating Systems: Virtual memory systems use caches (e.g., TLBs, page caches) to speed up address translation and file access.

Distributed Systems: Web servers, CDNs, and reverse proxies (like Nginx or Varnish) use caching to serve content faster by reducing distance and load on the origin servers.

11
Q

Cache hit vs miss

A

Cache Hit: Occurs when the requested data is found in the cache.

Cache Miss: Happens when the data is not in the cache, forcing a retrieval from a slower source (e.g., a database) and subsequently updating the cache.

12
Q

Cache replacement policies

A

Least Recently Used (LRU): Evicts the entry that has gone unused the longest (the least recently used) to make room for new data.
FIFO (First-In, First-Out), Least Frequently Used (LFU), and time-aware LRU (which factors in expiration times) are also common strategies.
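A toy LRU cache to make the eviction rule concrete, using only the Python standard library; the capacity and keys are arbitrary.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()          # keeps keys in access order

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                             # "a" becomes most recently used
cache.put("c", 3)                          # evicts "b"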

13
Q

Cache writing/updating policies

A

Cache-Aside (Lazy Loading): The application checks the cache first; on a miss, it fetches data from the source, returns it, and then updates the cache.

Write-Through: Data is written to both the cache and the backing store simultaneously, ensuring consistency but adding latency.

Write-Back (Write-Behind): Data is written only to the cache initially and asynchronously flushed to the backing store later; this improves write performance but risks data loss if the cache fails before the flush.

Write-Around: Data is written directly to the backing store, bypassing the cache, which can be beneficial when data is not expected to be re-read soon.
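A hedged sketch contrasting write-through and write-back against a Redis cache; db_save is a hypothetical stand-in for the backing store, and the flush loop is greatly simplified.

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def db_save(key, value):
    """Hypothetical write to the backing database."""
    ...

def write_through(key, value):
    r.set(key, value)        # update the cache
    db_save(key, value)      # and the backing store in the same request

dirty = set()                # keys written to the cache but not yet persisted

def write_back(key, value):
    r.set(key, value)        # fast path: cache only
    dirty.add(key)           # remember to persist later

def flush_dirty():
    for key in list(dirty):  # asynchronous/batched flush (e.g., on a timer)
        db_save(key, r.get(key))
        dirty.discard(key)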

14
Q

Cache coherency mechanisms

A

Maintaining Consistency: In systems with multiple caches, mechanisms such as snooping (caches listening to each other) and directory-based protocols (a central manager for cache state) help keep data consistent.

Example Protocol: The MESI (Modified, Exclusive, Shared, Invalid) protocol is a widely used scheme in CPU caches to manage coherence.

15
Q

Basic properties of Memcached

A

Basic Properties: A simple, in-memory key-value store designed for speed. It does not support persistence; if a node fails, the cached data is lost.
Data Distribution: Memcached servers are unaware of each other; client libraries use consistent hashing to distribute keys across the cluster.
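A short sketch of that client-side key distribution with pymemcache’s HashClient; the node addresses are placeholders, and the client (not the servers) decides which node owns each key.

from pymemcache.client.hash import HashClient

# Hypothetical cluster nodes; the client hashes each key onto one of them.
client = HashClient([("10.0.0.1", 11211), ("10.0.0.2", 11211), ("10.0.0.3", 11211)])

client.set("user:42:name", "alice", expire=300)
print(client.get("user:42:name"))   # routed to the same node by the key hash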

16
Q

How does the Auto Discovery feature assist clients?

A

Provides a configuration endpoint that dynamically lists all active nodes in the cluster, allowing client libraries to adjust automatically to changes such as node additions or failures.

17
Q

Control flow for reading a key-value pair from Memcached

A

Check cache → On miss, fetch from the database → Update the cache with the new data.
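The same control flow as a small sketch using pymemcache; load_from_db is a hypothetical slow database lookup, and the host and TTL are arbitrary.

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def load_from_db(key):
    """Hypothetical database lookup."""
    ...

def read(key):
    value = cache.get(key)                   # 1. check the cache
    if value is None:                        # 2. miss: fall back to the database
        value = load_from_db(key)
        cache.set(key, value, expire=600)    # 3. populate the cache for next time
    return value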

18
Q

Replication groups in Redis

A

Redis clusters use node groups (or shards), where each shard has one primary node for reads/writes and multiple replica nodes for read scaling and high availability. Managed deployments support large numbers of shards per cluster (for example, 250 or more, with limits raised in newer versions).

19
Q

Redis vs Memcached

A

Redis: Supports a variety of data structures (strings, hashes, lists, sets, sorted sets, etc.), offers persistence, and has built-in replication and advanced features (pub/sub, atomic operations).
Memcached: Simpler and optimized solely for quick key-value retrieval without persistence.

20
Q

Cache writing/updating strategies ↔ scenarios

A

When to Use Cache-Aside: Best for read-heavy applications where caching only data that is actually read reduces unnecessary cache writes.

When to Use Write-Through: Ideal for write-heavy applications or where data consistency is critical, as every update is immediately reflected in both the cache and the source.

Trade-Offs: Lazy loading (cache-aside) can lead to stale data if the underlying data changes, while write-through can incur higher latency per write.

21
Q

Cache Sharding

A

Data Distribution: Sharding splits the data across multiple nodes (using techniques like consistent hashing) to handle large datasets and balance load.

Challenges: Re-sharding when nodes are added or removed can be complex, and uneven access patterns (e.g., “celebrity” data) can overload certain shards.
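A minimal consistent-hashing ring to illustrate how sharding maps keys to nodes and why only a fraction of keys move when a node is added or removed; real systems also add virtual nodes for smoother balance. The node names are made up.

import bisect, hashlib

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, h) % len(self.ring)  # first node clockwise from the key
        return self.ring[i][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.node_for("user:42"))   # the shard responsible for this key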

22
Q

How does TTL help improve the write-back strategy?

A

Purpose: TTL ensures that cached data expires after a set period, which helps prevent the cache from becoming stale.

Improving Write-Back Strategy: By expiring old data automatically, TTL minimizes the risk of serving outdated information and helps manage cache churn effectively.
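A minimal illustration of attaching a TTL to cached entries (Redis shown; the key name and 300-second lifetime are arbitrary), so that old entries expire automatically.

import redis

r = redis.Redis(decode_responses=True)
r.set("user:42:profile", '{"name": "alice"}', ex=300)  # entry expires after 5 minutes
print(r.ttl("user:42:profile"))                        # seconds remaining before eviction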

23
Q

What is Redis and what are its main properties?

A

Core Characteristics: Redis is an in-memory key-value store designed for speed, supporting various data types including strings, hashes, lists, sets, and sorted sets. It is written in C and utilizes a single-threaded, non-blocking I/O model to achieve high performance.

Persistence Options: Although primarily in-memory, Redis can periodically write its state to disk (checkpointing) to enable recovery with minimal data loss.

Advanced Features: Atomic operations (e.g., increment/decrement), pub/sub messaging, and support for complex data structures make Redis versatile for many applications.

24
Q

Redis use cases?

A

Session Storage: Ideal for maintaining user sessions in web applications due to its speed and in-memory data model.

Caching Dynamic Content: Frequently used to cache API responses, HTML pages, or any data that benefits from rapid access.

Real-Time Applications: Suitable for leaderboards, real-time analytics, and logging, where rapid updates and reads are crucial.

Temporary Data Storage: Effective for storing ephemeral data, temporary records, or any dataset where persistence is not the highest priority.
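Two of these use cases sketched with redis-py; the key names, session payload, and scores are made up.

import redis

r = redis.Redis(decode_responses=True)

# Session storage: the session expires automatically after 30 minutes.
r.setex("session:abc123", 1800, '{"user_id": 42, "cart": []}')

# Real-time leaderboard built on a sorted set.
r.zadd("leaderboard", {"alice": 120, "bob": 95, "carol": 150})
print(r.zrevrange("leaderboard", 0, 9, withscores=True))  # top 10 players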

25
Decoupling in cloud architecture
Definition & Importance: Decoupling means breaking down an application into individual components that communicate asynchronously. This separation allows each part to scale, update, or fail independently without affecting the entire system.
Benefits:
- Systems can easily expand or shrink in response to demand.
- Component failures are isolated, improving overall system resiliency.
- Different teams can work on separate components without tight integration constraints.
26
Message Queues and Decoupling
Role & Function: Message queues act as intermediaries between producers (which send messages) and consumers (which receive them). This buffering mechanism enables asynchronous communication and helps prevent system overload during traffic spikes.
27
Publish Subscribe model
Overview: Unlike the typical queue model (where each message is consumed by a single consumer), the publish–subscribe (pub–sub) model allows messages to be delivered to multiple subscribers.
Key Characteristics:
- Fan-out Architecture: A single published message can trigger notifications to several endpoints (such as SQS queues, Lambda functions, HTTP endpoints, email, SMS, or mobile push notifications).
- Push vs. Pull: In pub–sub systems like AWS SNS, messages are pushed to subscribers rather than being pulled.
28
How do queues achieve decoupling?
- Loose Coupling: Producers and consumers do not interact directly; instead, they exchange messages via the queue.
- Reliability: Even if one component is slow or fails, messages are safely stored until they can be processed.
29
What is the put-get-delete paradigm?
Put: Producers send (or “put”) messages into the SQS queue.
Get: Consumers pull (or “get”) messages from the queue. When a message is retrieved, it becomes temporarily invisible to other consumers.
Delete: Once a consumer has successfully processed the message, it sends a delete command to permanently remove the message from the queue.
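The paradigm sketched with boto3; the region, queue URL, and message body are placeholders. The WaitTimeSeconds argument also demonstrates long polling, covered in a later card.

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Put: the producer sends a message.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Get: the consumer receives it; WaitTimeSeconds=20 makes this a long poll.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print(msg["Body"])   # process the message here
    # Delete: remove the message only after successful processing.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])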
30
At-least-Once and Exactly-Once guarantee (in SQS)
At-Least-Once Delivery (Standard Queues): A message is delivered at least once. If a consumer fails to delete it (for example, due to processing errors), the message becomes visible again for reprocessing.
Exactly-Once Processing (FIFO Queues): FIFO (First-In, First-Out) queues aim to deliver each message exactly once with guaranteed order. This is achieved using content-based deduplication (using an SHA-256 hash of the message body). Note that exactly-once semantics are only achievable under limited conditions and come with trade-offs such as reduced throughput.
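For FIFO queues specifically, a send looks like this sketch (the .fifo queue URL and IDs are placeholders); the deduplication ID can be omitted when content-based deduplication is enabled on the queue.

import boto3

sqs = boto3.client("sqs")
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",  # placeholder
    MessageBody='{"order_id": 42}',
    MessageGroupId="customer-7",           # ordering is preserved within a group
    MessageDeduplicationId="order-42-v1",  # repeats of the same ID are dropped
)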
31
Short vs Long Pull
Short Polling: Consumers quickly check for new messages. If none are available, the call returns empty immediately, which can lead to frequent polling and increased costs.
Long Polling: Consumers can set a waiting period (up to 20 seconds) during which the request waits for a message to become available. This reduces the number of requests and helps control costs.
32
How does SQS deal with failures?
Visibility Timeout: When a consumer retrieves a message, it becomes invisible for a set period. If the consumer fails to process and delete the message within this timeout, it becomes visible again for other consumers.
Dead Letter Queue (DLQ): After a message fails processing a predetermined number of times, it can be moved to a DLQ. This allows for isolation and troubleshooting of problematic messages without blocking the main queue.
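A sketch of attaching a DLQ via a redrive policy with boto3; the queue URL and DLQ ARN are placeholders, and maxReceiveCount is the number of failed receives before a message is moved to the DLQ.

import boto3, json

sqs = boto3.client("sqs")
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
    Attributes={
        "VisibilityTimeout": "60",   # seconds a received message stays hidden
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-queue-dlq",
            "maxReceiveCount": "5",  # move to the DLQ after 5 failed attempts
        }),
    },
)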
33
Topics (SNS)
A topic is a communication channel to which publishers send messages. Subscribers register with topics to receive notifications. This setup supports a one-to-many delivery model where a single published message is pushed to all subscribed endpoints.
Push Model: Unlike SQS, SNS actively pushes messages to subscribers instead of waiting for consumers to poll.
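A minimal fan-out sketch with boto3; the topic name, email address, and SQS ARN are placeholders. One publish reaches every subscriber of the topic.

import boto3

sns = boto3.client("sns", region_name="us-east-1")
topic_arn = sns.create_topic(Name="order-events")["TopicArn"]

# Multiple subscribers; each receives every message pushed to the topic.
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="ops@example.com")
sns.subscribe(TopicArn=topic_arn, Protocol="sqs",
              Endpoint="arn:aws:sqs:us-east-1:123456789012:order-queue")  # placeholder ARN

sns.publish(TopicArn=topic_arn, Subject="Order placed", Message='{"order_id": 42}')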
34
General contents of SNS messages
Subject: A brief title or summary of the message.
Time to Live (TTL): Specifies how long a message is valid. If the message is not delivered within this time, it is dropped.
Payload: The main content of the message, which can be customized per endpoint (allowing for different formats depending on whether the subscriber is an email, HTTP endpoint, etc.).
35
What is Kafka (main components and characteristics) and why is it important?
Definition: Apache Kafka is a distributed event streaming platform designed for high-throughput, low-latency data pipelines and real-time processing. Its ability to decouple data producers from consumers and handle massive message volumes makes it ideal for modern microservices and big data architectures.
Characteristics:
- Horizontally scales by distributing topics across multiple brokers and partitions.
- Messages are stored in an immutable, append-only log with configurable replication to protect against failures.
- Supports both publish–subscribe (pub/sub) and queue-based messaging patterns, enabling asynchronous communication across systems.
- Optimized for handling millions of messages per second, making it suitable for real-time analytics, ETL pipelines, and data lake ingestion.
36
What are Kafka topics, partitions, and offsets, and how do they enable scalable message storage?
Topics: Logical channels that group related messages (e.g., “user_actions” or “inventory_updates”).
Partitions: Each topic is split into one or more partitions—each an ordered, immutable, append-only log. This design enables parallelism since different partitions can be hosted on different brokers.
Offsets: Each message within a partition is assigned a unique, sequential offset. Consumers use these offsets to track their reading progress. Offsets are critical for replayability: if a consumer fails or needs to reprocess data, it can restart from a known offset without data loss.
Scalability: This architecture supports scalable message storage by allowing multiple consumers to process data in parallel and by letting topics grow independently of consumer speed.
37
How do producers and consumers interact with Kafka?
Producers:
- Role: Applications that send (publish) messages to Kafka topics.
- Mechanism: Use the Producer API to send records and can control aspects like batching, compression, and acknowledgment settings.
- Partitioning: Producers can specify which partition a message should go to or let Kafka’s default strategies (e.g., round-robin or key-based hashing) determine the partition.
Consumers:
- Role: Applications that subscribe to topics and process incoming messages.
- Consumer Groups: Consumers typically work as part of a group so that each partition is processed by only one consumer at a time. This ensures load balancing and parallel processing.
- Offset Management: Kafka tracks consumer offsets (often stored in an internal topic) to ensure that each consumer resumes reading from the correct position.
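A hedged producer/consumer sketch using the kafka-python client; the broker address, topic name, and group ID are placeholders. Key-based sends keep all events for one user in the same partition.

import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("user_actions", key="user-42", value={"event": "click"})
producer.flush()

consumer = KafkaConsumer(
    "user_actions",
    bootstrap_servers="localhost:9092",
    group_id="analytics",                 # partitions are shared across this group
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    value_deserializer=lambda b: json.loads(b.decode()),
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
    consumer.commit()                     # record progress via the consumer offset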
38
Partitioning strategies for producers
Round-Robin: When no key is provided, messages are evenly distributed across partitions.
Hash-Based: When a key is provided, Kafka hashes the key to determine the partition—ensuring that messages with the same key (e.g., user ID) always go to the same partition, which helps preserve ordering.
Custom Partitioners: Developers can implement custom partitioning logic to meet specific requirements (e.g., grouping related events together or achieving better load balance).
Pros: Ensures parallelism, load distribution, and ordering when needed.
Cons: Using key-based partitioning may lead to “hot partitions” if a particular key generates a large volume of messages. Increasing partitions later may also break ordering guarantees for keys.
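The three strategies sketched with kafka-python’s send call; the broker, topic, partition count, and key are assumptions, and a stable hash stands in for custom partitioner logic.

import zlib
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")  # placeholder broker
NUM_PARTITIONS = 6  # assumed partition count for the (hypothetical) topic

# No key: the client spreads messages across partitions.
producer.send("user_actions", value=b'{"event": "view"}')

# Key-based: the same key always hashes to the same partition, preserving order.
producer.send("user_actions", key=b"user-42", value=b'{"event": "click"}')

# Custom logic: compute a partition yourself and pass it explicitly.
partition = zlib.crc32(b"user-42") % NUM_PARTITIONS
producer.send("user_actions", value=b'{"event": "purchase"}', partition=partition)
producer.flush()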
39
How does Kafka handle message retention? Different types (time-based, size-based) and the trade-offs
Time-Based Retention: Messages can be configured to be retained for a set period (e.g., 7 days).
Size-Based Retention: Retention can also be based on log size (e.g., 10 GB per partition).
Trade-Offs:
- Longer Retention: Enables reprocessing and serves use cases like auditing and analytics but requires more storage.
- Shorter Retention: Conserves storage but may limit the ability to replay events in case of consumer failure or when historical analysis is needed.
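A sketch of setting both retention types when creating a topic with kafka-python’s admin client; the broker, topic name, and counts are placeholders, and the exact admin API may vary by client version.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="user_actions",
        num_partitions=6,
        replication_factor=3,
        topic_configs={
            "retention.ms": "604800000",       # time-based: 7 days
            "retention.bytes": "10737418240",  # size-based: ~10 GB per partition
        },
    )
])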
40
Difference between at-least-once and exactly-once processing and its pros, cons
At-Least-Once Processing:
- Default Behavior: Guarantees that messages are never lost, but duplicates may occur in case of consumer reprocessing or network issues.
- Handling Duplicates: Requires idempotent consumer logic or external deduplication strategies (e.g., using an idempotency key stored in a cache or database).
Exactly-Once Processing:
- Mechanism: Utilizes Kafka transactions, idempotent producers, and Kafka Streams to ensure each message is processed only once—even in failure scenarios.
- Trade-Off: Provides stronger consistency and eliminates duplicates, but introduces additional coordination overhead, potentially reducing throughput.
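One common way to make at-least-once delivery safe is idempotent consumption. This sketch records an idempotency key in Redis so that redelivered messages are skipped; the key prefix, TTL, and handler are hypothetical.

import redis

r = redis.Redis(decode_responses=True)

def apply_business_logic(payload: dict):
    """Hypothetical effectful processing of the message."""
    ...

def handle(message_id: str, payload: dict):
    # SET with nx=True succeeds only the first time this message ID is seen.
    first_time = r.set(f"processed:{message_id}", 1, nx=True, ex=86400)
    if not first_time:
        return                    # duplicate delivery: already processed, skip
    apply_business_logic(payload)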
41
How do Kafka Streams enable real-time data processing?
Built on Kafka’s producer and consumer APIs, Kafka Streams allows you to process data in real time directly from Kafka. It is ideal for building applications that require real-time analytics, event-driven processing, or stateful transformations without needing a separate processing cluster.
Operations Supported:
- Stateless: Simple mappings, filtering, etc.
- Stateful: Windowing (tumbling, hopping, sliding), aggregations, and joins.
Fault Tolerance: Maintains local state with changelog topics for recovery.
42
Kafka vs RabbitMQ vs Amazon SQS
Kafka:
- Architecture: Log-based storage with partitioning and replication.
- Strengths: High throughput, low latency, built-in fault tolerance, and replay capabilities.
- Best For: Streaming data pipelines, event-driven architectures, and large-scale log aggregation.
RabbitMQ:
- Model: Traditional queue model with complex routing capabilities.
- Strengths: Excellent for flexible routing and scenarios requiring complex message patterns.
- Best For: Asynchronous workloads that need sophisticated routing.
Amazon SQS:
- Model: Fully managed, serverless message queue service with simplicity in mind.
- Strengths: Removes infrastructure management and is easy to integrate with AWS services.
- Best For: Smaller-scale or simpler use cases where managed service benefits are desired.
43
Common use cases for Kafka
Event-Driven Architectures: Decouple microservices by streaming events between them.
Real-Time Analytics: Build dashboards and monitoring systems that process and analyze data on the fly.
ETL Pipelines: Extract, transform, and load data into data lakes or warehouses continuously.
Data Lake Ingestion: Stream data from various sources (applications, sensors, IoT devices) into a centralized repository.
Log Aggregation & Monitoring: Collect and process log data for system monitoring and troubleshooting.
44
What is managed Kafka in the cloud (AWS MSK, Azure Event Hubs) and why is it important?
Managed Kafka services (e.g., AWS MSK) and Kafka-compatible or analogous managed streaming services (e.g., Azure Event Hubs, Google Cloud Pub/Sub) provide a cloud-based deployment where infrastructure management, scaling, and maintenance are handled by the provider.
Key Benefits:
- Simplified Deployment: No need to set up and manage Kafka clusters manually.
- Automated Scaling & Monitoring: Cloud services automatically adjust resources based on workload.
- High Availability & Security: Built-in replication, multi-AZ deployments, and robust security features (e.g., TLS, IAM integration).
- Cost Efficiency: Pay-as-you-go pricing and reduced operational overhead.
45
How to optimize Kafka deployments (partitioning strategies, monitoring consumer lag, and integrating with cloud services)?
Partitioning Strategy:
- Right Sizing: Plan the number of partitions based on expected load and consumer counts.
- Balancing Act: More partitions increase parallelism but add overhead; too few may limit throughput.
Monitoring Consumer Lag:
- Why It Matters: High consumer lag can indicate bottlenecks or processing delays.
- Tools: Use built-in metrics and external monitoring (e.g., CloudWatch, Prometheus) to keep track of performance.
Integration with Cloud Services:
- Managed Services: Leverage managed Kafka offerings to reduce operational complexity.
- Configuration Tuning: Adjust producer batch sizes, compression, consumer fetch sizes, and JVM settings to maximize performance.
Best Practices:
- Regularly test and adjust your partitioning strategy in staging environments before rolling changes to production.
- Balance throughput with latency requirements, especially when using key-based partitioning which can affect ordering.
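A hedged sketch of measuring consumer lag with kafka-python; the broker, topic, and group ID are placeholders. Lag is the gap between the latest offset in each partition and the group's committed offset.

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="analytics")
partitions = [TopicPartition("user_actions", p)
              for p in consumer.partitions_for_topic("user_actions")]

end_offsets = consumer.end_offsets(partitions)   # latest offset per partition
for tp in partitions:
    committed = consumer.committed(tp) or 0      # group's last committed offset
    print(tp.partition, "lag =", end_offsets[tp] - committed)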