Data Pipelines on cloud Streaming Flashcards
Log
Append-only data structure, applications ignore details of source
Unified log
collect events from many source systems, enable applications to operate on these event streams as they wish
events automatically deleted after certain time, read once
Distributed Unified Log
Log lives across a cluster of machines
Good for scalability and durability (replication)
Ordered events
Events in a shard (partition of unified log) have sequential IDs unique to their shard, meaning the ordering is local
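A minimal sketch of shard-local ordering, assuming an in-memory toy log (the `ShardedLog` class and its methods are hypothetical names for illustration, not part of any real service). Each shard keeps its own counter, so IDs repeat across shards and only order events within one shard.

```python
from collections import defaultdict
from itertools import count


class ShardedLog:
    """Toy unified log in which every shard numbers its own events."""

    def __init__(self):
        self.shards = defaultdict(list)      # shard id -> list of (seq, event)
        self._next_seq = defaultdict(count)  # shard id -> independent counter

    def append(self, shard_id, event):
        # The sequence ID comes from the shard's own counter, so it is
        # unique and increasing only within that shard.
        seq = next(self._next_seq[shard_id])
        self.shards[shard_id].append((seq, event))
        return seq


log = ShardedLog()
first = log.append(0, "a")   # shard 0
second = log.append(1, "b")  # shard 1 starts its own numbering
third = log.append(0, "c")   # shard 0 continues from its last ID
```

Comparing `first` and `second` says nothing about which event happened earlier, because they belong to different shards.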
Single-Event processing
Single event produces zero or more events
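As a rough illustration of single-event processing, the sketch below handles one event at a time and returns a list with zero, one, or two output events. The event fields (`user_id`, `type`) and the function name are assumptions made up for the example.

```python
def process_single(event):
    """Validate and enrich one event without looking at any other event."""
    if "user_id" not in event:
        return []  # invalid event: zero output events
    enriched = dict(event, validated=True)
    if event.get("type") == "purchase":
        # One input event can fan out into several output events.
        follow_up = {"type": "receipt_requested", "user_id": event["user_id"]}
        return [enriched, follow_up]
    return [enriched]
```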
Multiple-event processing
multiple events collectively produce zero or more events.
Aggregate events, pattern match, reorder
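One possible form of multiple-event processing is aggregation over fixed time windows. The sketch below assumes each event carries a numeric `ts` (seconds) and a `type` field; both field names and the function name are hypothetical.

```python
from collections import Counter


def count_by_type(events, window_seconds=60):
    """Group events into fixed windows and count event types per window."""
    windows = {}
    for event in events:
        # Round the timestamp down to the start of its window.
        start = event["ts"] - event["ts"] % window_seconds
        windows.setdefault(start, Counter())[event["type"]] += 1
    return [
        {"window_start": start, "counts": dict(counts)}
        for start, counts in sorted(windows.items())
    ]


summary = count_by_type([
    {"ts": 5, "type": "click"},
    {"ts": 30, "type": "click"},
    {"ts": 65, "type": "view"},
])
```

Here three input events collectively produce two output events, one per window.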
Amazon Kinesis Data Streams
Real-time streaming service for ingesting and processing large data streams. A stream is composed of shards and is scaled by splitting or merging shards
Shard
Unit of capacity in Kinesis Data Streams.
Each shard provides 1 MB/s of data input (or 1,000 records per second) and 2 MB/s of data output
Partition Key
User-defined key that determines how records are distributed across the shards of a Kinesis stream (supports load balancing)
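Kinesis hashes the partition key with MD5 into a 128-bit value and routes the record to the shard whose hash key range contains it. The sketch below imitates that idea under simplifying assumptions: shards own equal-sized ranges, the digest is read as a big-endian integer, and `shard_for` is a hypothetical helper rather than an AWS API.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits wide


def shard_for(partition_key, num_shards):
    """Map a partition key to a shard index, assuming equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Scale the hash into the range [0, num_shards).
    return hash_value * num_shards // HASH_SPACE
```

The same key always maps to the same shard, which is what keeps all records for one key in order; a key that is too popular can therefore overload a single shard.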
Re-sharding
Process of splitting or merging shards to scale a data stream either up or down
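Re-sharding can be pictured as arithmetic on hash key ranges. In this sketch a shard is just an inclusive `(start, end)` tuple; the function names are hypothetical and the real service does this through its own API calls.

```python
def split_shard(shard):
    """Split one inclusive (start, end) hash range into two adjacent halves."""
    start, end = shard
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)


def merge_shards(left, right):
    """Merge two adjacent hash ranges into one (scaling down)."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent shards can be merged")
    return (left[0], right[1])


low, high = split_shard((0, 99))   # scale up: one shard becomes two
restored = merge_shards(low, high) # scale down: two shards become one
```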
Data Blob
The actual data being streamed through Kinesis
Sequence Number
Number assigned to each record by the shard to maintain order within shard
Retention period
Length of time (24 hours by default, extendable up to 365 days) that records stay readable in a Kinesis stream before being deleted
Kinesis Data Firehose
Service that automatically delivers streaming data to AWS services such as S3, Redshift, and Elasticsearch
AWS Lambda
Serverless computing (FaaS), used to build modular back-end systems, can process streaming data without need to manage any server infrastructure (scales automatically with size of data stream, paying only for compute time, event-driven)
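A minimal sketch of a Python Lambda function triggered by a Kinesis stream. Lambda passes a batch of records whose data blobs arrive base64-encoded under `Records[].kinesis.data`; the assumption here is that each blob is a JSON document, and the function name `handler` and the return value are choices made for the example.

```python
import base64
import json


def handler(event, context):
    """Decode every Kinesis record in the batch and report how many were read."""
    decoded = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))  # assumes JSON payloads
    # Real processing (filtering, enrichment, writing elsewhere) would go here.
    return {"processed": len(decoded)}
```

The function holds no state between invocations, which is what lets the platform run as many copies in parallel as the stream requires.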
What are some of the operational and administrative activities that Lambda takes care of for you so you can focus on only your code?
Balances memory and CPU
Provisioning Capacity
Monitoring fleet health
Applying security patches
FaaS
Write single-purpose stateless functions
(each function does just one thing, which is why it is modular)
Data Pipeline Pattern
Reusable architectural solution to a recurring problem in data pipeline design
Command Pattern
The Command Pattern encapsulates a request as an object, allowing it to be executed later or passed around in the system without knowing when or how it will be executed.
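A small sketch of the idea, with a hypothetical `AppendCommand` class: the request (append an item to a list) is wrapped in an object, queued, and executed later by code that knows nothing about what the command does.

```python
class AppendCommand:
    """Encapsulates the request 'append item to target' as an object."""

    def __init__(self, target, item):
        self.target = target
        self.item = item

    def execute(self):
        self.target.append(self.item)


output = []
pending = [AppendCommand(output, "a"), AppendCommand(output, "b")]
before = list(output)  # nothing has run yet, so this is still empty

# The invoker only needs to know that every command has execute().
for command in pending:
    command.execute()
```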
Pipes and filters pattern
It decomposes a complex process into a sequence of smaller, manageable steps (filters) connected by pipes, where each filter processes and passes data to the next
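One way to sketch pipes and filters in Python is with generators: each generator function is a filter, and passing one generator into the next plays the role of the pipe. The log-line format and the function names are assumptions for the example.

```python
def parse(lines):
    """Filter 1: turn 'LEVEL,message' lines into [level, message] pairs."""
    for line in lines:
        yield line.strip().split(",")


def keep_errors(rows):
    """Filter 2: pass through only the messages of ERROR rows."""
    for level, message in rows:
        if level == "ERROR":
            yield message


def to_upper(messages):
    """Filter 3: normalize messages to upper case."""
    for message in messages:
        yield message.upper()


def pipeline(source):
    # Each filter consumes the output of the previous one lazily.
    return to_upper(keep_errors(parse(source)))


result = list(pipeline(["INFO,started\n", "ERROR,disk full\n"]))
```

Because each filter only depends on the shape of its input, filters can be reordered, replaced, or tested on their own.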
Messaging Pattern
It decouples different parts of a system by using a queue to send and receive messages, allowing independent processing and handling of tasks.
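A minimal in-process sketch using the standard library's `queue.Queue`: the producer and the consumer share only the queue, not direct references to each other. A real pipeline would typically use a managed message service instead; the `None` sentinel used to stop the consumer is a convention chosen for this example.

```python
import queue
import threading


def consumer(inbox, results):
    """Take messages off the queue until the stop sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more messages
            break
        results.append(message * 2)


inbox = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(inbox, results))
worker.start()

# The producer only talks to the queue, never to the consumer directly.
for number in (1, 2, 3):
    inbox.put(number)
inbox.put(None)
worker.join()
```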
Priority queue pattern
It allows for tasks or messages to be prioritized, ensuring that high-priority items are processed first, while lower-priority tasks wait.
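A minimal sketch using the standard library's `heapq`, assuming that a lower number means a higher priority. The `PriorityTaskQueue` class is a hypothetical name; the insertion counter keeps tasks with equal priority in first-in, first-out order.

```python
import heapq
from itertools import count


class PriorityTaskQueue:
    """Returns the task with the lowest priority number first."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker for equal priorities

    def put(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._order), task))

    def get(self):
        return heapq.heappop(self._heap)[2]


tasks = PriorityTaskQueue()
tasks.put(2, "report")
tasks.put(0, "alert")
tasks.put(2, "cleanup")
served = [tasks.get(), tasks.get(), tasks.get()]
```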