Data Pipelines on Cloud: Streaming Flashcards
Log
Append-only, ordered data structure; applications reading from it can ignore the details of the source systems
Unified log
collect events from many source systems, enable applications to operate on these event streams as they wish
events automatically deleted after certain time, read once
Distributed Unified Log
Log lives across a cluster of machines
Good for scalability and durability (replication)
Ordered events
Events in a shard (partition of unified log) have sequential IDs unique to their shard, meaning the ordering is local
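A minimal in-memory sketch of shard-local ordering, for illustration only. The class name ShardedLog and its methods are hypothetical, not part of any real library; the point is that each shard hands out its own sequential IDs, so two events in different shards can share the same ID and have no defined order relative to each other.

```python
class ShardedLog:
    """Toy append-only log split into shards (illustrative, not a real API)."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]

    def append(self, shard_id, event):
        shard = self.shards[shard_id]
        seq = len(shard)              # next sequential ID, local to this shard
        shard.append((seq, event))    # events are only ever appended
        return seq

    def read(self, shard_id, from_seq=0):
        # Readers see events in the order they were appended to this shard
        return self.shards[shard_id][from_seq:]


log = ShardedLog(num_shards=2)
a = log.append(0, "signup")
b = log.append(1, "login")
c = log.append(0, "purchase")
```

Here a and b are both 0 even though they are different events: the IDs only order events within one shard.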
Single-Event processing
Single event produces zero or more events
Multiple-event processing
Multiple events collectively produce zero or more events.
Examples: aggregating events, pattern matching, reordering
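The two processing styles can be sketched side by side. This is a hedged illustration: the event shape (a dict with a "type" field) and the function names are assumptions made for the example, not something the cards define.

```python
from collections import Counter


def single_event(event):
    """Single-event processing: one event in, zero or more events out."""
    if event["type"] != "purchase":
        return []                                  # filtered out: zero events
    return [{"type": "revenue", "amount": event["amount"]}]


def multiple_event(events):
    """Multiple-event processing: aggregate a batch into summary events."""
    counts = Counter(e["type"] for e in events)
    return [{"type": "count", "of": t, "n": n} for t, n in sorted(counts.items())]


events = [
    {"type": "view"},
    {"type": "purchase", "amount": 30},
    {"type": "view"},
]
derived = [out for e in events for out in single_event(e)]
summary = multiple_event(events)
```

single_event needs no state between calls, while multiple_event has to see a whole group of events (here a batch; in a real stream, typically a window) before it can emit anything.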
Amazon Kinesis Data Streams
Real-time streaming service for ingesting and processing large data streams. A stream is composed of shards and is scaled by splitting or merging shards
Shard
Unit of capacity in Kinesis Data Streams.
Each shard provides 1 MB/s (or 1,000 records per second) of data input and 2 MB/s of data output
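A rough sizing sketch based on the per-shard write capacity. It assumes two write limits per shard, 1 MiB/s and 1,000 records per second, and ignores read throughput; the function name shards_needed and the constants are hypothetical names chosen for this example.

```python
import math

WRITE_BYTES_PER_SHARD = 1024 * 1024    # assumed write limit: 1 MiB/s per shard
WRITE_RECORDS_PER_SHARD = 1000         # assumed write limit: 1,000 records/s per shard


def shards_needed(records_per_sec, avg_record_bytes):
    """Estimate how many shards the write side of a stream needs."""
    by_bytes = math.ceil(records_per_sec * avg_record_bytes / WRITE_BYTES_PER_SHARD)
    by_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    # Whichever limit is hit first decides the shard count
    return max(1, by_bytes, by_count)
```

For example, 5,000 records per second at 2 KiB each is bound by bytes (10 shards), while 3,000 small 100-byte records per second is bound by the record count (3 shards).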
Partition Key
User-defined key that determines how records are distributed across shards in Kinesis stream (load balancing support)
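A sketch of how a partition key can be turned into a shard choice. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below assumes the shards split the hash space into equal, contiguous ranges, which is a simplification; hash_key and shard_for are hypothetical helper names.

```python
import hashlib

HASH_SPACE = 2 ** 128    # MD5 produces a 128-bit value


def hash_key(partition_key):
    """Map a partition key to an integer in [0, 2**128)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key, num_shards):
    """Pick a shard index, assuming num_shards equal contiguous hash ranges."""
    return hash_key(partition_key) * num_shards // HASH_SPACE
```

Records with the same partition key always land in the same shard, which is why a key with few distinct values (or one very hot value) leads to uneven load.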
Re-sharding
Process of splitting or merging shards to scale a data stream either up or down
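Re-sharding can be pictured as arithmetic on hash key ranges. The sketch below models a shard as an inclusive (start, end) range and splits it at the midpoint; the even split and the function names are assumptions made for illustration, and no AWS API is called.

```python
def split_shard(start, end):
    """Split an inclusive hash key range into two adjacent halves."""
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)


def merge_shards(left, right):
    """Merge two ranges; only adjacent ranges can be combined."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent ranges can be merged")
    return (left[0], right[1])


whole = (0, 2 ** 128 - 1)
low, high = split_shard(*whole)
```

Splitting adds capacity for a hot range at the cost of one more shard; merging two adjacent, under-used shards does the reverse.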
Data Blob
The actual data being streamed through Kinesis
Sequence Number
Number assigned to each record by the shard to maintain order within shard
Retention period
Length of time records stay readable in a Kinesis stream before being deleted; 24 hours by default, extendable up to 365 days
Kinesis Data Firehose
Service that automatically delivers streaming data to AWS destinations such as S3, Redshift, and Elasticsearch (now OpenSearch Service)
AWS Lambda
Serverless compute service (FaaS) used to build modular, event-driven back-end systems. It can process streaming data with no server infrastructure to manage: it scales automatically with the size of the data stream, and you pay only for compute time
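A minimal sketch of a Lambda function that consumes a Kinesis stream. When Lambda is triggered by Kinesis, each record's data blob arrives base64-encoded under Records[i]["kinesis"]["data"]. The assumption that producers write JSON payloads, the handler's return shape, and the hand-built sample event are all illustrative choices, not part of the cards.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and collect the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])   # data blob bytes
        payloads.append(json.loads(raw))                    # assumes JSON payloads
    return {"processed": len(payloads), "payloads": payloads}


# Local smoke test with a hand-built event (only the fields the handler reads)
sample = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "user-42",
                "data": base64.b64encode(json.dumps({"type": "login"}).encode()).decode(),
            }
        }
    ]
}
result = handler(sample, None)
```

Because the handler is a plain function, it can be exercised locally like this before being deployed behind a Kinesis trigger.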