Data Engineering Flashcards
Amazon Kinesis Data Streams
Collect and store streaming data in real time
1. Retention up to 365 days (data can’t be deleted until it expires)
2. Data ordering guarantee
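The ordering guarantee applies per shard: records that share a partition key land in the same shard and keep their order there. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. A minimal sketch of that routing, assuming the shards split the hash space evenly (real ranges can be uneven after splits and merges); the function name is hypothetical:

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: shards divide the hash key space into equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


# The same key always maps to the same shard, which is what preserves order.
first = shard_for_key("truck-42", 4)
second = shard_for_key("truck-42", 4)
```

Choosing a partition key with many distinct values spreads records across shards; a single hot key sends everything to one shard.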
Kinesis Data Streams - Capacity Modes
Provisioned mode:
- Choose number of shards
- Each shard ingests up to 1 MB/s or 1000 records per second
- Scale manually
- Pay per shard provisioned per hour
On-demand mode:
- Default capacity provisioned (4 MB/s or 4000 records per second)
- Scale based on observed throughput peaks
- Pay per stream per hour & data in/out
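For provisioned mode, the shard count can be estimated from expected write traffic. The sketch below is only a rough sizing aid: it assumes the per-shard write limits of 1000 records per second and 1 MB/s (treated here as 10**6 bytes), ignores read throughput, and uses a hypothetical function name.

```python
import math

SHARD_WRITE_RECORDS_PER_SEC = 1000
SHARD_WRITE_BYTES_PER_SEC = 1_000_000  # assumption: 1 MB/s counted as 10**6 bytes


def shards_needed(records_per_sec: float, avg_record_bytes: float) -> int:
    """Estimate shards from write traffic; whichever limit binds first wins."""
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC
    )
    return max(1, by_records, by_bytes)


small_records = shards_needed(2500, 200)   # bound by the record-count limit
large_records = shards_needed(500, 5000)   # bound by the byte limit
```

Both limits have to be checked because many small records hit the record-count limit first, while fewer large records hit the byte limit first.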
Amazon Data Firehose - AWS Destinations
- Amazon S3: Supports compression
- Amazon Redshift (copy through S3)
- Amazon OpenSearch
Amazon Data Firehose
- Receive records (up to 1MB) from producers
- Can transform data with Lambda functions
- Writes to destinations in batches based on buffer time and size (near real time)
- All data or only failed data can be backed up to S3
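A transformation Lambda receives a batch of base64-encoded records and must return one entry per `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal sketch, assuming the producers send JSON objects; the `processed` field is a made-up transformation for illustration:

```python
import base64
import json


def lambda_handler(event, context):
    """Tag each JSON record; mark unparseable records as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical transformation
            line = json.dumps(payload) + "\n"  # newline-delimited output
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            result = "Ok"
        except (ValueError, TypeError):
            # Not valid base64/JSON, or not a JSON object: return it unchanged.
            data = record["data"]
            result = "ProcessingFailed"
        output.append(
            {"recordId": record["recordId"], "result": result, "data": data}
        )
    return {"records": output}
```

Records returned as `ProcessingFailed` are the ones the failed-data backup in S3 is meant to capture.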
Firehose Buffer Sizing
Firehose accumulates records in a buffer, which is flushed when either the time limit or the size limit is reached
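The flush rule can be illustrated with a small simulation. This is a toy model, not Firehose's implementation: the class name and parameters are hypothetical, and the time rule is only checked when a record arrives, whereas the real service flushes on its own timer.

```python
from typing import List, Optional


class BufferSketch:
    """Toy buffer that flushes when a size or time threshold is reached."""

    def __init__(self, max_bytes: int, max_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._records: List[bytes] = []
        self._size = 0
        self._opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record at time `now`; return the batch if a rule fires."""
        if self._opened_at is None:
            self._opened_at = now  # the clock starts with the first record
        self._records.append(record)
        self._size += len(record)
        size_hit = self._size >= self.max_bytes
        time_hit = now - self._opened_at >= self.max_seconds
        if size_hit or time_hit:
            batch = self._records
            self._records, self._size, self._opened_at = [], 0, None
            return batch
        return None


buf = BufferSketch(max_bytes=10, max_seconds=60)
r1 = buf.add(b"12345", now=0)     # 5 bytes buffered, nothing flushed
r2 = buf.add(b"123456", now=1)    # 11 bytes >= 10, size rule fires
r3 = buf.add(b"a", now=100)       # new buffer opens at t=100
r4 = buf.add(b"b", now=161)       # 61 s elapsed, time rule fires
```

High-throughput streams tend to hit the size rule, while low-throughput streams tend to hit the time rule.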
Amazon Managed Service for Apache Flink
Framework for processing data streams. Can’t read from Amazon Data Firehose.
Amazon Managed Streaming For Apache Kafka
Amazon MSK creates & manages Kafka broker nodes.
* Deployed in your VPC, multi-AZ
* Data is stored on EBS volumes
Kinesis Data Streams vs Amazon MSK
Kinesis Data Streams:
* 1 MB message size limit
* 12 months maximum retention
* Shard splitting and merging
Amazon MSK:
* Configure for bigger messages
* No retention limit
* Can only add partitions to a topic
AWS Batch
Run batch jobs as Docker images
AWS Batch - Multi node mode
Leverages multiple EC2 / ECS instances. One main node and multiple child nodes. Doesn’t work with Spot Instances.
Amazon Elastic MapReduce
EMR creates Hadoop clusters in a single AZ to analyze and process vast amounts of data
EMR File System
EMRFS stores persistent data in Amazon S3 while providing data encryption
EMR node types
- Master node: Manages the cluster
- Core node: Runs tasks and stores data
- Task node: Only runs tasks, usually on Spot Instances
EMR instance configuration
- Uniform instance groups: Select a single instance type and purchasing option for each node type. Has auto scaling.
- Instance fleet: Select target capacity, mix instance types and purchasing options. No auto scaling.
AWS Glue
Managed extract, transform and load service