Data Engineering Flashcards
Kinesis - Data Streaming
- Managed, high-scale, real-time
- Synchronously replicated across 3 AZs
Kinesis Streams
Low-latency streaming ingest at scale
- Not serverless: shards have to be provisioned in advance
- Need custom code for producers/consumers
- Data retention 24 hrs (default), up to 7 days; for longer storage, use KDF (Kinesis Data Firehose) to write to S3
- Can replay/reprocess
- Multiple applications can consume from same stream
- Once inserted cannot be deleted (immutable)
Kinesis Analytics
Perform real-time analytics on streams using SQL
Kinesis Firehose
Load streams into S3, Redshift, Elasticsearch, Splunk ONLY
Kinesis Streams Shards
- More shards = more throughput (better scaling)
- One stream made of many shards
- Billing per shards
- Batching is supported
- Add and remove shards at any time (resharding)
- Records are ordered per shard, but not across shards
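A minimal Python sketch of why ordering holds per shard but not across shards: Kinesis routes each record by an MD5 hash of its partition key over a 128-bit key space. This model assumes the hash-key range is split evenly across shards (real shards own explicit hash-key ranges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, mimicking Kinesis's
    MD5 hash over a 128-bit key space, split evenly across shards
    (simplified model)."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * num_shards // 2**128
```

Records sharing a partition key always map to the same shard, which is why ordering is guaranteed per shard but not across shards.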
Kinesis Producers
- AWS SDK: simple producer
- Kinesis Producer Library (KPL): application level, batching, compression, retries; supports Java, C++
- Kinesis Agent: instance level, sends log files directly
All can send directly to either:
- Kinesis Streams
- Kinesis Firehose
Kinesis Consumers
- AWS SDK, simple consumer
- Lambda
- Kinesis Client Library (KCL)
- Checkpointing, coordinated reads
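For the Lambda consumer above, a minimal handler sketch: Kinesis delivers each record's payload base64-encoded inside the event (field names follow the documented Kinesis event structure; JSON payloads assumed):

```python
import base64
import json

def handler(event, context):
    """Decode and parse each Kinesis record in a Lambda invocation.
    Payloads arrive base64-encoded under record["kinesis"]["data"]."""
    messages = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        messages.append(json.loads(payload))
    return messages
```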
Kinesis Producer Limits
- 1 MB/s or 1,000 records/s throughput PER shard
- Otherwise ProvisionedThroughputExceededException
Kinesis Consumer Limits
Classic:
- 2 MB/s read PER shard, shared across all consumers
- 5 GetRecords API calls per second PER shard, across all consumers
- ~200ms latency
Enhanced Fan-Out:
- 2 MB/s read PER shard PER enhanced consumer
- No API calls needed (push model)
- ~70ms latency
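The difference between the two modes comes down to per-consumer read bandwidth; a simple model of the limits above:

```python
def per_consumer_read_mbps(num_consumers: int, enhanced_fan_out: bool) -> float:
    """Classic: 2 MB/s per shard is shared by all consumers.
    Enhanced fan-out: each registered consumer gets its own 2 MB/s."""
    return 2.0 if enhanced_fan_out else 2.0 / num_consumers
```

With 4 classic consumers, each effectively reads 0.5 MB/s per shard; with enhanced fan-out, every consumer keeps the full 2 MB/s.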
Kinesis Firehose
- Managed, auto-scaling, serverless
- Near real-time (60-second minimum latency for non-full batches)
- Supports many data formats; conversions, transformations, and compression via Lambda
- No data stored, no replay
Kinesis Firehose Billing
- Pay for the amount of data going through
Kinesis Firehose Use Case
- To go to Redshift, have to output to S3 bucket then copy to Redshift from it
- Can send to Kinesis Data Analytics
- Can also store in another S3 bucket:
- Source records
- Transformation Failures
- Delivery Failures
Kinesis Firehose Buffer
- Flushed based on time and size rules
- High throughput, size buffer will be hit
- Low throughput, time buffer will be hit
- Can automatically scale the buffer during high throughput
- If real-time flush from streams to S3 needed, use lambda
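The flush rules above can be sketched as a tiny buffer model (illustrative only; real Firehose buffering hints are configured on the delivery stream):

```python
import time

class DeliveryBuffer:
    """Flush when either the size hint or the interval hint is hit,
    whichever comes first (a simplified Firehose-style buffer)."""

    def __init__(self, max_bytes: int, max_seconds: float, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.records = []
        self.started = clock()

    def add(self, record: bytes) -> bool:
        """Buffer a record; return True if this add triggered a flush."""
        self.records.append(record)
        size = sum(len(r) for r in self.records)
        age = self.clock() - self.started
        if size >= self.max_bytes or age >= self.max_seconds:
            self.flush()
            return True
        return False

    def flush(self):
        # Deliver self.records to the destination here (omitted).
        self.records = []
        self.started = self.clock()
```

High throughput fills the size hint first; a trickle of records sits until the time hint expires, which is why Firehose is near real-time rather than real-time.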
Kinesis Analytics Use Case
- Can have both KStreams and KFirehose as inputs
- Reference Data - optional static reference table
- SQL to aggregate
- Produces output and error streams
- Output stream KStream or KFirehose
Kinesis Analytics
- Serverless
- Only pay for resources consumed (not cheap)
- Use IAM to access streaming sources and destinations
- SQL or Flink
- Schema discovery
- Lambda for pre-processing
Streaming 3000 messages of 1KB per second
Possible Architectures
- Kinesis Data Streams -> Lambda - Cheaper
- DDB + DDB Streams -> Lambda - Expensive
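Back-of-the-envelope shard sizing for the 3,000 × 1 KB messages/s workload, using the per-shard producer limits (1 MB/s and 1,000 records/s). This considers ingest only, not consumer egress:

```python
import math

def shards_needed(records_per_sec: int, record_kb: float) -> int:
    """Minimum shards to absorb the ingest, given per-shard limits
    of 1 MB/s and 1,000 records/s (egress ignored in this sketch)."""
    by_size = records_per_sec * record_kb / 1024   # MB/s required
    by_count = records_per_sec / 1000              # record-rate multiple
    return math.ceil(max(by_size, by_count))
```

3,000 records/s at 1 KB is only ~2.9 MB/s, but 3× the record-rate limit, so the record count drives the answer: 3 shards.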
AWS Batch
- Serverless, but pay for underlying EC2 or Spot instances
- Dynamic provisioning of instances based on requirements
- Single job per container, not coordinated by default
AWS Batch Use Cases
- Batch process of images
- Running thousands of concurrent jobs
AWS Batch Triggers
- Cron job (or any other) through CW events
- Step Functions to orchestrate batch jobs
- S3 Event calling lambda that makes API call
- SDK
Lambda vs Batch
Lambda:
- Limited execution time
- Limited runtimes
- Limited temp disk space
- Serverless
Batch:
- No time limit
- Any runtime packaged in Docker image
- Storage EBS, Instance Store defined
- Not fully managed (EC2-ECS)
Lambda Limitations
- 512 MB of ephemeral storage mounted at /tmp
- Deployment package size: 50 MB (default)
- Memory range: 128 MB to 1536 MB
- Maximum execution timeout: 15 minutes
- Request and response body payload size: 6 MB max
Batch Compute Environments
Managed
- Capacity and instance types within the env, eg auto-scaling
- Can choose on-demand or spot
- Can set max spot price, min/max vCPU
- Launched within your own VPC
- If launched within your own private subnet, make sure it has access to the ECS service:
- Through a NAT Gateway/Instance or a VPC Endpoint (VPCE) for ECS
Unmanaged
- Control and manage instance configuration, provisioning and scaling
AWS Batch - Multi Node
- Large scale, HPC, tightly coupled workloads
- Multiple EC2/ECS instances, but single job
- Coordinated, eg parent and child nodes
- Does not work with Spot Instances
- Works better with a cluster placement group for EC2 launch, for low-latency networking
- Highly parallel
- No need to launch, configure, and manage Amazon EC2 resources directly.
Elastic Map Reduce
- Creates Hadoop Clusters, configuration & provisioning
- Good for migrating on-prem clusters
- 100s of EC2 instances, do workload -> shutdown, save $
- Auto-scaling using CW
EMR Integrations
- Launched within VPC, single AZ, for better latency
- DDB through Hive
- EBS for temp storage
- EMRFS (S3) permanent storage, server-side encryption
EMR Cost Optimisation - Purchasing Options
- On-demand
- Reserved
- Spot
- One cluster many jobs, with auto scaling
- One cluster per job
- Uniform instance groups - select single instance type and purchasing option for each node (Auto-scaling)
- Instance fleet - select target capacity, mix instance types and purchasing options (No auto-scaling)
EMR Node Types
- Master - manage cluster, orchestrate, manage health
- Core - run tasks, store data
- Task - run tasks (optional)
Running Jobs on AWS - Use Cases
- EC2, cron job - long running, not HA, not scaleable, hard to monitor
- Scheduled CW event + lambda - more scalable, more visibility, limited by runtime and execution time
- Reactive CW event/ S3 event/API Gateway/SQS/SNS + lambda
- CW Events + Batch (does away with lambda limits)
- CW Events + Fargate (no need to manage EC2 level infrastructure like in Batch)
- EMR
Redshift - Data Warehousing
- Based on PostgreSQL, not used for OLTP; instead OLAP for analytics
- Column based, SQL
- Provisioned, not serverless. Not good for sporadic use
Redshift Nodes
- Leader - query planning, results aggregation
- Compute - perform queries, send results to leader
Redshift Integrations
- S3, Kinesis Firehose, DDB, DMS
- Quicksight and Tableau to make dashboards
Redshift Scale
- 10x better performance, scale to PBs of data, parallel query engine
- 1 to 128 nodes up to 160 GB per node
Redshift Networking
- Single AZ, not Multi-AZ, for lower latency
- Enhanced VPC Routing: COPY/UNLOAD traffic goes through the VPC (visible in VPC Flow Logs) instead of the public internet; faster, cheaper
Redshift Backup and Restore
- Backup, PIT snapshots incremental, stored in S3
- Automated: every 8 hours, or every 5 GB (default), or on a schedule. Set retention period
- Manual: retained until deleted
- Restore, snapshots to restore into a new cluster
- Snapshots can be copied automatically to another region for DR
Redshift Integration with AWS Management Services
- IAM
- KMS
- Monitoring
- VPC Security
Redshift Spectrum
- Serverless query S3 data without loading data into Redshift
- Requires Redshift cluster though
Athena
- Serverless SQL queries against S3 data
- Pay per query
- Output to S3
- Supports CSV, JSON, Parquet, ORC
- Queries are logged in CloudTrail -> CloudWatch
Athena Pre Built Query Examples
- VPC Flow Logs
- CloudTrail
- ALB Access Logs
- Cost and Usage Reports, etc
Quicksight
- Data BI
- Integrates with Athena, RDS, Redshift, EMR
- Can use Athena as a backend query engine
- External with Salesforce, Teradata, Jira, Excel
S3 events targets
Only SNS, SQS, Lambda
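A minimal sketch of handling such a notification in a Lambda target (field names follow the documented S3 event structure; note that S3 URL-encodes object keys):

```python
from urllib.parse import unquote_plus

def s3_event_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification.
    Object keys arrive URL-encoded, so decode them before use."""
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event["Records"]
    ]
```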