Data Engineering Flashcards
Kinesis - Data Streaming
- Managed, high-scale, real-time
- Synchronously replicated across 3 AZs
Kinesis Streams
Low-latency streaming ingest at scale
- Not serverless: shards have to be provisioned in advance
- Need custom code for producers/consumers
- Data retention 24 hrs (default), up to 7 days; for longer storage, use KDF (Kinesis Data Firehose) to write to S3
- Can replay/reprocess
- Multiple applications can consume from same stream
- Once inserted cannot be deleted (immutable)
Kinesis Analytics
Perform real-time analytics on streams using SQL
Kinesis Firehose
Load streams into S3, Redshift, Elasticsearch, Splunk ONLY
Kinesis Streams Shards
- More shards = more throughput (better scaling)
- One stream made of many shards
- Billing per shards
- Batching is supported
- Add and remove shards at any time (resharding)
- Records are ordered per shard, but not across shards
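A minimal Python sketch of why ordering holds per shard but not across shards: Kinesis routes each record by an MD5 hash of its partition key over a 128-bit key space. This model assumes the hash-key range is split evenly across shards (real shards own explicit hash-key ranges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, mimicking Kinesis's
    MD5 hash over a 128-bit key space, split evenly across shards
    (simplified model)."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * num_shards // 2**128
```

Records sharing a partition key always map to the same shard, which is why ordering is guaranteed per shard but not across shards.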
Kinesis Producers
- AWS SDK: simple producer
- Kinesis Producer Library (KPL): application level, batching, compression, retries; supports Java, C++
- Kinesis Agent: instance level, sends log files directly
All can send directly to either:
- Kinesis Streams
- Kinesis Firehose
Kinesis Consumers
- AWS SDK, simple consumer
- Lambda
- Kinesis Client Library (KCL)
- Checkpointing, coordinated reads
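For the Lambda consumer above, a minimal handler sketch: Kinesis delivers each record's payload base64-encoded inside the event (field names follow the documented Kinesis event structure; JSON payloads assumed):

```python
import base64
import json

def handler(event, context):
    """Decode and parse each Kinesis record in a Lambda invocation.
    Payloads arrive base64-encoded under record["kinesis"]["data"]."""
    messages = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        messages.append(json.loads(payload))
    return messages
```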
Kinesis Producer Limits
- 1 MB/s or 1,000 records/s throughput PER shard
- Otherwise ProvisionedThroughputExceededException
Kinesis Consumer Limits
Classic:
- 2 MB/s read PER shard, shared across all consumers
- 5 GetRecords API calls per second PER shard, across all consumers
- ~200ms latency
Enhanced Fan-Out:
- 2 MB/s read PER shard PER enhanced consumer
- No API calls needed (push model)
- ~70ms latency
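The difference between the two modes comes down to per-consumer read bandwidth; a simple model of the limits above:

```python
def per_consumer_read_mbps(num_consumers: int, enhanced_fan_out: bool) -> float:
    """Classic: 2 MB/s per shard is shared by all consumers.
    Enhanced fan-out: each registered consumer gets its own 2 MB/s."""
    return 2.0 if enhanced_fan_out else 2.0 / num_consumers
```

With 4 classic consumers, each effectively reads 0.5 MB/s per shard; with enhanced fan-out, every consumer keeps the full 2 MB/s.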
Kinesis Firehose
- Managed, auto-scaling, serverless
- Near real-time (60-second minimum latency for non-full batches)
- Supports many data formats; conversions, transformations, and compression via Lambda
- No data stored, no replay
Kinesis Firehose Billing
- Pay for the amount of data going through
Kinesis Firehose Use Case
- To go to Redshift, have to output to S3 bucket then copy to Redshift from it
- Can send to Kinesis Data Analytics
- Can also store in another S3 bucket:
- Source records
- Transformation Failures
- Delivery Failures
Kinesis Firehose Buffer
- Flushed based on time and size rules
- High throughput, size buffer will be hit
- Low throughput, time buffer will be hit
- Can automatically scale the buffer during high throughput
- If real-time flush from streams to S3 needed, use lambda
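The flush rules above can be sketched as a tiny buffer model (illustrative only; real Firehose buffering hints are configured on the delivery stream):

```python
import time

class DeliveryBuffer:
    """Flush when either the size hint or the interval hint is hit,
    whichever comes first (a simplified Firehose-style buffer)."""

    def __init__(self, max_bytes: int, max_seconds: float, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.records = []
        self.started = clock()

    def add(self, record: bytes) -> bool:
        """Buffer a record; return True if this add triggered a flush."""
        self.records.append(record)
        size = sum(len(r) for r in self.records)
        age = self.clock() - self.started
        if size >= self.max_bytes or age >= self.max_seconds:
            self.flush()
            return True
        return False

    def flush(self):
        # Deliver self.records to the destination here (omitted).
        self.records = []
        self.started = self.clock()
```

High throughput fills the size hint first; a trickle of records sits until the time hint expires, which is why Firehose is near real-time rather than real-time.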
Kinesis Analytics Use Case
- Can have both KStreams and KFirehose as inputs
- Reference Data - optional static reference table
- SQL to aggregate
- Produces output and error streams
- Output stream KStream or KFirehose
Kinesis Analytics
- Serverless
- Only pay for resources consumed (not cheap)
- Use IAM to access streaming sources and destinations
- SQL or Flink
- Schema discovery
- Lambda for pre-processing
Streaming 3000 messages of 1KB per second
Possible Architectures
- Kinesis Data Streams -> Lambda - Cheaper
- DDB + DDB Streams -> Lambda - Expensive
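Back-of-the-envelope shard sizing for the 3,000 × 1 KB messages/s workload, using the per-shard producer limits (1 MB/s and 1,000 records/s). This considers ingest only, not consumer egress:

```python
import math

def shards_needed(records_per_sec: int, record_kb: float) -> int:
    """Minimum shards to absorb the ingest, given per-shard limits
    of 1 MB/s and 1,000 records/s (egress ignored in this sketch)."""
    by_size = records_per_sec * record_kb / 1024   # MB/s required
    by_count = records_per_sec / 1000              # record-rate multiple
    return math.ceil(max(by_size, by_count))
```

3,000 records/s at 1 KB is only ~2.9 MB/s, but 3× the record-rate limit, so the record count drives the answer: 3 shards.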
AWS Batch
- Serverless, but pay for underlying EC2 or Spot instances
- Dynamic provisioning of instances based on requirements
- Single job per container, not coordinated by default
AWS Batch Use Cases
- Batch process of images
- Running thousands of concurrent jobs
AWS Batch Triggers
- Cron job (or any other) through CW events
- Step Functions to orchestrate batch jobs
- S3 Event calling lambda that makes API call
- SDK
Lambda vs Batch
Lambda:
- Limited execution time
- Limited runtimes
- Limited temp disk space
- Serverless
Batch:
- No time limit
- Any runtime packaged in Docker image
- Storage EBS, Instance Store defined
- Not fully managed (EC2-ECS)
Lambda Limitations
- 512 MB of ephemeral storage mounted at /tmp
- Deployment package size: 50 MB (default)
- Memory range: 128 MB to 1536 MB
- Maximum execution timeout: 15 minutes
- Request and response body payload size: 6 MB max
Batch Compute Environments
Managed
- Capacity and instance types within the env, eg auto-scaling
- Can choose on-demand or spot
- Can set max spot price, min/max vCPU
- Launched within your own VPC
- If launched within your own private subnet, make sure it has access to the ECS service:
- Through a NAT Gateway/Instance or a VPC Endpoint (VPCE) for ECS
Unmanaged
- Control and manage instance configuration, provisioning and scaling
AWS Batch - Multi Node
- Large scale, HPC, tightly coupled workloads
- Multiple EC2/ECS instances, but single job
- Coordinated, eg parent and child nodes
- Does not work with Spot Instances
- Works better with a cluster placement group for EC2 launch, for low-latency networking
- Highly parallel
- No need to launch, configure, and manage Amazon EC2 resources directly.
Elastic Map Reduce
- Creates Hadoop Clusters, configuration & provisioning
- Good for migrating on-prem clusters
- 100s of EC2 instances, do workload -> shutdown, save $
- Auto-scaling using CW
EMR Integrations
- Launched within VPC, single AZ, for better latency
- DDB through Hive
- EBS for temp storage
- EMRFS (S3) permanent storage, server-side encryption
EMR Cost Optimisation - Purchasing Options
- On-demand
- Reserved
- Spot
- One cluster many jobs, with auto scaling
- One cluster per job
- Uniform instance groups - select single instance type and purchasing option for each node (Auto-scaling)
- Instance fleet - select target capacity, mix instance types and purchasing options (No auto-scaling)
EMR Node Types
- Master - manage cluster, orchestrate, manage health
- Core - run tasks, store data
- Task - run tasks (optional)
Running Jobs on AWS - Use Cases
- EC2, cron job - long running, not HA, not scaleable, hard to monitor
- Scheduled CW event + lambda - more scalable, more visibility, limited by runtime and execution time
- Reactive CW event/ S3 event/API Gateway/SQS/SNS + lambda
- CW Events + Batch (does away with lambda limits)
- CW Events + Fargate (no need to manage EC2 level infrastructure like in Batch)
- EMR
Redshift - Data Warehousing
- Based on PostgreSQL, not used for OLTP; instead OLAP for analytics
- Column based, SQL
- Provisioned, not serverless. Not good for sporadic use
Redshift Nodes
- Leader - query planning, results aggregation
- Compute - perform queries, send results to leader
Redshift Integrations
- S3, Kinesis Firehose, DDB, DMS
- Quicksight and Tableau to make dashboards
Redshift Scale
- 10x better performance, scale to PBs of data, parallel query engine
- 1 to 128 nodes up to 160 GB per node
Redshift Networking
- Single AZ, not Multi-AZ, for lower latency
- Enhanced VPC Routing: COPY/UNLOAD traffic goes through the VPC (visible in VPC Flow Logs) instead of the public internet; faster, cheaper
Redshift Backup and Restore
- Backup, PIT snapshots incremental, stored in S3
- Automated: every 8 hours, or every 5 GB (default), or on a schedule. Set retention period
- Manual: retained until deleted
- Restore, snapshots to restore into a new cluster
- Snapshots can be copied automatically to another region for DR
Redshift Integration with AWS Management Services
- IAM
- KMS
- Monitoring
- VPC Security
Redshift Spectrum
- Serverless query S3 data without loading data into Redshift
- Requires Redshift cluster though
Athena
- Serverless SQL queries against S3 data
- Pay per query
- Output to S3
- Supports CSV, JSON, Parquet, ORC
- Queries are logged in CloudTrail -> CloudWatch
Athena Pre Built Query Examples
- VPC Flow Logs
- CloudTrail
- ALB Access Logs
- Cost and Usage Reports, etc
Quicksight
- Data BI
- Integrates with Athena, RDS, Redshift, EMR
- Can use Athena as a backend query engine
- External with Salesforce, Teradata, Jira, Excel
S3 events targets
Only SNS, SQS, Lambda
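A minimal sketch of handling such a notification in a Lambda target (field names follow the documented S3 event structure; note that S3 URL-encodes object keys):

```python
from urllib.parse import unquote_plus

def s3_event_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification.
    Object keys arrive URL-encoded, so decode them before use."""
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event["Records"]
    ]
```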