Data Engineering Flashcards
1
Q
Kinesis - Data Streaming
A
- Managed, highly scalable, real-time data streaming
- Data synchronously replicated across 3 AZs
2
Q
Kinesis Streams
A
Low-latency streaming ingest at scale
- Managed, but not serverless
- Shards must be provisioned in advance
- Need custom code for producers/consumers
- Data retention: 24 hours (default), up to 7 days; for longer storage, use Kinesis Data Firehose to persist to S3
- Can replay/reprocess (see the sketch below)
- Multiple applications can consume from the same stream
- Once inserted, data cannot be deleted (immutable)
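A minimal boto3 sketch of the produce/consume cycle referenced above; the stream name "orders" is illustrative and assumed to already exist:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: records sharing a partition key land on the same shard,
# which is what gives per-shard ordering.
kinesis.put_record(
    StreamName="orders",
    Data=json.dumps({"order_id": 1, "amount": 9.99}).encode(),
    PartitionKey="customer-42",
)

# Consumer: read one shard from the oldest retained record (replay).
shard_id = kinesis.list_shards(StreamName="orders")["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="orders",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["Data"])
```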
3
Q
Kinesis Analytics
A
Perform real-time analytics on streams using SQL
4
Q
Kinesis Firehose
A
Load streams into S3, Redshift, Elasticsearch, or Splunk ONLY
5
Q
Kinesis Streams Shards
A
- More shards = more scale
- One stream is made of many shards
- Billing is per shard
- Batching is supported
- Add and remove shards at any time (resharding; sketch below)
- Records are ordered per shard, but not across shards
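A hedged boto3 sketch of resharding (here, doubling capacity); "orders" is a placeholder stream name:

```python
import boto3

kinesis = boto3.client("kinesis")

# More shards = more throughput, but also more cost (billing is per shard).
current = len(kinesis.list_shards(StreamName="orders")["Shards"])
kinesis.update_shard_count(
    StreamName="orders",
    TargetShardCount=current * 2,
    ScalingType="UNIFORM_SCALING",  # split shards evenly
)
```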
6
Q
Kinesis Producers
A
- AWS SDK: simple producer (batching sketch below)
- Kinesis Producer Library (KPL): batching, compression, retries; application-level, supports Java and C++
- Kinesis Agent: instance-level, sends log files directly
All of the above can send directly to either:
- Kinesis Streams
- Kinesis Firehose
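A sketch of plain-SDK batching with PutRecords; the KPL layers compression and automatic retries on top of this pattern. Names are illustrative:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

batch = [
    {"Data": json.dumps({"event": i}).encode(), "PartitionKey": str(i)}
    for i in range(100)
]
resp = kinesis.put_records(StreamName="orders", Records=batch)

# PutRecords is not all-or-nothing: individual records can fail and
# must be retried by the caller (the KPL automates this).
if resp["FailedRecordCount"]:
    failed = [rec for rec, out in zip(batch, resp["Records"]) if "ErrorCode" in out]
    print(f"retrying {len(failed)} records")
```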
7
Q
Kinesis Consumers
A
- AWS SDK: simple consumer
- Lambda (handler sketch below)
- Kinesis Client Library (KCL): checkpointing, coordinated reads across instances
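A sketch of the Lambda consumer mentioned above; Lambda delivers batches with the payload base64-encoded under record["kinesis"]["data"]:

```python
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(record["kinesis"]["partitionKey"], payload)
    # Raising an exception here would make Lambda retry the whole batch;
    # progress tracking comes from the shard position, not from Lambda.
```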
8
Q
Kinesis Producer Limits
A
- 1 MB/s or 1,000 records/s write throughput PER shard
- Otherwise: ProvisionedThroughputExceededException (backoff sketch below)
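A sketch of the backoff referenced above: exceeding a shard's write limit raises ProvisionedThroughputExceededException, which the producer should retry with exponential backoff. Names are illustrative:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, retries: int = 5):
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="orders", Data=data, PartitionKey=key)
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(0.1 * 2 ** attempt)  # exponential backoff
    raise RuntimeError("shard still throttled after retries")
```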
9
Q
Kinesis Consumer Limits
A
Classic:
- 2 MB/s read PER shard, shared across all consumers
- 5 GetRecords API calls per second PER shard, shared across all consumers
- ~200ms latency
Enhanced Fan-Out:
- 2 MB/s read PER shard PER enhanced consumer
- No polling needed (push model via SubscribeToShard; sketch below)
- ~70ms latency
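A hedged boto3 sketch of enhanced fan-out: register a consumer, then SubscribeToShard pushes records instead of being polled. ARNs, names, and the shard id are placeholders; the consumer must reach ACTIVE state before subscribing:

```python
import boto3

kinesis = boto3.client("kinesis")

stream_arn = kinesis.describe_stream(
    StreamName="orders")["StreamDescription"]["StreamARN"]
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn, ConsumerName="analytics-app")["Consumer"]

# Each SubscribeToShard call pushes events for up to 5 minutes.
events = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",
    StartingPosition={"Type": "LATEST"},
)["EventStream"]
for event in events:
    if "SubscribeToShardEvent" in event:
        print(event["SubscribeToShardEvent"]["Records"])
```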
10
Q
Kinesis Firehose
A
- Managed, auto-scaling, serverless
- Near real-time (minimum 60 seconds latency for non-full batches)
- Supports many data formats; conversions, transformations, and compression via Lambda
- No data storage, no replay (ingestion sketch below)
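A minimal sketch of pushing one record into a delivery stream; the name is illustrative. The trailing newline matters because Firehose concatenates records before delivery:

```python
import json
import boto3

firehose = boto3.client("firehose")
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": json.dumps({"page": "/home"}).encode() + b"\n"},
)
```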
11
Q
Kinesis Firehose Billing
A
- Pay for the amount of data going through
12
Q
Kinesis Firehose Use Case
A
- To deliver to Redshift, Firehose must first write to an S3 bucket, then COPY into Redshift from it (sketch below)
- Can send to Kinesis Data Analytics
- Can also store in a separate S3 bucket:
- Source records
- Transformation failures
- Delivery failures
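A hedged sketch of the Redshift path: the delivery stream config itself names the intermediate S3 bucket and the COPY command. All ARNs and credentials below are placeholders:

```python
import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="events-to-redshift",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "ClusterJDBCURL": "jdbc:redshift://example.us-east-1.redshift.amazonaws.com:5439/db",
        "CopyCommand": {
            "DataTableName": "events",
            "CopyOptions": "json 'auto'",
        },
        "Username": "firehose_user",
        "Password": "PLACEHOLDER",
        "S3Configuration": {  # intermediate bucket the COPY reads from
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            "BucketARN": "arn:aws:s3:::events-staging",
        },
    },
)
```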
13
Q
Kinesis Firehose Buffer
A
- Flushed based on time and size rules, whichever is hit first (config sketch below)
- High throughput: the size limit is hit first
- Low throughput: the time limit is hit first
- Can automatically scale the buffer during high throughput
- If real-time delivery from Streams to S3 is needed, use Lambda instead
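A sketch of the flush rules as buffering hints on an S3 destination; whichever threshold is hit first triggers delivery. ARNs are placeholders:

```python
import boto3

boto3.client("firehose").create_delivery_stream(
    DeliveryStreamName="logs-to-s3",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::my-log-bucket",
        "BufferingHints": {
            "SizeInMBs": 5,           # hit first under high throughput
            "IntervalInSeconds": 60,  # hit first under low throughput
        },
    },
)
```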
14
Q
Kinesis Analytics Use Case
A
- Input can be Kinesis Streams or Kinesis Firehose
- Reference Data: optional static reference table (from S3)
- SQL to aggregate (application sketch below)
- Produces output and error streams
- Output stream goes to Kinesis Streams or Kinesis Firehose
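A heavily hedged sketch of the v1 SQL API: the aggregation is plain SQL passed as a string, here a one-minute tumbling window. All ARNs, names, and the schema are illustrative assumptions:

```python
import boto3

kda = boto3.client("kinesisanalytics")
kda.create_application(
    ApplicationName="orders-per-minute",
    Inputs=[{
        "NamePrefix": "SOURCE_SQL_STREAM",
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:us-east-1:123456789012:stream/orders",
            "RoleARN": "arn:aws:iam::123456789012:role/kda-role",
        },
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {
                    "JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Name": "amount", "SqlType": "DOUBLE", "Mapping": "$.amount"},
            ],
        },
    }],
    # In-application SQL: a pump aggregating into an output stream.
    ApplicationCode=(
        'CREATE OR REPLACE STREAM "OUT" (total DOUBLE); '
        'CREATE OR REPLACE PUMP "P" AS INSERT INTO "OUT" '
        'SELECT STREAM SUM(amount) FROM "SOURCE_SQL_STREAM_001" '
        'GROUP BY STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL \'1\' MINUTE);'
    ),
)
```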
15
Q
Kinesis Analytics
A
- Serverless
- Only pay for resources consumed (not cheap)
- Use IAM to access streaming sources and destinations
- SQL or Flink
- Schema discovery
- Lambda for pre-processing
16
Q
Streaming 3,000 messages of 1 KB per second
Possible Architectures
A
- Kinesis Data Streams -> Lambda: cheaper
- DynamoDB + DynamoDB Streams -> Lambda: more expensive (sizing math below)
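Back-of-the-envelope math behind the cost comparison, using the standard per-shard and per-WCU limits:

```python
msgs_per_s, msg_kb = 3_000, 1

# Kinesis Data Streams: each shard takes 1 MB/s AND 1,000 records/s.
shards_by_bytes = (msgs_per_s * msg_kb) / 1_000   # 3.0
shards_by_records = msgs_per_s / 1_000            # 3.0
print("shards needed:", max(shards_by_bytes, shards_by_records))  # 3

# DynamoDB: 1 WCU = one write/s of up to 1 KB, so the same load needs
# ~3,000 WCU, which is why the DynamoDB route costs far more.
print("WCU needed:", msgs_per_s * msg_kb)         # 3000
```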