Data Engineering Flashcards

1
Q

Kinesis - Data Streaming

A
  • Managed, high scale, real time
  • Data replicated synchronously across 3 AZs

2
Q

Kinesis Streams

A

Low latency streaming ingest at scale

  • Managed, but not serverless: shards must be provisioned in advance
  • Need custom code for producers/consumers (see the sketch below)
  • Data retention 24 hrs (default), up to 7 days; for longer storage, use Kinesis Data Firehose (KDF) to deliver to S3
  • Can replay/reprocess
  • Multiple applications can consume from the same stream
  • Once inserted, records cannot be deleted (immutable)
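
A minimal boto3 sketch of that custom producer/consumer code, assuming a hypothetical existing stream named my-stream (stream name, payload, and partition key are all illustrative):

```python
import boto3

kinesis = boto3.client("kinesis")

# Producer: write one record; the partition key picks the shard.
kinesis.put_record(
    StreamName="my-stream",            # hypothetical stream name
    Data=b'{"event": "click"}',
    PartitionKey="user-42",
)

# Consumer: read from the start of the first shard.
shard_id = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # oldest record still in retention
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator)["Records"]
```

TRIM_HORIZON starts from the oldest record still within the retention window, which is what makes the replay/reprocess behaviour above possible.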
3
Q

Kinesis Analytics

A

Perform real-time analytics on streams using SQL

4
Q

Kinesis Firehose

A

Load streams into S3, Redshift, Elasticsearch (ES), or Splunk ONLY

5
Q

Kinesis Streams Shards

A
  • More shards = more scale/throughput
  • One stream is made of many shards
  • Billing is per shard
  • Batching is supported
  • Add and remove shards at any time (resharding - see the split sketch below)
  • Records are ordered per shard, but not across shards
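
A hedged sketch of resharding via the SplitShard API (stream name is a placeholder); splitting a shard at the midpoint of its hash-key range turns it into two shards:

```python
import boto3

kinesis = boto3.client("kinesis")

# Find the first shard and its hash-key range.
shard = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]
low = int(shard["HashKeyRange"]["StartingHashKey"])
high = int(shard["HashKeyRange"]["EndingHashKey"])

# Split at the midpoint: one shard becomes two.
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((low + high) // 2),
)
```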
6
Q

Kinesis Producers

A
  • AWS SDK - simple producer (batching sketch below)
  • Kinesis Producer Library (KPL) - batching, compression, retries
    – Application level; supports Java, C++
  • Kinesis Agent
    – Instance level; sends log files directly
  • All can send directly to either:
    – Kinesis Streams
    – Kinesis Firehose
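
The KPL itself is a Java/C++ library, but SDK-level batching can be sketched with boto3's PutRecords call (stream name and payloads are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Up to 500 records per PutRecords call.
batch = [
    {"Data": json.dumps({"n": i}).encode(), "PartitionKey": f"key-{i}"}
    for i in range(100)
]
response = kinesis.put_records(StreamName="my-stream", Records=batch)

# PutRecords is not all-or-nothing: re-send any failed entries.
if response["FailedRecordCount"]:
    retry = [rec for rec, result in zip(batch, response["Records"])
             if "ErrorCode" in result]
    kinesis.put_records(StreamName="my-stream", Records=retry)
```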
7
Q

Kinesis Consumers

A
  • AWS SDK, simple consumer
  • Lambda (handler sketch below)
  • Kinesis Client Library (KCL)
    – Checkpointing, coordinated reads across workers
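
A minimal sketch of the Lambda consumer path, assuming an event source mapping from the stream to this function (record payloads are illustrative JSON):

```python
import base64
import json

def handler(event, context):
    """Process a batch of Kinesis records delivered by Lambda."""
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        print(record["kinesis"]["partitionKey"], payload)
```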
8
Q

Kinesis Producer Limits

A
  • 1 MB/s or 1,000 records/s write throughput PER shard
  • Otherwise ProvisionedThroughputExceededException (retry sketch below)
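
A sketch of handling that throttling error with exponential backoff (stream name and timings are illustrative):

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, retries: int = 5):
    """Retry throttled writes with exponential backoff."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="my-stream",  # hypothetical stream name
                Data=data,
                PartitionKey=key,
            )
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code != "ProvisionedThroughputExceededException":
                raise
            time.sleep(0.1 * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("still throttled after retries")
```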

9
Q

Kinesis Consumer Limits

A

Classic:

  • 2 MB/s read PER shard, shared across all consumers
  • 5 GetRecords API calls per second PER shard, shared across all consumers
  • ~200 ms latency

Enhanced Fan-Out:

  • 2 MB/s read PER shard PER enhanced consumer
  • No API calls needed (push model)
  • ~70ms latency
10
Q

Kinesis Firehose

A
  • Managed, auto-scaling, serverless
  • Near real time (minimum 60-second latency for non-full batches)
  • Supports many data formats; conversions, transformations, and compression via Lambda
  • No data storage, no replay
11
Q

Kinesis Firehose Billing

A
  • Pay for the amount of data going through
12
Q

Kinesis Firehose Use Case

A
  • To load into Redshift, Firehose first writes to an S3 bucket, then issues a Redshift COPY from it
  • Can send to Kinesis Data Analytics
  • Can also store in another S3 bucket:
    • Source records
    • Transformation failures
    • Delivery failures
13
Q

Kinesis Firehose Buffer

A
  • Flushed based on time and size rules (buffering-hints sketch below)
  • High throughput: the size limit is hit first
  • Low throughput: the time limit is hit first
  • Buffer size can scale automatically during high throughput
  • If real-time delivery from streams to S3 is needed, use Lambda instead
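
A hedged sketch of setting those buffer rules when creating a delivery stream (names, ARNs, and the 64 MB / 120 s values are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# Firehose flushes on whichever limit is hit first: size or time.
firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::my-bucket",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 120},
    },
)
```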
14
Q

Kinesis Analytics Use Case

A
  • Can have both KStreams and KFirehose as inputs
  • Reference data - optional static reference table
  • SQL to aggregate
  • Produces output and error streams
  • Output stream goes to KStreams or KFirehose
15
Q

Kinesis Analytics

A
  • Serverless
  • Only pay for resources consumed (not cheap)
  • Use IAM to access streaming sources and destinations
  • SQL or Flink
  • Schema discovery
  • Lambda for pre-processing
16
Q

Streaming 3000 messages of 1KB per second

Possible Architectures

A
  • Kinesis Data Streams -> Lambda - cheaper (shard math below)
  • DDB + DDB Streams -> Lambda - more expensive
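
A quick sanity check on the Kinesis sizing, using the 1 MB/s and 1,000 records/s per-shard write limits from the producer-limits card:

```python
import math

msgs_per_sec = 3000
msg_size_kb = 1

# Shards needed to stay under each per-shard write limit.
by_bytes = math.ceil(msgs_per_sec * msg_size_kb / 1024)  # ~2.93 MB/s -> 3
by_records = math.ceil(msgs_per_sec / 1000)              # 3,000 rec/s -> 3
print(max(by_bytes, by_records))                         # 3 shards
```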

17
Q

AWS Batch

A
  • Managed, but not serverless: you pay for the underlying EC2 or Spot instances
  • Dynamic provisioning of instances based on requirements
  • Single job per container, not coordinated by default
18
Q

AWS Batch Use Cases

A
  • Batch processing of images
  • Running thousands of concurrent jobs

19
Q

AWS Batch Triggers

A
  • Cron job (or any other schedule) through CW Events
  • Step Functions to orchestrate Batch jobs
  • S3 event calling a Lambda that makes the SubmitJob API call (sketch below)
  • SDK
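
A sketch of that S3 -> Lambda -> Batch trigger path; the queue and job-definition names are hypothetical:

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    """S3-triggered Lambda that submits a Batch job for the new object."""
    record = event["Records"][0]["s3"]
    batch.submit_job(
        jobName="process-upload",
        jobQueue="my-job-queue",        # placeholder queue
        jobDefinition="my-job-def",     # placeholder job definition
        containerOverrides={
            "environment": [
                {"name": "BUCKET", "value": record["bucket"]["name"]},
                {"name": "KEY", "value": record["object"]["key"]},
            ]
        },
    )
```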
20
Q

Lambda vs Batch

A

Lambda:

  • Limited execution time
  • Limited runtimes
  • Limited temp disk space
  • Serverless

Batch:

  • No time limit
  • Any runtime packaged in Docker image
  • Storage: EBS or Instance Store, as you define
  • Not fully managed (runs on EC2/ECS)
21
Q

Lambda Limitations

A

  • 512 MB of ephemeral storage mounted at /tmp/
  • Default deployment package size: 50 MB
  • Memory range: 128 MB to 1536 MB
  • Maximum execution timeout: 15 minutes
  • Request and response body payloads: max 6 MB each

22
Q

Batch Compute Environments

A

Managed

  • Capacity and instance types managed within the environment, e.g. auto-scaling
  • Can choose On-Demand or Spot
  • Can set max Spot price, min/max vCPUs
  • Launched within your own VPC
    – If launched within a private subnet, make sure it has access to the ECS service
    – Through a NAT Gateway/Instance or a VPC Endpoint for ECS

Unmanaged

  • You control and manage instance configuration, provisioning, and scaling

23
Q

AWS Batch - Multi Node

A
  • Large scale, HPC, tightly coupled workloads
  • Multiple EC2/ECS instances, but single job
  • Coordinated, eg parent and child nodes
  • Does not work with Spot Instances
  • Works better if the EC2 placement group strategy is cluster, for low-latency networking
  • Highly parallel
  • No need to launch, configure, and manage Amazon EC2 resources directly.
24
Q

Elastic Map Reduce

A
  • Creates Hadoop clusters; handles configuration & provisioning
  • Good for migrating on-prem clusters
  • Spin up 100s of EC2 instances, run the workload, then shut down to save money
  • Auto-scaling via CloudWatch
25
Q

EMR Integrations

A
  • Launched within VPC, single AZ, for better latency
  • DDB through Hive
  • EBS for temp storage
  • EMRFS (S3) permanent storage, server-side encryption
26
Q

EMR Cost Optimisation - Purchasing Options

A
  • On-demand
  • Reserved
  • Spot
  • One cluster many jobs, with auto scaling
  • One cluster per job
  • Uniform instance groups - select single instance type and purchasing option for each node (Auto-scaling)
  • Instance fleet - select target capacity, mix instance types and purchasing options (No auto-scaling)
27
Q

EMR Node Types

A
  • Master - manage cluster, orchestrate, manage health
  • Core - run tasks, store data
  • Task - run tasks (optional)
28
Q

Running Jobs on AWS - Use Cases

A
  • EC2 + cron job - long running, not HA, not scalable, hard to monitor
  • Scheduled CW event + Lambda - more scalable, more visibility, limited by runtime support and execution time
  • Reactive CW event/ S3 event/API Gateway/SQS/SNS + lambda
  • CW Events + Batch (does away with lambda limits)
  • CW Events + Fargate (no need to manage EC2 level infrastructure like in Batch)
  • EMR
29
Q

Redshift - Data Warehousing

A
  • Based on PostgreSQL; not used for OLTP, instead OLAP for analytics
  • Column based, SQL
  • Provisioned, not serverless. Not good for sporadic use
30
Q

Redshift Nodes

A
  • Leader - query planning, results aggregation
  • Compute - perform queries, send results to leader

31
Q

Redshift Integrations

A
  • S3, Kinesis Firehose, DDB, DMS
  • Quicksight and Tableau for dashboards

32
Q

Redshift Scale

A
  • 10x better performance, scale to PBs of data, parallel query engine
  • 1 to 128 nodes up to 160 GB per node
33
Q

Redshift Networking

A
  • Not multi-AZ; runs in a single AZ for low latency
  • Enhanced VPC Routing: COPY/UNLOAD traffic goes through your VPC (visible in VPC Flow Logs) rather than the public internet - faster, cheaper
34
Q

Redshift Backup and Restore

A
  • Backup: point-in-time, incremental snapshots stored in S3
    • Automated: every 8 hours or every 5 GB (default), or on a schedule. Set the retention period
    • Manual: retained until deleted
  • Restore: snapshots restore into a new cluster
  • Snapshots can be copied automatically to another region for DR (sketch below)
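
A sketch of enabling that cross-region snapshot copy (cluster ID, region, and retention are placeholders):

```python
import boto3

redshift = boto3.client("redshift")

# Automatically copy automated snapshots to another region for DR.
redshift.enable_snapshot_copy(
    ClusterIdentifier="my-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep copies in the destination region
)
```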
35
Q

Redshift Integration with AWS Management Services

A
  • IAM
  • KMS
  • Monitoring
  • VPC Security
36
Q

Redshift Spectrum

A
  • Serverless query S3 data without loading data into Redshift
  • Requires Redshift cluster though
37
Q

Athena

A
  • Serverless SQL queries against S3 data
  • Pay per query
  • Output to S3
  • Supports CSV, JSON, Parquet, ORC
  • Queries are logged in CloudTrail -> CloudWatch (query sketch below)
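
A minimal sketch of running one of those serverless queries with boto3 (database, table, and output bucket are placeholders):

```python
import boto3

athena = boto3.client("athena")

# Query data in S3; results are written to S3 as well.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM alb_access GROUP BY status",
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```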
38
Q

Athena Pre Built Query Examples

A
  • VPC Flow Logs
  • CloudTrail
  • ALB Access Logs
  • Cost and Usage Reports, etc
39
Q

Quicksight

A
  • Data BI / visualisation
  • Integrates with Athena, RDS, Redshift, EMR
  • Can use Athena as the backend query engine for S3 data
  • External sources: Salesforce, Teradata, Jira, Excel
40
Q

S3 events targets

A

Only SNS, SQS, Lambda
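
A sketch of wiring one of those targets up, routing object-created events to a Lambda function (bucket name and function ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn":
                "arn:aws:lambda:us-east-1:123456789012:function:on-upload",
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)
```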