Data Engineering Flashcards
S3 key?
Everything after the bucket name: the prefixes plus the object name, all the way to the file extension.
Max S3 object size?
5 TB
S3 object tag use cases?
Tags are key/value pairs attached to objects; use them for:
- Lifecycle rules
- Data classification
- Security / access control
Storage is handled by S3. Name some compute/analytics services that work on top of it?
EC2, Amazon Athena, Amazon Redshift Spectrum, Amazon Rekognition, AWS Glue
Data partitioning on S3: how and why?
s3://bucket/<partition>=<value>/… (e.g. partitioned by year)
Speeds up range queries, because engines only scan the relevant prefixes.
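A minimal sketch of writing partitioned keys with boto3 (bucket name, prefix scheme, and payload are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and record; partitioning the key by year/month lets
# query engines (Athena, Redshift Spectrum) prune prefixes on range queries.
bucket = "my-data-lake"  # placeholder: replace with your bucket
record = b'{"device_id": "42", "temp": 21.5}'

key = "events/year=2024/month=06/part-0001.json"
s3.put_object(Bucket=bucket, Key=key, Body=record)
```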
S3 encryption options
SSE-S3 (S3-managed keys)
SSE-KMS (keys managed in AWS KMS)
SSE-C (customer-provided keys)
CSE (client-side encryption)
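A minimal boto3 sketch of the two most common server-side options (bucket name and KMS key alias are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: S3-managed keys
s3.put_object(Bucket="my-bucket", Key="sse-s3.txt", Body=b"hello",
              ServerSideEncryption="AES256")

# SSE-KMS: keys managed in AWS KMS (key alias is a placeholder)
s3.put_object(Bucket="my-bucket", Key="sse-kms.txt", Body=b"hello",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/my-key")
```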
S3 access control
User-based:
- IAM policies
Resource-based:
- Bucket policies
- ACLs
What if we do not want data moving to/from S3 to traverse the public internet?
Use a VPC Gateway Endpoint.
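A rough boto3 sketch of creating an S3 gateway endpoint (VPC ID, route table, and region are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Gateway endpoint for S3: traffic from the VPC to S3 stays on the AWS network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                 # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.s3",      # adjust to your region
    RouteTableIds=["rtb-0123456789abcdef0"],       # placeholder route table
)
```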
S3 logging
S3 server access logs go to another S3 bucket.
API calls are logged in CloudTrail.
Can you write an S3 policy based on object tags?
Yes.
E.g. add the tag classification=PHI
and impose the restriction on any object carrying that tag.
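A hypothetical bucket policy, applied via boto3, that denies reads on any object tagged classification=PHI (bucket name is a placeholder; in practice you would scope the principal or add exceptions):

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny GetObject on any object whose tag classification equals PHI.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RestrictPHI",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
        "Condition": {
            "StringEquals": {"s3:ExistingObjectTag/classification": "PHI"}
        },
    }],
}

s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```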
Apache alternative to Kinesis?
Kafka
Kinesis use cases
Logs
Metrics
IoT
ClickStream
Some streaming frameworks
Spark
NiFi
etc…
The Kinesis family
KDS (Kinesis Data Streams): low-latency streaming ingest at scale
KDA (Kinesis Data Analytics): real-time analytics on streams using SQL
KDF (Kinesis Data Firehose): load streams into S3, Redshift, ElasticSearch, Splunk
KVS (Kinesis Video Streams): stream video in real time
KDS facts:
- Shards must be provisioned in advance
- Retention: 24 hours to 7 days
- Ability to reprocess and replay data
- Multiple consumers can read the same stream
- Data is immutable once ingested (cannot be edited)
- Max record size: 1 MB
KDS producer limits
1 MB/s or 1,000 records/s per shard
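A minimal boto3 producer sketch (stream name and payload are hypothetical); each record counts against the per-shard limits above:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# The partition key decides which shard the record lands on;
# the payload must stay under the 1 MB record size limit.
event = {"sensor": "temp-01", "value": 21.5}
kinesis.put_record(
    StreamName="my-stream",              # placeholder: an existing stream
    Data=json.dumps(event).encode(),
    PartitionKey=event["sensor"],
)
```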
Classic (shared) consumer limits
2 MB/s per shard
5 GetRecords API calls/s per shard, shared across all consumers
KDF minimum latency
Near real-time: minimum buffer interval of 60 seconds
KDF targets
Redshift
Amazon S3
ElasticSearch
Splunk
KDF scaling
Managed Auto-Scaling
KDF Data conversion
CSV / JSON → Parquet / ORC
Only when the target is S3
KDF data transformation
Using a Lambda function,
e.g. CSV to JSON
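A sketch of the Firehose transformation Lambda contract, here assuming an incoming CSV line of the form sensor,value:

```python
import base64
import json

# Firehose passes a batch of base64-encoded records; each one must be
# returned with its recordId, a result, and the (re-encoded) data.
def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        csv_line = base64.b64decode(record["data"]).decode("utf-8").strip()
        sensor, value = csv_line.split(",")
        transformed = json.dumps({"sensor": sensor, "value": float(value)}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```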
KDF Compression
when target is S3
GZIP, ZIP, SNAPPY
KDF Pricing
Pay as you go
KDF Sources
SDK, KPL, Kinesis Agent, KDS, CloudWatch Logs & Events, IoT rule actions
KDS latency
~70 ms with enhanced fan-out
~200 ms with classic (shared) consumers
KDF vs KDS
Ingestion (delivery) vs Streaming
Anomaly detection
RANDOM_CUT_FOREST
It scores anomalies based on recent history.
Detect Dense Areas
HOTSPOTS
locate and return information about relatively dense areas
Runtime options on Kinesis Data Analytics
SQL
Apache Flink
How many video inputs can each Kinesis Video Stream receive?
Just one producer per stream.
1,000 cameras? Run 1,000 video streams.
KVS inputs?
Cameras, AWS DeepLens, smartphone cameras, audio feeds, images, RTSP cameras (via the Producer SDK)
KVS targets?
Amazon SageMaker, Amazon Rekognition Video, EC2 consumers (e.g. TensorFlow, MXNet)
KVS data retention
1 hour to 10 years
Fargate?
Runs containers serverlessly and scales automatically.
KVS use case (example pipeline)
- Feed camera output to KVS
- Run a consumer in a container on Fargate (use DynamoDB for checkpointing)
- Send decoded frames to SageMaker for ML inference
- Publish the results to KDS
- Fire off Lambda to e.g. send notifications
Which AWS services can use the Glue Data Catalog?
Amazon Redshift
Amazon Athena > QuickSight
Amazon EMR
Does Glue transform data as well?
Yes: Glue can transform, clean, and enrich data
using generated ETL code in Python or Scala,
or Spark / PySpark scripts you provide.
Glue Targets
S3
JDBC (RDS, Redshift)
Glue Data Catalog
Glue Cost modeling
Pay as you go for the resources consumed
Where does Glue run?
On a serverless Spark platform.
Glue Scheduler?
Schedules ETL job runs.
Glue Triggers?
Automates job runs based on events.
Glue transformations, how?
Bundled:
- DropFields, DropNullFields: remove fields / null fields
- Filter: filter records
- Join: enrich data
- Map: add fields, delete fields, perform external lookups
Machine learning transformations:
- FindMatches: identify duplicate or matching records
Format conversions:
- CSV, JSON, Avro, ORC, Parquet, XML
Apache Spark transformations:
- e.g. K-Means
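A sketch of a Glue ETL script (PySpark) using a few of the bundled transforms; database, table, and S3 path names are placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Filter, Join, DropNullFields

glue_context = GlueContext(SparkContext.getOrCreate())

# Read two tables registered in the Glue Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")
customers = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="customers")

# Filter: keep only completed orders.
completed = Filter.apply(frame=orders, f=lambda row: row["status"] == "completed")

# Join: enrich orders with customer attributes.
enriched = Join.apply(completed, customers, "customer_id", "customer_id")

# DropNullFields: remove null fields, then write Parquet back to S3.
clean = DropNullFields.apply(frame=enriched)
glue_context.write_dynamic_frame.from_options(
    frame=clean, connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet")
```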
Glue job types
Spark (Python 2, Python 3, or Scala), Python shell
Name some AWS Storage services
Redshift
- Columnar
- SQL
- OLAP
- Load from S3
- Redshift Spectrum runs on S3 without loading
RDS
- OLTP
- SQL
DynamoDB
- NoSQL
S3
- Object Storage
ElasticSearch
- Search amongst data points
- indexing
- Clickstream analytics
ElastiCache:
- Caching and in-memory
AWS Data Pipeline sources / destinations
include:
- S3
- RDS
- Redshift
- DynamoDB
- EMR
Data sources may be on-premises
Glue vs Data Pipeline
Glue
- Managed
- Run Apache Spark, Scala, Python
- Focus on ETL and not configuration or managing resources
Data Pipeline
- Orchestration service
- Gives more control over the environment, code, EC2 instances…
- Allows access to the underlying EC2 or EMR instances
AWS Batch?
Run batch jobs as Docker images
Dynamic provisioning of instances (EC2 & Spot)
Optimal quantity based on the volume and requirements of the jobs
No need to manage clusters, fully serverless
Pay for EC2 instances
How to schedule Batch jobs?
Using CloudWatch Events (e.g. a cron rule); see the sketch below.
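A rough boto3 sketch of a nightly cron rule targeting a Batch job queue (ARNs, job definition, and role are placeholders):

```python
import boto3

events = boto3.client("events")

# Cron rule: fire every night at 02:00 UTC.
events.put_rule(Name="nightly-batch-job",
                ScheduleExpression="cron(0 2 * * ? *)")

# Target a Batch job queue with a hypothetical job definition.
events.put_targets(
    Rule="nightly-batch-job",
    Targets=[{
        "Id": "batch-target",
        "Arn": "arn:aws:batch:us-east-1:123456789012:job-queue/my-queue",
        "RoleArn": "arn:aws:iam::123456789012:role/events-batch-role",
        "BatchParameters": {
            "JobDefinition": "my-job-def",
            "JobName": "nightly-cleanup",
        },
    }])
```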
How to orchestrate Batch jobs?
Using AWS Step Functions
AWS Batch vs Glue
Resources are created in the account by Batch
Docker image must be provided
Batch is better for non-ETL related work
- e.g. Cleaning an S3 bucket
Glue better for ETL and Transformation
AWS DMS: does the source remain available during migration?
Yes.
DMS vs Glue?
Glue runs in batches, at minimum every ~5 minutes.
DMS (Database Migration Service) replicates in near real time.
DMS does very little transformation.
How is DMS near real-time?
It uses continuous data replication (change data capture, CDC).
AWS Step functions?
Design Workflows
Easy Visualization
Error handling and retry mechanisms
Audit of the execution history
Option to wait for an arbitrary amount of time
Max execution of a state machine is 1 year
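A minimal sketch of defining a state machine via boto3, with a Wait state and a retry on the task (ARNs are placeholders):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Amazon States Language definition: wait 60 s, then run a Lambda task
# that retries up to 3 times on any error.
definition = {
    "StartAt": "WaitABit",
    "States": {
        "WaitABit": {"Type": "Wait", "Seconds": 60, "Next": "ProcessData"},
        "ProcessData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process",
            "Retry": [{"ErrorEquals": ["States.ALL"],
                       "IntervalSeconds": 5, "MaxAttempts": 3}],
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="etl-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-role")
```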
AWS Services for any sort of ETL?
Glue
Batch
Data Pipeline
Step Functions (to orchestrate)