Data Engineering Flashcards
S3 key?
It's the full object path inside the bucket: everything after the bucket name, down to the file extension.
Max S3 object size?
5 TB
S3 object tag use cases?
Tags are key/value pairs attached to objects. Use them for:
Lifecycle rules
Data classification
Security (tag-based access policies)
S3 is the storage layer. Name some compute services that work with it.
EC2, Amazon Athena, Amazon Redshift Spectrum, Amazon Rekognition, AWS Glue
Data partitioning on S3: how and why?
s3://bucket/year=…/month=…/… (encode partition keys in the object key prefix)
Speeds up range queries: engines only scan the partitions they need
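A minimal sketch, assuming a hypothetical bucket and event schema, of writing objects under hive-style partition prefixes with boto3:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_event(event: dict, bucket: str = "my-data-lake") -> str:
    """Write an event under a year=/month=/day= prefix (bucket and schema are placeholders)."""
    now = datetime.now(timezone.utc)
    key = (
        f"events/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{now:%H%M%S}-{event['id']}.json"
    )
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))
    return key

# A query engine such as Athena can then prune partitions, e.g.
# SELECT ... WHERE year = '2024' AND month = '01'
```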
S3 Encryption options
SSE-S3 (S3-managed keys)
SSE-KMS (KMS-managed keys)
SSE-C (customer-provided keys)
CSE (client-side encryption)
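A minimal boto3 sketch of requesting SSE-KMS at upload time; the bucket, object key, and KMS key alias are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to encrypt the object server-side with a KMS key (SSE-KMS).
# Bucket, key, and KMS key alias below are placeholders.
s3.put_object(
    Bucket="my-secure-bucket",
    Key="reports/2024/q1.csv",
    Body=b"col_a,col_b\n1,2\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-key",
)

# For SSE-S3, use ServerSideEncryption="AES256" and omit the key ID.
```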
S3 Access
User based:
- IAM policies
Resource based:
- Bucket policy (bucket-wide rules)
- ACL (Access Control List)
What if we do not want to move S3 data over the internet?
Use a VPC Gateway Endpoint; traffic to S3 stays on the AWS network
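One way to enforce this, sketched with boto3 under assumed names: a bucket policy that denies requests not arriving through a specific VPC endpoint (bucket name and endpoint ID are hypothetical):

```python
import json

import boto3

s3 = boto3.client("s3")

# Deny any request that does not come through the given VPC endpoint.
# Bucket name and vpce ID are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideVpcEndpoint",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::my-data-lake",
                "arn:aws:s3:::my-data-lake/*",
            ],
            "Condition": {
                "StringNotEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"}
            },
        }
    ],
}

s3.put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))
```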
S3 logs
Server access logs can be delivered to another S3 bucket
API calls are recorded in CloudTrail
Can you write an S3 policy based on object tags?
Yes.
Add the tag classification=PHI
and impose the restriction on whatever object carries this tag
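A sketch of such a policy with boto3, using the s3:ExistingObjectTag condition key; the bucket name and the allowed role ARN are hypothetical:

```python
import json

import boto3

s3 = boto3.client("s3")

# Deny GetObject on any object tagged classification=PHI unless the caller
# is the approved role. Bucket and role ARN are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictPHIObjects",
            "Effect": "Deny",
            "NotPrincipal": {"AWS": "arn:aws:iam::123456789012:role/phi-readers"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-data-lake/*",
            "Condition": {
                "StringEquals": {"s3:ExistingObjectTag/classification": "PHI"}
            },
        }
    ],
}

s3.put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))
```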
Apache alternative to Kinesis?
Apache Kafka
Kinesis use cases
Logs
Metrics
IoT
ClickStream
Some streaming frameworks
Spark
NiFi
etc…
The Kinesis family
KDS: low-latency streaming ingest at scale
KDA: real-time analytics on streams using SQL
KDF: load streams into S3, Redshift, ElasticSearch, Splunk
KVS: stream video in real time
KDS Facts:
- Provision shards in advance
- Retention: 24 hours to 7 days
- Ability to reprocess and replay data
- Multiple consumers can read from the same stream
- Data is immutable once ingested (cannot be edited)
- Max record size: 1 MB
KDS producer limits
1 MB/s or 1,000 messages/s per shard
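A minimal boto3 producer sketch; the stream name and payload are hypothetical:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Write one record; the partition key determines which shard it lands on,
# and each shard accepts up to 1 MB/s or 1,000 messages/s.
# Stream name and payload are placeholders.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": "u-42", "page": "/home"}).encode("utf-8"),
    PartitionKey="u-42",
)
```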
Classic consumer limits
2 MB/s per shard across all consumers
5 GetRecords API calls/s per shard across all consumers
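A minimal classic (shared-throughput) consumer sketch with boto3, polling one shard with GetRecords; the stream name and shard ID are hypothetical:

```python
import boto3

kinesis = boto3.client("kinesis")

# Read one shard from the oldest available record (TRIM_HORIZON).
# Stream name and shard ID are placeholders.
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["SequenceNumber"], record["Data"])
```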
KDF minimum latency
Near real-time: minimum buffer interval of 60 seconds
KDF targets
Redshift
Amazon S3
ElasticSearch
Splunk
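A minimal boto3 sketch of sending one record to a Firehose delivery stream; the stream name and payload are hypothetical:

```python
import json

import boto3

firehose = boto3.client("firehose")

# Firehose buffers records and delivers them to the configured destination
# (S3, Redshift, ElasticSearch, or Splunk). Stream name is a placeholder.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={
        "Data": (json.dumps({"user_id": "u-42", "page": "/home"}) + "\n").encode("utf-8")
    },
)
```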
KDF scaling
Managed Auto-Scaling
KDF data format conversion
CSV / JSON → Parquet / ORC
Only for the S3 destination
KDF data transformation
Using a Lambda function
e.g. CSV to JSON
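A sketch of a Firehose transformation Lambda that converts CSV lines to JSON; the column layout is an assumption, while the recordId / result / data response shape is the contract Firehose expects:

```python
import base64
import json

def handler(event, context):
    """Firehose transformation Lambda: convert CSV records to JSON lines."""
    output = []
    for record in event["records"]:
        # Incoming data is base64-encoded; assume one CSV line per record
        # with a hypothetical user_id,page,timestamp layout.
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        user_id, page, timestamp = line.split(",")
        transformed = json.dumps(
            {"user_id": user_id, "page": page, "timestamp": timestamp}
        ) + "\n"
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
            }
        )
    return {"records": output}
```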