ML Data Eng Flashcards
S3 Standard storage class
Frequently accessed data
Low latency and high throughput
Can sustain 2 concurrent facility failures
Use cases:
Big data analytics
Mobile gaming applications
Content distribution
S3 Standard-IA (Infrequent Access)
Less frequently accessed, but requires rapid access when needed
Lower cost than standard
Minimum storage 30 days
99.9% availability
Use case:
Disaster recovery, backups
S3 One Zone-IA
High durability, but within a single AZ
Data lost if AZ is destroyed
Minimum storage 30 days
99.5% availability
Use cases:
Store secondary backups of on-premises data, or data you can re-create
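Where this matters in practice: the storage class is set per object at upload time. A minimal boto3 sketch (bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Upload a re-creatable backup straight into One Zone-IA
# ("my-backup-bucket" and the key are placeholders)
with open("daily.tar.gz", "rb") as f:
    s3.put_object(
        Bucket="my-backup-bucket",
        Key="backups/daily.tar.gz",
        Body=f,
        StorageClass="ONEZONE_IA",  # e.g. "STANDARD_IA", "GLACIER_IR", "DEEP_ARCHIVE"
    )
```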
S3 Glacier (3 types)
Low cost: price for storage + object retrieval
3 types:
Instant retrieval:
Millisecond retrieval → great for data accessed once a quarter
Minimum storage duration of 90 days
Flexible retrieval:
Expedited (1 to 5 mins)
Standard (3 to 5 hours)
Bulk (5 to 12 hours)
Minimum storage of 90 days
Deep archive:
Standard - 12 hrs
Bulk - 48 hrs
Min storage duration is 180 days
Use cases:
Archiving and backup
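Objects in Flexible Retrieval and Deep Archive must be restored before they can be read; the restore tier maps to the speeds above. A hedged boto3 sketch (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Request a temporary restored copy of an archived object
s3.restore_object(
    Bucket="my-archive-bucket",       # placeholder
    Key="logs/2020/archive.tar.gz",   # placeholder
    RestoreRequest={
        "Days": 7,  # how long the restored copy stays readable
        "GlacierJobParameters": {
            "Tier": "Standard",  # "Expedited" | "Standard" | "Bulk"
        },
    },
)
```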
What is S3 Intelligent-Tiering?
Small monitoring and auto-tiering charges
Moves objects automatically between access tiers based on usage
There are no retrieval charges in S3 Intelligent-Tiering
Tiers:
Frequent Access (automatic): default
Infrequent Access (automatic): objects not accessed for 30 days
Archive Instant Access (automatic): objects not accessed for 90 days
Archive Access tier (optional): configurable from 90 days to 700+ days
Deep Archive Access tier (optional): configurable from 180 days to 700+ days
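The two optional archive tiers have to be opted into per bucket. A sketch using boto3, assuming a hypothetical bucket, config id, and prefix:

```python
import boto3

s3 = boto3.client("s3")

# Opt in to the optional archive tiers for objects under a prefix
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-data-bucket",        # placeholder
    Id="archive-old-objects",       # placeholder config id
    IntelligentTieringConfiguration={
        "Id": "archive-old-objects",
        "Filter": {"Prefix": "raw/"},
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```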
Types of lifecycle rules?
Transition actions
E.g. move objects to IA 60 days after creation
Expiration actions
E.g. delete access logs after 365 days
E.g. Delete old versions of files
E.g. Delete incomplete multi-part uploads
Rules can be scoped to a certain prefix or to certain object tags (see the sketch below)
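A sketch of the examples above combined into one boto3 lifecycle configuration (bucket name and prefix are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# One rule combining a transition, an expiration, and multipart-upload
# cleanup, scoped to a prefix
s3.put_bucket_lifecycle_configuration(
    Bucket="my-logs-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "logs-rule",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Transition action: move to Standard-IA 60 days after creation
                "Transitions": [{"Days": 60, "StorageClass": "STANDARD_IA"}],
                # Expiration action: delete logs after 365 days
                "Expiration": {"Days": 365},
                # Clean up incomplete multi-part uploads
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```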
What is Amazon S3 Analytics?
Helps you decide when to transition objects to the right storage class
Gives recommendations for Standard and Standard-IA
–> Does NOT work for One-Zone IA or Glacier
Report is updated daily as a CSV
24 to 48 hours to start seeing data analysis
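Enabling it via boto3 might look like the following sketch (bucket names, the report ARN, and the config id are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Enable storage class analysis, exporting the daily CSV report
# to a second bucket
s3.put_bucket_analytics_configuration(
    Bucket="my-data-bucket",    # placeholder
    Id="transition-analysis",   # placeholder config id
    AnalyticsConfiguration={
        "Id": "transition-analysis",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::my-reports-bucket",  # placeholder
                        "Prefix": "analytics/",
                    }
                },
            }
        },
    },
)
```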
S3 4 ways to encrypt?
SSE-S3 - S3 handles the keys
SSE-KMS - using KMS
Additional security (user must have access to KMS key → we can control access to key)
Audit trail for KMS key
SSE-C - you manage your own keys (sent to S3 in HTTPS headers with each request)
Client-side encryption - data is encrypted before upload
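For the server-side options, encryption is requested per upload via request parameters. A minimal boto3 sketch (bucket, key, and KMS alias are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3: S3 manages the key
s3.put_object(
    Bucket="my-bucket", Key="a.txt", Body=b"data",
    ServerSideEncryption="AES256",
)

# SSE-KMS: encrypt with a specific KMS key (alias is a placeholder)
s3.put_object(
    Bucket="my-bucket", Key="b.txt", Body=b"data",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-app-key",
)
```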
S3 Security
User based → IAM
Resource based → bucket policy and ACLs (object and bucket level)
Bucket policy → e.g. to force encryption at upload (see the sketch below)
Default encryption
Can set this so AWS encrypts on upload → won’t encrypt existing objects
VPC Gateway Endpoint
So traffic doesn’t go over internet
Logging and audit
Access logs can be stored in another S3 bucket
CloudTrail for API calls
Tags to control access
E.g. classification = PHI data
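A sketch of the "force encryption at upload" bucket policy mentioned above, applied with boto3 (the bucket name is a placeholder):

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any PutObject request that doesn't declare server-side encryption
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-bucket/*",  # placeholder
        "Condition": {
            # true when the encryption header is absent
            "Null": {"s3:x-amz-server-side-encryption": "true"}
        },
    }],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```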
Kinesis overview
Managed alternative to Kafka
Application logs, metrics, IoT, clickstreams
“Real-time”
Great for stream processing frameworks (Spark, NiFi, etc)
Data is automatically replicated across 3 AZs
Kinesis Data Streams - what are shards and partitions?
A stream is made of shards; records are distributed to shards by partition key
- Retention is 24 hrs by default, up to 365 days
- Ability to reprocess / replay data
- Multiple applications can consume data
- Once data is inserted into Kinesis, it can’t be deleted (immutability)
- Records can be up to 1MB
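A minimal producer sketch showing the partition key's role (stream name and key are hypothetical); records sharing a partition key hash to the same shard, which preserves their ordering:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# All records for "device-1234" land on the same shard
kinesis.put_record(
    StreamName="clickstream",  # placeholder
    Data=json.dumps({"event": "click", "page": "/home"}).encode(),
    PartitionKey="device-1234",
)
```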
Kinesis Data Streams use cases?
Application logs
metrics
IoT
clickstreams
Kinesis Data Streams modes?
Provisioned:
You choose the number of shards
Each shard gets 1 MB/s or 1,000 records/s in, and 2 MB/s out (shared across consumers in classic mode, or per consumer with enhanced fan-out)
You pay per shard per hour
On-demand:
No need to provision or manage
Default capacity provided (4 MB/s in or 4,000 records/s)
Scales automatically based on the peak throughput observed over the last 30 days
Pay per stream, per hour and data in/out per GB
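Creating a stream in either mode with boto3 (stream names are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you pick the shard count (and pay per shard-hour)
kinesis.create_stream(StreamName="metrics-provisioned", ShardCount=2)

# On-demand mode: no shard count; capacity scales automatically
kinesis.create_stream(
    StreamName="metrics-on-demand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```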
Kinesis Data Stream Producer?
Can send 1 MB/s or 1,000 records/s per shard
→ exceeding this raises “ProvisionedThroughputExceededException” (see the retry sketch below)
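A common producer pattern is to back off and retry on that exception; a sketch, assuming a hypothetical stream name:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, key: str, retries: int = 5):
    """Retry with exponential backoff when a shard's write limit is exceeded."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(
                StreamName="clickstream",  # placeholder
                Data=data,
                PartitionKey=key,
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            time.sleep(2 ** attempt * 0.1)  # back off before retrying
    raise RuntimeError("shard write limit still exceeded after retries")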
Kinesis Data Stream Consumer?
2 MB/s read per shard, shared across all consumers (classic mode; enhanced fan-out gives 2 MB/s per consumer)
5 GetRecords API calls per second per shard, shared across all consumers
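A classic shared-throughput consumer polls each shard with GetRecords; a minimal sketch (stream and shard ids are placeholders) that also respects the 5 calls/s limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Get an iterator starting at the oldest retained record in one shard
it = kinesis.get_shard_iterator(
    StreamName="clickstream",            # placeholder
    ShardId="shardId-000000000000",      # placeholder
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while it:
    resp = kinesis.get_records(ShardIterator=it, Limit=100)
    for record in resp["Records"]:
        print(record["Data"])
    it = resp.get("NextShardIterator")  # None when the shard is closed
    time.sleep(0.2)  # stay under 5 GetRecords calls/s per shard
```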