Cloud Storage Flashcards
3 Main Categories (steps) of the AWS Data Pipeline
- Ingest (Gateway)
- Transform and store (S3)
- Serve and consume (EMR)
Google’s version of the AWS 3-step pipeline
- Ingest
- Analyze
- Serve
(similar, but more generic; used in the professor's example)
Supporting services that supplement Ingestion, Analytics and Serve (2)
- Storage
- Security
(compute and networking also support the pipeline)
DIKW pyramid
Data -> Information -> Knowledge -> Wisdom
3 tiers of data structure
Structured
Semi-structured
Unstructured
3 levels of data abstraction
Block level (EC2)
File level (S3)
Database level (Amazon RDS)
Data access models
NoSQL and the 4 types (key-value, document, column, graph)
Relational database (Amazon RDS)
AWS S3
Simple Storage Service: a place to store all types of data, decoupled from processing, enabling a multi-user setup where different users can bring their own data while maintaining isolation and access control.
Object - file and metadata
Bucket - logical container for objects
(can configure access to buckets, geographical region for bucket)
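A minimal sketch of the bucket/object model with boto3; the bucket name, region, key, and metadata here are placeholders, not part of the notes:

```python
import boto3

# Hypothetical bucket name and region; S3 bucket names must be globally unique.
s3 = boto3.client("s3", region_name="us-west-2")

# Bucket: logical container, tied to a geographical region at creation time.
s3.create_bucket(
    Bucket="example-course-data-lake",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Object: the file body plus metadata, addressed by a key inside the bucket.
s3.put_object(
    Bucket="example-course-data-lake",
    Key="raw/users/2024-01-01.csv",
    Body=b"id,name\n1,alice\n",
    Metadata={"source": "ingest-gateway"},
)
```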
AWS S3 storage classes are like bank/investment accounts, why?
Different classes for different access frequencies, from S3 Standard (frequent) to S3 Glacier Deep Archive (accessed once or twice a year!)
It's like how a bank offers better interest on accounts that are rarely accessed!
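In boto3 the class is just a parameter chosen at upload time; a minimal sketch, assuming a hypothetical bucket:

```python
import boto3

s3 = boto3.client("s3")

# Frequently accessed data: S3 Standard (the default class).
s3.put_object(Bucket="example-course-data-lake", Key="hot/report.csv",
              Body=b"...", StorageClass="STANDARD")

# Rarely accessed data (once or twice a year): Glacier Deep Archive,
# much cheaper to store but slow and costlier to retrieve.
s3.put_object(Bucket="example-course-data-lake", Key="cold/2019-archive.csv",
              Body=b"...", StorageClass="DEEP_ARCHIVE")
```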
Google Cloud Storage Classes (4)
- Standard (frequent)
- Nearline (~monthly access)
- Coldline (~quarterly access)
- Archive (~yearly access; least frequent)
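A minimal sketch with the google-cloud-storage Python client, assuming a hypothetical bucket name and location:

```python
from google.cloud import storage

client = storage.Client()

# Default class for new objects in this bucket: Nearline (~monthly access).
bucket = client.bucket("example-course-archive")
bucket.storage_class = "NEARLINE"
bucket = client.create_bucket(bucket, location="us-central1")

# Individual objects can be rewritten into a colder class later.
blob = bucket.blob("logs/2023.json")
blob.upload_from_string('{"events": []}')
blob.update_storage_class("COLDLINE")
```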
Why is AWS considered better here?
It offers the S3 Intelligent-Tiering class, which automatically moves objects between access tiers based on observed access patterns
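With boto3, Intelligent-Tiering is just another storage class picked at upload time (hypothetical bucket and key):

```python
import boto3

s3 = boto3.client("s3")

# S3 monitors access patterns and shifts the object between tiers automatically.
s3.put_object(Bucket="example-course-data-lake", Key="unknown-usage/data.parquet",
              Body=b"...", StorageClass="INTELLIGENT_TIERING")
```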
AWS Lifecycle Configuration
Set of rules that define actions that S3 will apply to a group of objects
Action types:
Transition - move objects from one storage class to another (e.g., Glacier → Glacier Deep Archive)
Expiration - when to delete S3 objects
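A sketch of one lifecycle rule combining both action types, via boto3; the bucket name, prefix, and day counts are placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-course-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-delete-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Transition: move objects to colder storage classes over time.
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expiration: delete the objects after ~5 years.
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```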
Data Pipelines
Automated workflows that move and process data from one system to another, cleaning, transforming and enriching the data along the way
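A toy sketch of the idea in Python; the records and cleaning steps are made up for illustration:

```python
def ingest():
    # Pull raw records from a source system (hard-coded here for illustration).
    return [{"name": " Alice ", "age": "30"}, {"name": "bob", "age": "not a number"}]

def transform(records):
    # Clean and enrich: trim names, cast ages, drop rows that fail validation.
    cleaned = []
    for r in records:
        try:
            cleaned.append({"name": r["name"].strip().title(), "age": int(r["age"])})
        except ValueError:
            continue  # reject records that do not conform
    return cleaned

def serve(records):
    # Hand the curated data to consumers (here, just print it).
    for r in records:
        print(r)

serve(transform(ingest()))
```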
Landing Area (LA) Data Lake
Where raw data first lands after ingestion
Staging Area (SA) Data Lake
Place where raw data goes after basic quality transformations that ensure it conforms to existing schemas
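A sketch of promoting an object from a landing prefix to a staging prefix after a basic schema check; the bucket, prefixes, and the check itself are assumptions, not part of the notes:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
BUCKET = "example-course-data-lake"       # hypothetical bucket
KEY = "landing/users/2024-01-01.csv"      # raw object in the landing area

# Basic quality transformation: verify the CSV matches the expected schema.
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")
rows = list(csv.DictReader(io.StringIO(body)))
assert rows and set(rows[0]) == {"id", "name"}, "does not conform to schema"

# Promote the object into the staging area once it passes.
s3.copy_object(
    Bucket=BUCKET,
    Key=KEY.replace("landing/", "staging/"),
    CopySource={"Bucket": BUCKET, "Key": KEY},
)
```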