Cloud Storage Flashcards
3 Main Categories (steps) of the AWS Data Pipeline
- Ingest (Storage Gateway)
- Transform and store (S3)
- Serve and consume (EMR)
Google’s version of AWS 3 step pipeline
- Ingest
- Analyze
- Serve
(similar, but more generic; used in the professor's example)
Supporting services that supplement Ingestion, Analytics and Serve (2)
- Storage
- Security
(also: computing and networking)
DIKW pyramid
Data -> Information -> Knowledge -> Wisdom
3 tiers of data structure
Structured
Semi-structured
Unstructured
3 levels of data abstraction
Block level (Amazon EBS volumes attached to EC2)
File level (S3)
Database level (Amazon RDS)
Data access models
NoSQL (4 types: key-value, document, column-family, graph)
Relational database (Amazon RDS)
AWS S3
Simple Storage Service: a place to store all types of data, decoupled from processing, enabling a multi-user setup where different users can bring their own data while maintaining isolation and access control.
Object - a file plus its metadata
Bucket - logical container for objects
(can configure access to buckets, geographical region for bucket)
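A hedged boto3 sketch of bucket vs. object (the bucket name, region, key and metadata values are made up):
```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # hypothetical region

# Bucket: logical container for objects, tied to a geographical region
s3.create_bucket(
    Bucket="example-course-data",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Object: the file (body) plus its metadata
s3.put_object(
    Bucket="example-course-data",
    Key="raw/sales.csv",
    Body=b"id,amount\n1,9.99\n",
    Metadata={"source": "pos-system", "owner": "team-a"},  # user-defined metadata
)
```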
AWS S3 storage classes are like bank/investment accounts, why?
Different classes for different access frequencies, from S3 Standard (frequent) to S3 Glacier Deep Archive (accessed maybe once or twice a year!)
It's like how a bank offers better interest on accounts that are accessed less often!
Google Cloud Storage Classes (4)
- Standard (frequent)
- Nearline (Monthly)
- Coldline (Yearly)
- Archive (least frequent)
Why is AWS better?
Offers the S3 Intelligent-Tiering class, which automatically shifts data between access tiers based on access patterns
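A hedged boto3 sketch of picking a storage class per object at upload time (bucket and keys are made up):
```python
import boto3

s3 = boto3.client("s3")

# Rarely-read compliance data: park it straight in Glacier Deep Archive
s3.put_object(
    Bucket="example-course-data",
    Key="archive/2019-audit.csv",
    Body=b"...",
    StorageClass="DEEP_ARCHIVE",
)

# Unknown or changing access pattern: let Intelligent-Tiering move it automatically
s3.put_object(
    Bucket="example-course-data",
    Key="logs/app.log",
    Body=b"...",
    StorageClass="INTELLIGENT_TIERING",
)
```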
AWS Lifecycle Configuration
Set of rules that define actions that S3 will apply to a group of objects
Action types:
Transition - moving objects from one storage class to another (e.g., Glacier to Glacier Deep Archive)
Expiration - when to delete S3 objects
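A hedged boto3 sketch showing both action types in one lifecycle rule (bucket name, prefix and day counts are illustrative):
```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-course-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-then-expire",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Transition actions: move objects to cheaper classes over time
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expiration action: delete the objects after 5 years
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```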
Data Pipelines
Automated workflows that move and process data from one system to another, cleaning, transforming and enriching the data along the way
Landing Area (LA) Data Lake
Where raw data is saved by the ingestion layer
Staging Area (SA) Data Lake
Place where Raw data goes after basic quality transformations, ensuring that it conforms to existing schemas
Archive Area (A)
Stores the original raw data for future reference, debugging, or reprocessing
Production Area (PA) Data Lake
Apply business logic to data from Staging Area (SA)
Aggregating: summarizing sales by store, region or product
Business-specific calculations (e.g., profit margin)
Pass-through job (Optional) Data Lake
Copy of data from Staging Area (SA) is passed directly to Cloud Data Warehouse without business logic.
For comparison and debugging
Failed Area (FA) Data Lake
Captures data that encounters issues, such as bugs in pipeline code or cloud resource failures, so that errors can be dealt with
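A toy, in-memory sketch of how one batch could flow through these areas (all data and helper names here are stand-ins, not a real framework):
```python
# Illustrative only: in-memory stand-ins for the data lake areas (LA, SA, AA, PA, FA)
areas: dict[str, dict[str, list[dict]]] = {
    "landing": {}, "staging": {}, "archive": {}, "production": {}, "failed": {}
}

def process_batch(batch_id: str) -> None:
    raw = areas["landing"][batch_id]            # LA: raw data from ingestion
    areas["archive"][batch_id] = list(raw)      # AA: keep originals for reprocessing
    try:
        # SA: light quality transformation, enforce an expected schema
        staged = [{"store": r["store"], "amount": float(r["amount"])} for r in raw]
        areas["staging"][batch_id] = staged
        # PA: business logic, e.g. aggregate sales per store
        totals: dict[str, float] = {}
        for r in staged:
            totals[r["store"]] = totals.get(r["store"], 0.0) + r["amount"]
        areas["production"][batch_id] = [{"store": s, "total": t} for s, t in totals.items()]
    except (KeyError, ValueError):
        areas["failed"][batch_id] = raw         # FA: capture bad batches for debugging

areas["landing"]["batch_001"] = [{"store": "A", "amount": "9.99"}, {"store": "A", "amount": "5"}]
process_batch("batch_001")
```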
4 folders used to organize data in a logical structure (hint: file directory)
Namespace - group pipelines together
Pipeline Name - reflecting its purpose; for example, a pipeline that takes data from the LA, applies processing steps and saves to the SA could be one pipeline
Data Source name - assigned by ingestion layer
BatchID - unique identifier for any batch of data saved into the LA
Namespace
Groups multiple pipelines or areas logically, such as “landing,” “staging,” or “archive”
Pipeline Name
Each pipeline gets a name reflecting its purpose (sales_oracle_ingest)
Data Source Name
The specific data source, such as customer data
Batch ID
Unique identifier that is assigned to each batch of ingested data into the landing area
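A small sketch of how the four levels might compose an object key; apart from sales_oracle_ingest, all names are made up:
```python
from datetime import datetime, timezone
from uuid import uuid4

def batch_key(namespace: str, pipeline: str, source: str, filename: str) -> str:
    """Compose an object key as namespace/pipeline/source/batch_id/filename."""
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S") + "-" + uuid4().hex[:8]
    return f"{namespace}/{pipeline}/{source}/{batch_id}/{filename}"

# e.g. landing/sales_oracle_ingest/customers/20240115T093000-1a2b3c4d/part-000.csv
print(batch_key("landing", "sales_oracle_ingest", "customers", "part-000.csv"))
```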
Layers of cloud storage from ingestion (added) to archive and their corresponding AWS or Google Cloud tools
- Ingestion Layer (load raw data from various sources into LA; see the sketch after this card)
- AWS Glue, Amazon Kinesis (streaming)
- Google Cloud Dataflow
- Landing Area (Store raw, unprocessed data)
- Amazon S3
- Google Cloud Storage
- Staging Area (light transformations, quality check)
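For the ingestion layer, a hedged boto3 sketch of writing one record to a Kinesis stream (the stream name and payload are made up):
```python
import json
import boto3

kinesis = boto3.client("kinesis")

record = {"store": "A", "amount": 9.99}
kinesis.put_record(
    StreamName="sales-ingest",          # hypothetical stream
    Data=json.dumps(record).encode(),   # payload as bytes
    PartitionKey=record["store"],       # controls shard assignment
)
```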
What comes out of the Production Area (PA) and where does it go
Data products; they go to the Cloud Data Warehouse
Data Platform vs Data Lake
Lake - store, process, analyze raw data
Platform - all of that plus ingestion, querying, security and governance
Data Lakehouse
Low-cost storage in open, native format, accessible by a variety of systems (lake) combined with optimized queries on structured data (DWH)
Why is Lakehouse good for cloud environments?
Cloud environments typically separate storage from compute: instances are rented only when needed for specific applications and connect to the shared data storage. The lakehouse's ACID transaction guarantees allow multiple users or applications to access that data concurrently without inconsistencies.
Data Independence
You can modify the schema at one level of a database system without altering the schema at the next higher level (physical, logical, external levels)
3 cons of Data Warehouses and Data Lakes
reliability is difficult and costly (both)
data staleness (warehouse)
limited support for machine learning (both)
Is there a real need for many unstructured and integrated datasets?
Governments and organizations are increasingly making structured data more available
Extra layer in the Data Lakehouse that prevents two-tiered ETL
Transactional Metadata layer
Delta Lake
Enhances a data lake with features like ACID transactions, schema enforcement and improved query performance; one of the technologies that turn a lake into a lakehouse
Special feature of Delta Lake
Uses a transaction log (JSON log records) to help optimize performance
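A hedged sketch of the transaction log in action, assuming the deltalake Python package (delta-rs bindings) and a made-up local path:
```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"store": ["A", "B"], "amount": [9.99, 5.00]})

write_deltalake("/tmp/sales_delta", df)                 # commit 0
write_deltalake("/tmp/sales_delta", df, mode="append")  # commit 1

# Each commit is a JSON record under /tmp/sales_delta/_delta_log/
dt = DeltaTable("/tmp/sales_delta")
print(dt.version())   # 1
print(dt.history())   # per-commit metadata reconstructed from the log
```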
Caching (Lakehouse Optimization)
When using a transactional metadata layer such as Delta Lake, the lakehouse can cache files from cloud storage on faster devices such as SSDs and RAM on the processing nodes
Auxiliary Data (Lakehouse Optimization)
Store min-max statistics for each data file (in the same Parquet file used for the transaction log), which enables data-skipping optimizations
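A toy illustration of data skipping with per-file min-max statistics (all file names and values are invented):
```python
# Toy per-file min/max statistics, like those kept alongside the transaction log
file_stats = [
    {"file": "part-000.parquet", "min_date": "2024-01-01", "max_date": "2024-01-31"},
    {"file": "part-001.parquet", "min_date": "2024-02-01", "max_date": "2024-02-29"},
    {"file": "part-002.parquet", "min_date": "2024-03-01", "max_date": "2024-03-31"},
]

def files_to_read(query_min: str, query_max: str) -> list[str]:
    """Skip any file whose [min, max] range cannot overlap the query range."""
    return [
        s["file"]
        for s in file_stats
        if not (s["max_date"] < query_min or s["min_date"] > query_max)
    ]

# Only part-001.parquet overlaps February, so the other files are skipped
print(files_to_read("2024-02-10", "2024-02-20"))
```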
Data Layout (Lakehouse Optimization)
Order records that are typically read together so they are stored close together (e.g., sorting or Z-ordering on commonly filtered columns), reducing how much data a query has to scan
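A hedged sketch of the idea: sort on the commonly filtered column before writing so records read together sit together (engines like Delta Lake also offer Z-ordering for multi-column layouts). Uses the deltalake package and a made-up path again:
```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({
    "event_date": ["2024-03-02", "2024-01-15", "2024-02-10"],
    "amount": [5.0, 9.99, 3.5],
})

# Cluster records by the commonly filtered column so per-file min/max stats
# become tight and data skipping (see Auxiliary Data) works well
ordered = df.sort_values("event_date")
write_deltalake("/tmp/sales_by_date", ordered)  # hypothetical path
```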