Cloud Storage Flashcards
3 Main Categories (steps) of the AWS Data Pipeline
- Ingest (Storage Gateway)
- Transform and store (S3)
- Serve and consume (EMR)
Google’s version of AWS 3 step pipeline
- Ingest
- Analyze
- Serve
(similar, but more generic; used in the professor's example)
Supporting services that supplement Ingestion, Analytics and Serve (2)
- Storage
- Security
(also: computing and networking)
DIKW pyramid
Data -> Information -> Knowledge -> Wisdom
3 tiers of data structure
Structured
Semi-structured
Unstructured
3 levels of data abstraction
Block level (Amazon EBS volumes attached to EC2)
File level (S3)
Database level (Amazon RDS)
Data access models
NoSQL (4 types: key-value, document, column-family, graph)
Relational database (Amazon RDS)
AWS S3
Simple Storage Service: a place to store all types of data, decoupled from processing, enabling a multi-user setup where different users can bring their own data while maintaining isolation and access control.
Object - a file plus its metadata
Bucket - logical container for objects
(can configure access to buckets, geographical region for bucket)
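A hedged boto3 sketch of bucket vs. object (the bucket name, region, key and metadata values are made up):
```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # hypothetical region

# Bucket: logical container for objects, tied to a geographical region
s3.create_bucket(
    Bucket="example-course-data",  # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Object: the file (body) plus its metadata
s3.put_object(
    Bucket="example-course-data",
    Key="raw/sales.csv",
    Body=b"id,amount\n1,9.99\n",
    Metadata={"source": "pos-system", "owner": "team-a"},  # user-defined metadata
)
```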
AWS S3 storage classes are like bank/investment accounts, why?
Different classes for different access frequencies, from S3 Standard (frequent) to S3 Glacier Deep Archive (accessed maybe once or twice a year!)
It's like how a bank offers better interest on accounts that are accessed less often!
Google Cloud Storage Classes (4)
- Standard (frequent)
- Nearline (Monthly)
- Coldline (Yearly)
- Archive (least frequent)
Why is AWS better?
Offers the S3 Intelligent-Tiering class, which automatically shifts data between access tiers based on access patterns
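A hedged boto3 sketch of picking a storage class per object at upload time (bucket and keys are made up):
```python
import boto3

s3 = boto3.client("s3")

# Rarely-read compliance data: park it straight in Glacier Deep Archive
s3.put_object(
    Bucket="example-course-data",
    Key="archive/2019-audit.csv",
    Body=b"...",
    StorageClass="DEEP_ARCHIVE",
)

# Unknown or changing access pattern: let Intelligent-Tiering move it automatically
s3.put_object(
    Bucket="example-course-data",
    Key="logs/app.log",
    Body=b"...",
    StorageClass="INTELLIGENT_TIERING",
)
```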
AWS Lifecycle Configuration
Set of rules that define actions that S3 will apply to a group of objects
Action types:
Transition - moving objects from one storage class to another (e.g., Glacier to Glacier Deep Archive)
Expiration - when to delete S3 objects
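A hedged boto3 sketch showing both action types in one lifecycle rule (bucket name, prefix and day counts are illustrative):
```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-course-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-then-expire",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                # Transition actions: move objects to cheaper classes over time
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expiration action: delete the objects after 5 years
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```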
Data Pipelines
Automated workflows that move and process data from one system to another, cleaning, transforming and enriching the data along the way
Landing Area (LA) Data Lake
Where raw data is saved by the ingestion layer
Staging Area (SA) Data Lake
Place where Raw data goes after basic quality transformations, ensuring that it conforms to existing schemas
Archive Area (A)
Stores the original raw data for future reference, debugging, or reprocessing
Production Area (PA) Data Lake
Apply business logic to data from Staging Area (SA)
Aggregating: summarizing sales by store, region or product
Business-specific calculations (e.g., profit margin)
Pass-through job (Optional) Data Lake
Copy of data from Staging Area (SA) is passed directly to Cloud Data Warehouse without business logic.
For comparison and debugging
Failed Area (FA) Data Lake
Captures data that encounters issues, such as bugs in pipeline code or cloud resource failures, so that errors can be dealt with
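A toy, in-memory sketch of how one batch could flow through these areas (all data and helper names here are stand-ins, not a real framework):
```python
# Illustrative only: in-memory stand-ins for the data lake areas (LA, SA, AA, PA, FA)
areas: dict[str, dict[str, list[dict]]] = {
    "landing": {}, "staging": {}, "archive": {}, "production": {}, "failed": {}
}

def process_batch(batch_id: str) -> None:
    raw = areas["landing"][batch_id]            # LA: raw data from ingestion
    areas["archive"][batch_id] = list(raw)      # AA: keep originals for reprocessing
    try:
        # SA: light quality transformation, enforce an expected schema
        staged = [{"store": r["store"], "amount": float(r["amount"])} for r in raw]
        areas["staging"][batch_id] = staged
        # PA: business logic, e.g. aggregate sales per store
        totals: dict[str, float] = {}
        for r in staged:
            totals[r["store"]] = totals.get(r["store"], 0.0) + r["amount"]
        areas["production"][batch_id] = [{"store": s, "total": t} for s, t in totals.items()]
    except (KeyError, ValueError):
        areas["failed"][batch_id] = raw         # FA: capture bad batches for debugging

areas["landing"]["batch_001"] = [{"store": "A", "amount": "9.99"}, {"store": "A", "amount": "5"}]
process_batch("batch_001")
```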
4 folders used to organize data in a logical structure (hint: file directory)
Namespace - group pipelines together
Pipeline Name - reflecting its purpose; for example, a pipeline that takes data from the LA, applies processing steps and saves to the SA could be one pipeline
Data Source name - assigned by ingestion layer
BatchID - unique identifier for any batch of data saved into the LA
Namespace
Groups multiple pipelines or areas logically, such as “landing,” “staging,” or “archive”
Pipeline Name
Each pipeline gets a name reflecting its purpose (sales_oracle_ingest)
Data Source Name
The specific data source, such as customer data
Batch ID
Unique identifier that is assigned to each batch of ingested data into the landing area
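A small sketch of how the four levels might compose an object key; apart from sales_oracle_ingest, all names are made up:
```python
from datetime import datetime, timezone
from uuid import uuid4

def batch_key(namespace: str, pipeline: str, source: str, filename: str) -> str:
    """Compose an object key as namespace/pipeline/source/batch_id/filename."""
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S") + "-" + uuid4().hex[:8]
    return f"{namespace}/{pipeline}/{source}/{batch_id}/{filename}"

# e.g. landing/sales_oracle_ingest/customers/20240115T093000-1a2b3c4d/part-000.csv
print(batch_key("landing", "sales_oracle_ingest", "customers", "part-000.csv"))
```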
Layers of cloud storage from ingestion (added) to archive and their corresponding AWS or Google Cloud tools
- Ingestion Layer (load raw data from various sources into LA; see the sketch after this card)
- AWS Glue, Amazon Kinesis (streaming)
- Google Cloud Dataflow
- Landing Area (Store raw, unprocessed data)
- Amazon S3
- Google Cloud Storage
- Staging Area (light transformations, quality check)
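For the ingestion layer, a hedged boto3 sketch of writing one record to a Kinesis stream (the stream name and payload are made up):
```python
import json
import boto3

kinesis = boto3.client("kinesis")

record = {"store": "A", "amount": 9.99}
kinesis.put_record(
    StreamName="sales-ingest",          # hypothetical stream
    Data=json.dumps(record).encode(),   # payload as bytes
    PartitionKey=record["store"],       # controls shard assignment
)
```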
What comes out of the Production Area (PA) and where does it go
Data products; they go to the Cloud Data Warehouse
Data Platform vs Data Lake
Lake - store, process, analyze raw data
Platform - all of that plus ingestion, querying, security and governance
Data Lakehouse
Low-cost storage in open, native format, accessible by a variety of systems (lake) combined with optimized queries on structured data (DWH)
Why is Lakehouse good for cloud environments?
Cloud environments typically separate storage from compute: instances are rented only when needed for specific applications and connect to the shared data storage. The lakehouse's ACID transaction guarantees allow multiple users or applications to access that data concurrently without inconsistencies.
Data Independence
You can modify the schema at one level of a database system without altering the schema at the next higher level (physical, logical, external levels)
3 cons of Data Warehouses and Data Lakes
reliability is difficult and costly (both)
data staleness (warehouse)
limited support for machine learning (both)
Is there a real need for many unstructured and integrated datasets?
Governments and organizations are increasingly making structured data more available
Extra layer in the Data Lakehouse that prevents two-tiered ETL
Transactional Metadata layer
Delta Lake
Enhances a data lake with features like ACID transactions, schema enforcement and improved query performance; one of the technologies that turn a lake into a lakehouse
Special feature of Delta Lake
Uses a transaction log (JSON log records) to help optimize performance
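A hedged sketch of the transaction log in action, assuming the deltalake Python package (delta-rs bindings) and a made-up local path:
```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"store": ["A", "B"], "amount": [9.99, 5.00]})

write_deltalake("/tmp/sales_delta", df)                 # commit 0
write_deltalake("/tmp/sales_delta", df, mode="append")  # commit 1

# Each commit is a JSON record under /tmp/sales_delta/_delta_log/
dt = DeltaTable("/tmp/sales_delta")
print(dt.version())   # 1
print(dt.history())   # per-commit metadata reconstructed from the log
```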
Caching (Lakehouse Optimization)
When using a transactional metadata layer such as Delta Lake, the lakehouse can cache files from cloud storage on faster devices such as SSDs and RAM on the processing nodes
Auxiliary Data (Lakehouse Optimization)
Store min-max statistics for each data file (in the same Parquet file used for the transaction log), which enables data-skipping optimizations
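A toy illustration of data skipping with per-file min-max statistics (all file names and values are invented):
```python
# Toy per-file min/max statistics, like those kept alongside the transaction log
file_stats = [
    {"file": "part-000.parquet", "min_date": "2024-01-01", "max_date": "2024-01-31"},
    {"file": "part-001.parquet", "min_date": "2024-02-01", "max_date": "2024-02-29"},
    {"file": "part-002.parquet", "min_date": "2024-03-01", "max_date": "2024-03-31"},
]

def files_to_read(query_min: str, query_max: str) -> list[str]:
    """Skip any file whose [min, max] range cannot overlap the query range."""
    return [
        s["file"]
        for s in file_stats
        if not (s["max_date"] < query_min or s["min_date"] > query_max)
    ]

# Only part-001.parquet overlaps February, so the other files are skipped
print(files_to_read("2024-02-10", "2024-02-20"))
```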
Data Layout (Lakehouse Optimization)
Order records that are typically read together so they are stored close together (e.g., sorting or Z-ordering on commonly filtered columns), reducing how much data a query has to scan
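A hedged sketch of the idea: sort on the commonly filtered column before writing so records read together sit together (engines like Delta Lake also offer Z-ordering for multi-column layouts). Uses the deltalake package and a made-up path again:
```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({
    "event_date": ["2024-03-02", "2024-01-15", "2024-02-10"],
    "amount": [5.0, 9.99, 3.5],
})

# Cluster records by the commonly filtered column so per-file min/max stats
# become tight and data skipping (see Auxiliary Data) works well
ordered = df.sort_values("event_date")
write_deltalake("/tmp/sales_by_date", ordered)  # hypothetical path
```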