Cloud Storage Flashcards

1
Q

3 Main Categories (steps) to the AWS Data Pipeline

A
  1. Ingest (Gateway)
  2. Transform and store (S3)
  3. Serve and consume (EMR)
2
Q

Google’s version of AWS 3 step pipeline

A
  1. Ingest
  2. Analyze
  3. Serve

(similar, but more generic; this is the version used in the professor's example)

3
Q

Supporting services that supplement Ingest, Analyze, and Serve (2)

A
  1. Storage
  2. Security

(also: computing, networking)

4
Q

DIKW pyramid

A

Data -> Information -> Knowledge -> Wisdom

5
Q

3 tiers of data structure

A

Structured
Semi-structured
Unstructured

6
Q

3 levels of data abstraction

A

Block level (EC2)
File level (S3)
Database level (Amazon RDS)

7
Q

Data access models

A

NoSQL and its 4 types (key-value, document, column-family, graph)
Relational database (Amazon RDS)

8
Q

AWS S3

A

Simple Storage Service: a place to store all types of data, decoupled from processing, enabling multi-user setups where different users bring their own data while maintaining isolation and access control.

Object - a file and its metadata
Bucket - logical container for objects
(you can configure access to a bucket and its geographical region)
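A minimal boto3 sketch of the bucket/object model (the bucket name, key, body, and metadata below are illustrative):

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# A bucket is a logical container; its region is fixed at creation.
s3.create_bucket(Bucket="example-data-lake")  # hypothetical name

# An object is the file bytes plus metadata, addressed by a key.
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/customers.csv",
    Body=b"id,name\n1,Ada\n",
    Metadata={"source": "crm-export"},  # user-defined metadata
)
```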

9
Q

AWS S3 storage classes are like bank/investment accounts, why?

A

Different classes for different access frequencies, from S3 Standard (frequent) to S3 Glacier Deep Archive (accessed once or twice a year!)

It's like how a bank offers better interest on accounts that are accessed less often!
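A sketch of choosing a class at upload time with boto3 (bucket and key are illustrative):

```python
import boto3

s3 = boto3.client("s3")

# Rarely read data can go straight into an infrequent-access class.
s3.put_object(
    Bucket="example-data-lake",   # hypothetical bucket
    Key="reports/2019-sales.csv",
    Body=b"...",
    StorageClass="STANDARD_IA",   # or "GLACIER", "DEEP_ARCHIVE", ...
)
```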

10
Q

Google Cloud Storage Classes (4)

A
  1. Standard (frequent)
  2. Nearline (Monthly)
  3. Coldline (Yearly)
  4. Archive (least frequent)
11
Q

Why is AWS better (storage classes)?

A

It offers an Intelligent-Tiering class that automatically shifts objects between tiers based on access patterns

12
Q

AWS Lifecycle Configuration

A

Set of rules that define actions that S3 will apply to a group of objects

Action types:

Transition - moving objects from one storage class to another (e.g., Glacier → Glacier Deep Archive)

Expiration - when to delete S3 objects
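A sketch of one lifecycle rule using both action types, via boto3 (bucket, prefix, and day counts are illustrative):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                # Transition: move to a colder class after 90 days...
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # Expiration: ...and delete the objects after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```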

13
Q

Data Pipelines

A

Automated workflows that move and process data from one system to another, cleaning, transforming and enriching the data along the way

14
Q

Landing Area (LA) Data Lake

A

Where raw data is delivered by the ingestion layer

15
Q

Staging Area (SA) Data Lake

A

Where raw data goes after basic quality transformations, ensuring it conforms to existing schemas

16
Q

Archive Area (A)

A

Stores the original raw data for future reference, debugging, or reprocessing

17
Q

Production Area (PA) Data Lake

A

Apply business logic to data from Staging Area (SA)

Aggregating: summarizing sales by store, region, or product
Business-specific calculations (e.g., profit margin)

18
Q

Pass-through job (Optional) Data Lake

A

Copy of data from Staging Area (SA) is passed directly to Cloud Data Warehouse without business logic.

For comparison and debugging

19
Q

Failed Area (FA) Data Lake

A

Captures data that hits issues, such as bugs in pipeline code or cloud resource failures, so errors can be handled

20
Q

4 folders used to organize data in a logical structure (hint: file directory)

A

Namespace - groups pipelines together
Pipeline Name - reflects the pipeline's purpose; for example, a pipeline that takes data from the LA, applies processing steps, and saves to the SA could be one pipeline
Data Source Name - assigned by the ingestion layer
Batch ID - unique identifier for each batch of data saved into the LA (these compose into object keys; see the sketch below)
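A sketch of how the four folders might compose into an object key (all names are illustrative, not fixed conventions):

```python
namespace = "landing"                  # groups pipelines/areas
pipeline_name = "sales_oracle_ingest"  # reflects the pipeline's purpose
data_source = "customers"              # assigned by the ingestion layer
batch_id = "01J9XQ4T7R"                # hypothetical unique ID per batch

key = f"{namespace}/{pipeline_name}/{data_source}/{batch_id}/part-000.csv"
# -> "landing/sales_oracle_ingest/customers/01J9XQ4T7R/part-000.csv"
```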

21
Q

Namespace

A

Groups multiple pipelines or areas logically, such as "landing," "staging," or "archive"

22
Q

Pipeline Name

A

Each pipeline gets a name reflecting its purpose (e.g., sales_oracle_ingest)

23
Q

Data Source Name

A

The specific data source, such as customer data

24
Q

Batch ID

A

Unique identifier assigned to each batch of data ingested into the landing area

25
Q

Layers of cloud storage from ingestion (added) to archive, and their corresponding AWS and Google Cloud tools

A
  1. Ingestion Layer (load raw data from various sources into LA)
    • AWS Glue, Kinesis (stream)
    • Google Cloud DataFlow
  2. Landing Area (Store raw, unprocessed data)
    • Amazon S3
    • Google Cloud Storage
  3. Staging Area (light transformations, quality checks)
    • Amazon S3
    • Google Cloud Storage
  4. Archive Area (keep the original raw data)
    • Amazon S3 (Glacier classes)
    • Google Cloud Storage (Archive class)
26
Q

What comes out of the Production Area (PA), and where does it go?

A

Data products, which go to the Cloud Data Warehouse

27
Q

Data Platform vs Data Lake

A

Lake - store, process, and analyze raw data
Platform - all of that, plus ingestion, querying, security, and governance

28
Q

Data Lakehouse

A

Low-cost storage in open, native format, accessible by a variety of systems (lake) combined with optimized queries on structured data (DWH)

29
Q

Why is Lakehouse good for cloud environments?

A

Cloud architectures separate storage from compute, renting instances only when needed for specific applications connected to the data storage; the Lakehouse's ACID guarantees let multiple users or applications access data concurrently without inconsistencies

30
Q

Data Independence

A

You can modify the schema at one level of database system without altering the schema at the next higher level (physical, logical, external levels)

31
Q

3 cons to Data Warehouse and Lake

A

Reliability is difficult and costly (both)
Data staleness (warehouse)
Limited support for machine learning (both)

32
Q

Is there a real need for many unstructured and integrated datasets?

A

Governments and organizations are increasingly making structured data more available

33
Q

Extra layer in the Data Lakehouse that prevents two-tiered ETL

A

Transactional Metadata layer

34
Q

Delta Lake

A

Enhances a data lake with features like ACID transactions, schema enforcement, and improved query performance, making it one of the technologies that turn a lake into a lakehouse

35
Q

Special feature of Delta Lake

A

Uses a transaction log (JSON log records) to help optimize performance
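Illustrative sketch: the log files under _delta_log/ are newline-delimited JSON, one action per line (the field values below are made up):

```python
import json

# Two typical actions from one commit file, e.g.
# _delta_log/00000000000000000000.json
commit = (
    '{"commitInfo": {"timestamp": 1700000000000, "operation": "WRITE"}}\n'
    '{"add": {"path": "part-000.snappy.parquet", "size": 1024, '
    '"dataChange": true}}'
)

for line in commit.splitlines():
    action = json.loads(line)
    print(next(iter(action)))  # -> commitInfo, add
```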

36
Q

Caching (Lakehouse Optimization)

A

When using a transactional metadata layer such as Delta Lake, the lakehouse can cache files from cloud storage on faster storage devices, such as SSDs and RAM, on processing nodes

37
Q

Auxiliary Data (Lakehouse Optimization)

A

Store min-max statistics for each data file (in the same Parquet file used for the transaction log), which enables data-skipping optimizations
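A toy sketch of data skipping with per-file min-max statistics (file names and stats are invented):

```python
# Hypothetical per-file stats, as the metadata layer might record them.
file_stats = {
    "part-000.parquet": {"order_date": ("2024-01-01", "2024-03-31")},
    "part-001.parquet": {"order_date": ("2024-04-01", "2024-06-30")},
}

def files_for_range(stats, column, lo, hi):
    """Keep only files whose [min, max] range can overlap the query range."""
    keep = []
    for path, cols in stats.items():
        mn, mx = cols[column]
        if mx >= lo and mn <= hi:  # ranges overlap -> must read this file
            keep.append(path)
    return keep

# Only the first file can contain February rows; the second is skipped.
print(files_for_range(file_stats, "order_date", "2024-02-01", "2024-02-28"))
```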

38
Q

Data Layout (Lakehouse Optimization)

A

Order records so that data read together is stored together, e.g., sorting on a commonly queried column or interleaving several columns with a space-filling curve like Z-order (see the sketch below)
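A toy sketch of Z-order interleaving, one common way to cluster records on several columns at once (values are illustrative):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into a single sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions <- x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions  <- y
    return key

# Records close in (x, y) end up close in the sorted file,
# so queries filtering on either column read fewer files.
records = [(3, 7), (1, 2), (8, 5), (4, 4)]
records.sort(key=lambda r: z_order_key(*r))
print(records)
```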