Cloud Storage ChatGPT Version Flashcards
What is a data pipeline in the context of big data?
A data pipeline is a sequence of operations that transforms and consumes raw data for analysis and storage.
What are the main categories of AWS data pipeline components?
Ingest: Storage Gateway and DataSync (batch); Kinesis, SNS, and SQS (stream)
Transform and Store: S3 and S3 Glacier (storage); Glue (ETL)
Serve and Consume: EMR (managed Hadoop/Spark clusters), Athena, machine learning services
What are the three main categories of Google Cloud data pipeline components?
Ingest: Transfer service (batch), Pub/Sub (stream)
Analyze: Dataproc (batch), Dataflow (stream), Cloud Storage
Serve: BigQuery
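Independent of the cloud provider, both lists follow the same ingest → transform/analyze → serve pattern. A minimal Python sketch of that flow (the stage names and sample records are illustrative, not tied to any specific service):

```python
def ingest(records):
    """Ingest stage: yield raw events (stand-in for Kinesis or Pub/Sub)."""
    yield from records

def transform(events):
    """Transform/analyze stage: clean and normalize each event."""
    for e in events:
        yield {"user": e["user"].strip().lower(), "amount": float(e["amount"])}

def serve(events):
    """Serve stage: aggregate into a queryable result (stand-in for Athena or BigQuery)."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

raw = [{"user": " Ada ", "amount": "10"}, {"user": "ada", "amount": "5"}]
print(serve(transform(ingest(raw))))  # {'ada': 15.0}
```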
What are the different data storage models and their characteristics?
Structured: Predefined schema and relationships, supports ACID transactions.
Semi-structured: Data stored as documents (e.g., JSON-like).
Unstructured: Stored as files or blobs (objects), each retrieved by its key.
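The three models can be illustrated in a few lines of Python (the table, document fields, and object key below are invented for illustration):

```python
import json
import sqlite3

# Structured: a predefined schema enforced by a relational engine (in-memory SQLite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users VALUES (1, 'Ada')")

# Semi-structured: a JSON-like document; fields may vary per record, no fixed schema.
doc = json.loads('{"id": 1, "name": "Ada", "tags": ["admin"]}')

# Unstructured: an opaque blob addressed by a key, as in object storage.
blob_store = {"images/logo.png": b"\x89PNG..."}

print(db.execute("SELECT name FROM users").fetchone()[0])
print(doc["tags"])
```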
What is AWS S3, and what are its benefits?
Definition: Simple Storage Service (S3) is serverless storage where data is stored as objects in buckets.
Benefits: Unified data architecture, cost-effective, scalable, supports data governance, and decouples storage from compute.
List the storage classes in AWS S3 and their use cases.
Standard: General-purpose storage.
Infrequent Access (IA): For data that is less frequently accessed.
One Zone-IA: Lower cost for data that doesn’t require high availability.
Glacier: Archival storage with a range of retrieval speeds.
Glacier Deep Archive: Lowest-cost, long-term storage for data accessed once or twice a year.
Intelligent-Tiering: Automatically moves data between tiers based on access patterns.
What are lifecycle configuration actions in AWS S3?
Transition: Moves objects to different storage classes based on policies.
Expiration: Deletes expired objects automatically.
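Both actions are expressed as rules in a bucket's lifecycle configuration. A sketch of one rule in the JSON shape that boto3's put_bucket_lifecycle_configuration expects; the dict is only built locally here, and the bucket name in the commented call is hypothetical:

```python
# One rule combining Transition and Expiration actions for objects under logs/.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # Transition: move objects to cheaper storage classes over time.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration: delete objects once they are a year old.
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it requires credentials (sketch only, not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle)
```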
What is a data lake, and how is it structured?
A data lake is a central repository for storing, processing, and analyzing raw data in various formats. It is organized into areas like:
Landing Area (LA): Holds raw, ingested data temporarily.
Staging Area (SA): Data undergoes initial processing.
Archive Area (AA): Stores raw data from the LA for reprocessing.
Production Area (PA): Contains processed data ready for use.
Failed Area (FA): Handles errors and failed processes.
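The flow between these areas can be sketched with plain directories standing in for the zones (the root path, file name, and the uppercase "processing" step are invented for illustration):

```python
import shutil
import tempfile
from pathlib import Path

# Model each data-lake area as a directory under a temporary root.
root = Path(tempfile.mkdtemp())
areas = {a: root / a for a in ("landing", "staging", "archive", "production", "failed")}
for d in areas.values():
    d.mkdir()

def ingest(name: str, data: bytes) -> None:
    """Raw data arrives in the Landing Area."""
    (areas["landing"] / name).write_bytes(data)

def process(name: str) -> None:
    """Copy raw data to the Archive Area, process it in Staging,
    then promote the result to Production (or route it to Failed on error)."""
    raw = areas["landing"] / name
    shutil.copy(str(raw), str(areas["archive"] / name))  # keep raw copy for reprocessing
    staged = areas["staging"] / name
    staged.write_bytes(raw.read_bytes().upper())         # stand-in "processing" step
    try:
        shutil.move(str(staged), str(areas["production"] / name))
    except OSError:
        shutil.move(str(staged), str(areas["failed"] / name))
    raw.unlink()

ingest("orders.csv", b"id,amount\n1,10\n")
process("orders.csv")
```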
What is a data lakehouse, and what are its benefits?
A data lakehouse combines features of data lakes and data warehouses, offering low-cost storage, management features, and support for ACID transactions, data versioning, auditing, and query optimization.
Describe the storage access models supported by AWS.
Relational: Structured, schema-based with ACID support.
Key-value: Fast lookups of large data volumes by key (e.g., DynamoDB).
Document: Semi-structured data stored as documents.
Columnar: Wide-column stores with a flexible, per-row column structure (e.g., Amazon Keyspaces).
Graph: For storing and querying relationships between data sets (e.g., Neptune).
What are common challenges with cloud data storage and processing?
Latency: Metadata operations can be slower due to cloud storage delays.
Consistency: Ensuring data consistency across different storage systems can be difficult.
Reliability: Keeping data consistent between data lakes and warehouses is complex.
What is Delta Lake, and what are its features?