Cloud Storage ChatGPT Version Flashcards
What is a data pipeline in the context of big data?
A data pipeline is a sequence of operations that transforms and consumes raw data for analysis and storage.
What are the main categories of AWS data pipeline components?
Ingest: Storage Gateway and DataSync (batch); Kinesis, SNS, and SQS (stream)
Transform and Store: S3 and S3 Glacier (storage); Glue (ETL)
Serve and Consume: EMR (managed Hadoop/Spark clusters), Athena, machine learning services
What are the three main categories of Google Cloud data pipeline components?
Ingest: Transfer service (batch), Pub/Sub (stream)
Analyze: Dataproc (batch), Dataflow (stream), Cloud Storage
Serve: BigQuery
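Independent of the cloud provider, both lists follow the same ingest → transform/analyze → serve pattern. A minimal Python sketch of that flow (the stage names and sample records are illustrative, not tied to any specific service):

```python
def ingest(records):
    """Ingest stage: yield raw events (stand-in for Kinesis or Pub/Sub)."""
    yield from records

def transform(events):
    """Transform/analyze stage: clean and normalize each event."""
    for e in events:
        yield {"user": e["user"].strip().lower(), "amount": float(e["amount"])}

def serve(events):
    """Serve stage: aggregate into a queryable result (stand-in for Athena or BigQuery)."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0.0) + e["amount"]
    return totals

raw = [{"user": " Ada ", "amount": "10"}, {"user": "ada", "amount": "5"}]
print(serve(transform(ingest(raw))))  # {'ada': 15.0}
```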
What are the different data storage models and their characteristics?
Structured: Predefined schema and relationships, supports ACID transactions.
Semi-structured: Data stored as documents (e.g., JSON-like).
Unstructured: Stored as files or blobs (objects), each retrieved by its key.
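The three models can be illustrated in a few lines of Python (the table, document fields, and object key below are invented for illustration):

```python
import json
import sqlite3

# Structured: a predefined schema enforced by a relational engine (in-memory SQLite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users VALUES (1, 'Ada')")

# Semi-structured: a JSON-like document; fields may vary per record, no fixed schema.
doc = json.loads('{"id": 1, "name": "Ada", "tags": ["admin"]}')

# Unstructured: an opaque blob addressed by a key, as in object storage.
blob_store = {"images/logo.png": b"\x89PNG..."}

print(db.execute("SELECT name FROM users").fetchone()[0])
print(doc["tags"])
```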
What is AWS S3, and what are its benefits?
Definition: Simple Storage Service (S3) is serverless storage where data is stored as objects in buckets.
Benefits: Unified data architecture, cost-effective, scalable, supports data governance, and decouples storage from compute.
List the storage classes in AWS S3 and their use cases.
Standard: General-purpose storage.
Infrequent Access (IA): For data that is less frequently accessed.
One Zone-IA: Lower cost for data that doesn’t require high availability.
Glacier: Archival storage with a range of retrieval speeds.
Glacier Deep Archive: Lowest-cost, long-term storage for data accessed once or twice a year.
Intelligent-Tiering: Automatically moves data between tiers based on access patterns.
What are lifecycle configuration actions in AWS S3?
Transition: Moves objects to different storage classes based on policies.
Expiration: Deletes expired objects automatically.
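Both actions are expressed as rules in a bucket's lifecycle configuration. A sketch of one rule in the JSON shape that boto3's put_bucket_lifecycle_configuration expects; the dict is only built locally here, and the bucket name in the commented call is hypothetical:

```python
# One rule combining Transition and Expiration actions for objects under logs/.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # Transition: move objects to cheaper storage classes over time.
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            # Expiration: delete objects once they are a year old.
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it requires credentials (sketch only, not executed here):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=lifecycle)
```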
What is a data lake, and how is it structured?
A data lake is a central repository for storing, processing, and analyzing raw data in various formats. It is organized into areas like:
Landing Area (LA): Holds raw, ingested data temporarily.
Staging Area (SA): Data undergoes initial processing.
Archive Area (AA): Stores raw data from the LA for reprocessing.
Production Area (PA): Contains processed data ready for use.
Failed Area (FA): Handles errors and failed processes.
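The flow between these areas can be sketched with plain directories standing in for the zones (the root path, file name, and the uppercase "processing" step are invented for illustration):

```python
import shutil
import tempfile
from pathlib import Path

# Model each data-lake area as a directory under a temporary root.
root = Path(tempfile.mkdtemp())
areas = {a: root / a for a in ("landing", "staging", "archive", "production", "failed")}
for d in areas.values():
    d.mkdir()

def ingest(name: str, data: bytes) -> None:
    """Raw data arrives in the Landing Area."""
    (areas["landing"] / name).write_bytes(data)

def process(name: str) -> None:
    """Copy raw data to the Archive Area, process it in Staging,
    then promote the result to Production (or route it to Failed on error)."""
    raw = areas["landing"] / name
    shutil.copy(str(raw), str(areas["archive"] / name))  # keep raw copy for reprocessing
    staged = areas["staging"] / name
    staged.write_bytes(raw.read_bytes().upper())         # stand-in "processing" step
    try:
        shutil.move(str(staged), str(areas["production"] / name))
    except OSError:
        shutil.move(str(staged), str(areas["failed"] / name))
    raw.unlink()

ingest("orders.csv", b"id,amount\n1,10\n")
process("orders.csv")
```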
What is a data lakehouse, and what are its benefits?
A data lakehouse combines features of data lakes and data warehouses, offering low-cost storage, management features, and support for ACID transactions, data versioning, auditing, and query optimization.
Describe the storage access models supported by AWS.
Relational: Structured, schema-based with ACID support.
Key-value: Fast lookups of large data volumes by key (e.g., DynamoDB).
Document: Semi-structured data stored as documents.
Columnar: Wide-column stores with a flexible, per-row column structure (e.g., Amazon Keyspaces).
Graph: For storing and querying relationships between data sets (e.g., Neptune).
What are common challenges with cloud data storage and processing?
Latency: Metadata operations can be slower due to cloud storage delays.
Consistency: Ensuring data consistency across different storage systems can be difficult.
Reliability: Keeping data consistent between data lakes and warehouses is complex.
What is Delta Lake, and what are its features?