Data Engineering Fundamentals Flashcards

1
Q

Who coined the term “Data Lake” and why?

A

James Dixon coined the term to describe a flexible storage solution that holds data in its raw, natural format.

2
Q

Why did the concept of data lakes evolve?

A

Data lakes evolved out of the need for storage solutions that can handle the volume and variety of modern digital data.

3
Q

What is a primary benefit of a data lake?

A

Centralized, flexible, and scalable for various data types.

Explanation: Data lakes offer centralization, allowing data from multiple sources to be stored in one place. They are flexible in handling various data types and scalable to accommodate growing data volumes.

4
Q

What challenge does a ‘data swamp’ represent in the context of data lakes?

A

Lack of proper data management and governance.

Explanation: A data swamp occurs when a data lake is poorly managed and governed, leading to inaccessible, non-compliant, and low-quality data. It underscores the need for strict governance and metadata management.

5
Q

How does a data lake support cost-effectiveness?

A

Through cloud-based solutions with pay-as-you-go pricing.

Explanation: Data lakes, particularly cloud-based ones built on services like Amazon S3 or Azure Data Lake Storage, offer cost-effective storage. They are especially beneficial for startups, providing scalable storage on a pay-as-you-go model and avoiding large upfront costs.

6
Q

What is the primary purpose of storage in a data lake architecture?

A

To store petabytes of data from diverse sources.

Explanation: The core of any data lake is its storage layer, which is designed to be scalable, robust, and capable of holding vast amounts of data from a variety of sources.

7
Q

Which AWS service is primarily used for metadata management in a data lake?

A

AWS Glue is used for data discovery, preparation, and cataloging, playing a key role in organizing metadata within a data lake.
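As an illustration, a minimal boto3 sketch that browses the Glue Data Catalog; the database name and region are placeholders, not from the card:

```python
import boto3

# List the tables that Glue has cataloged for a data lake database,
# along with the S3 location each table points to.
glue = boto3.client("glue", region_name="us-east-1")  # region is a placeholder

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="my_datalake_db"):  # placeholder name
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(f"{table['Name']}: {location}")
```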

8
Q

What does orchestration in the context of a data lake refer to?

A

Coordination and management of data processing and integration tasks.

Explanation: Orchestration in a data lake coordinates and manages the various data processing and integration tasks so that they run efficiently and in the right order.
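The card names no specific orchestrator; as one common example (an assumption, with placeholder task names, assuming Airflow 2.4+), a minimal Apache Airflow DAG coordinating an ingest step and a transform step:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("copy new files into the raw zone")  # placeholder task body

def transform():
    print("convert raw files to Parquet in the curated zone")  # placeholder

# The DAG encodes ordering: transform only runs after ingest succeeds.
with DAG(dag_id="datalake_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    ingest_task >> transform_task
```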

9
Q

What role does governance play in the success of a data lake?

A

Governance in a data lake is crucial for maintaining data integrity, security, and responsible data management, thereby contributing significantly to the success of the data lake.

10
Q

Which of the following best describes a Data Lake?

A

Data lakes store a vast range of data types, including unstructured and semi-structured data, in their raw formats. They are known for their flexibility and scalability.

11
Q

What is a significant limitation of a Data Warehouse compared to a Data Lake?

A

Data Warehouses are designed for structured data and have limitations in handling raw, unstructured, or semi-structured data, making them less flexible compared to data lakes.

12
Q

What is a Lakehouse in data management?

A

A Lakehouse is an emerging concept that combines the benefits of both Data Warehouses (structured querying capabilities) and Data Lakes (flexibility in handling various data types).

13
Q

Which AWS service is commonly used for scalable and robust storage in a data lake?

A

Amazon S3 is a reliable solution for data lake storage, offering durability, availability, and scalability to handle petabytes of data from diverse sources.
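As an illustration, landing a local file in a data lake bucket with boto3; the bucket and key are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Land a local file in the raw zone, using a date-partitioned key.
s3.upload_file(
    Filename="events.json",
    Bucket="my-datalake-bucket",  # placeholder bucket
    Key="raw/events/year=2024/month=01/day=15/events.json",
)
```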

14
Q

What is the primary role of governance in a data lake?

A

Governance in a data lake is crucial to ensure responsible data management, maintaining data integrity, and managing security risks.

15
Q

What does the orchestration component in a data lake architecture refer to?

A

Orchestration involves the coordination and management of various data processing and integration tasks, ensuring they operate efficiently and cohesively.

16
Q

Which AWS service aids in metadata management within a data lake?

A

AWS Glue helps with the discovery, preparation, cataloging, and organization of metadata, making the data in the data lake searchable and usable.

17
Q

Why is it important to choose the right data format in a data lake?

A

The choice of data format in a data lake has a significant impact on performance and storage costs. Different formats offer benefits in terms of efficiency, functionality, and cost management as data moves through various zones in the data lake.

18
Q

What are the two primary categories of data formats in data lakes?

A

Row and Columnar formats are the two main categories. Row formats (like CSV, JSON, Avro) store data row by row, useful for ingestion but less efficient for analytics. Columnar formats (like Parquet, ORC) store data by columns, offering better performance for analytical queries and storage efficiency.
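A small pandas sketch of the contrast, assuming pyarrow is installed for Parquet support:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": range(1000),
    "country": ["DE", "US"] * 500,
    "amount": [i * 0.1 for i in range(1000)],
})

df.to_csv("events.csv", index=False)  # row format: simple, good for ingestion
df.to_parquet("events.parquet")       # columnar format: good for analytics

# A columnar reader can fetch one column without scanning whole rows;
# a CSV reader must still parse every row in full.
countries = pd.read_parquet("events.parquet", columns=["country"])
```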

19
Q

What is a key advantage of columnar formats such as Parquet and ORC in a data lake?

A

Columnar formats store data by columns, which is advantageous for analytics as it allows efficient access to specific columns and enables effective columnar compression, leading to improved storage efficiency and parallel processing capabilities.

20
Q

In which zone of a data lake are columnar formats typically dominant?

A

In the Curated Zone of a data lake, columnar formats like Parquet and ORC are predominant due to their analytical efficiency. This zone focuses on efficient querying and analytics, making columnar formats more suitable than row formats.

21
Q

Why is batch ingestion typically scheduled during off-peak hours?

A

Batch ingestion is scheduled during off-peak hours to minimize the load on production source systems and to optimize resource utilization. This ensures that the ingestion process does not interfere with critical operations and maintains system efficiency.
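As an illustration, a boto3 sketch that schedules a Glue job for 02:00 UTC via a cron trigger; the job and trigger names are placeholders, and the job must already exist:

```python
import boto3

glue = boto3.client("glue")

# Run the ingestion job every night at 02:00 UTC, outside business hours.
glue.create_trigger(
    Name="nightly-ingest-trigger",                # placeholder name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "nightly-ingest-job"}],  # placeholder job
    StartOnCreation=True,
)
```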

22
Q

What is a key benefit of batch ingestion over real-time ingestion?

A

Batch ingestion is more suited for complex data transformations and cleansing that might be resource-intensive for real-time processing. This method allows for the handling of data in large, accumulated blocks, making it easier to apply comprehensive transformations.

23
Q

Which of the following tools is commonly used for batch ingestion in AWS?

A

AWS Glue is a widely used service for batch ingestion in AWS. It is designed for ETL (extract, transform, load) operations, allowing for the extraction of data from various sources, transformation if necessary, and loading into the data lake.

24
Q

In batch ingestion, how is the batch size and frequency generally determined?

A

The batch size and frequency in batch ingestion are dictated by the volume and complexity of the data. These factors determine how often and how much data should be ingested to balance efficiency with the needs of downstream users and system constraints.

25
Q

What is the purpose of creating two distinct buckets (source and target) in this AWS event-driven ingestion setup?

A

Separate source and target buckets avoid creating a loop when triggering the Lambda function: because the processed file lands in a different bucket, the function cannot inadvertently re-trigger itself.
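A minimal sketch of such a Lambda handler, with placeholder bucket names:

```python
import urllib.parse

import boto3

s3 = boto3.client("s3")
TARGET_BUCKET = "my-target-bucket"  # placeholder; must differ from the source

def lambda_handler(event, context):
    # Fired by an S3 ObjectCreated event on the source bucket.
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Writing to a separate target bucket ensures this copy cannot
        # re-trigger the function and create an infinite loop.
        s3.copy_object(
            Bucket=TARGET_BUCKET,
            Key=f"processed/{key}",
            CopySource={"Bucket": source_bucket, "Key": key},
        )
```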

26
Q

What is a key characteristic of data streaming in contrast to traditional batch processing?

A

Data streaming processes records in a continuous flow as they are generated, in real time or near real time, rather than waiting for data to accumulate into batches.

27
Q

Which of the following is a common technology used for data streaming, according to the script?

A

Apache Kafka, a popular open-source streaming platform used to build real-time data pipelines.

28
Q

What is the role of partition keys in data streaming?

A

They help distribute and organize data efficiently within a stream. A partition key acts as a label on each record that determines which shard (a container for stream data) the record goes to, keeping the stream evenly distributed and related records together.
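As an illustration with Amazon Kinesis (the stream name and record fields are placeholders): records that share a partition key always land on the same shard, which preserves their relative order:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-42", "action": "click"}
kinesis.put_record(
    StreamName="clickstream",               # placeholder stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],          # hashed to choose the target shard
)
```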

29
Q

Which zone in the data lake is optimized for experimentation and development?

A

The Exploratory Zone. It is a dedicated area for experimentation and development, used in particular by data scientists to build and test new data products.

30
Q

What is the purpose of the raw data zone in a data lake?

A

To collect data in its original, unaltered format from various sources, before any transformation is applied.

31
Q

Why is it recommended to have a distinct AWS account for production and development environments in a data lake setup?

A

To avoid mixing production and development data. Separate AWS accounts give each environment a hard boundary, so experiments in development cannot affect production workloads.

32
Q

Why is implementing a folder structure in a data lake important?

A

Implementing a folder structure is crucial for several reasons. It simplifies data management tasks, such as retention and archival, by allowing easy organization and deletion of older data partitions without affecting the rest of the data. Additionally, it significantly improves query performance by enabling queries to process only the relevant subset of data, reducing the amount of data scanned and, consequently, lowering costs.
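A pandas/pyarrow sketch of how partition columns generate such a folder layout; paths and column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "revenue": [100.0, 150.0, 90.0],
})

# Produces Hive-style folders such as sales/year=2024/month=1/...,
# so a query filtering on year/month scans only the matching folders,
# and old partitions can be deleted without touching the rest.
df.to_parquet("sales", partition_cols=["year", "month"])
```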

33
Q

Why is it important to maintain a consistent file structure within subfolders of a data lake bucket?

A

Maintaining a consistent file structure within subfolders is crucial for efficient querying and data access: when all files in a subfolder share the same structure and format, query engines can read them with a single schema, which speeds up retrieval and improves the overall performance of the data lake.

34
Q

At which stage is the analytics engineering workload performed?

A

Data Integration

35
Q

Which one of the following choices is NOT an advantage of the ELT approach compared to ETL?

A

Lower Storage Requirement. ELT loads raw data before transforming it, so it typically needs more storage than ETL, not less.

36
Q

How do we scale an MPP database?

A

MPP can scale horizontally by adding computing nodes.

37
Q

Columnar stores are typically better for OLAP (Online Analytical Processing) apps. True or False?

A

True

38
Q

ELT is typically used in the modern data stack. True or False?

A

True

39
Q

What does decoupled storage mean?

A

Data lives in a shared data store, and compute can be configured to point at the same datasets in that shared storage area, so storage and compute can scale independently.

40
Q

In which one of the following Slowly Changing Dimension (SCD) types do the values of an attribute column never change?

A

Type 0: Retain original

41
Q

Which one of the following Slowly Changing Dimension (SCD) types tracks all historical data?

A

Type 2: Add new row
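A plain-Python sketch of the Type 2 pattern, with illustrative field names: instead of overwriting a changed attribute, close the current row and append a new one:

```python
from datetime import date

dim_customer = [
    {"customer_id": 1, "city": "Berlin",
     "valid_from": date(2022, 1, 1), "valid_to": None, "is_current": True},
]

def scd2_update(rows, customer_id, new_city, change_date):
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return                      # nothing changed
            row["valid_to"] = change_date   # close the old version
            row["is_current"] = False
    rows.append({"customer_id": customer_id, "city": new_city,
                 "valid_from": change_date, "valid_to": None,
                 "is_current": True})

scd2_update(dim_customer, 1, "Hamburg", date(2024, 3, 1))
# dim_customer now keeps both the closed Berlin row and the current Hamburg row.
```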

42
Q

Which dbt command materializes a dbt model?

A

dbt run

43
Q

Which materialization should you use when you work with an influx of event data and want to append new records to an already existing table?

A

Incremental models work best here.
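dbt generates the SQL for an incremental model; the underlying append pattern looks roughly like this sqlite3 sketch with illustrative table names (the WHERE clause plays the role of dbt's is_incremental() filter):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER, ts TEXT)")
conn.execute("CREATE TABLE fct_events (id INTEGER, ts TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02")])

# Append only the records newer than what the target already holds.
conn.execute("""
    INSERT INTO fct_events
    SELECT id, ts FROM raw_events
    WHERE ts > COALESCE((SELECT MAX(ts) FROM fct_events), '')
""")
conn.commit()
```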

44
Q

Which materialization should you use when you don’t need any “actual” materialization of your model?

A

Ephemeral materializations, which compile to common table expressions (CTEs) rather than being built as database objects.

45
Q

What type of SCD tables do Snapshots implement?

A

Type 2

46
Q

What are Singular Tests?

A

Singular tests are one-off, custom tests written as SQL queries in .sql files under the tests/ directory; a test passes when its query returns no rows. (The pre-defined tests you add to a .yml file are dbt's generic tests.)

47
Q

Macros can be used for:

A

Custom tests and for implementing reusable SQL code.

48
Q

Select the correct syntax for defining docs in yml files.

A

'{{ doc("dim_listing_cleansed__min_nights_str") }}'

49
Q

You define a post-hook in dbt_project.yml. When does it get executed?

A

After the execution of every associated model.
