Data Engineering Fundamentals - SQL Review Flashcards
A Data Engineer is tasked with creating a report on an Amazon RDS instance to support a customer loyalty analysis. The database contains two tables: Customers (listing all registered customers) and Orders (recording customer purchases). The report should retrieve all customers’ information along with their order details if they have placed any orders. Which SQL query ensures that every customer is included, and orders are shown only if they exist?
a) INNER JOIN on Customers.CustomerID = Orders.CustomerID
b) FULL JOIN on Customers.CustomerID = Orders.CustomerID
c) LEFT JOIN on Customers.CustomerID = Orders.CustomerID
d) RIGHT JOIN on Customers.CustomerID = Orders.CustomerID
c) LEFT JOIN
A LEFT JOIN returns every row from Customers along with any matching rows from Orders; if a customer has no orders, the order columns are returned as NULL. This meets the requirement to include all customers while showing order details only where they exist.
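A minimal sketch of the winning query, assuming a few illustrative columns (Name, OrderID, OrderDate) beyond the CustomerID named in the options:

    SELECT c.CustomerID,
           c.Name,        -- assumed customer column
           o.OrderID,     -- assumed order columns
           o.OrderDate
    FROM Customers c
    LEFT JOIN Orders o
           ON c.CustomerID = o.CustomerID;
    -- Customers with no orders still appear, with NULL in the order columns.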
Question 2:
A Data Engineer discovers a critical bug in the production data transformation pipeline managed in a Git repository on the master
branch. To comply with the company’s policy of using feature branches for bug fixes and enhancements, which sequence of Git commands should the engineer use to set up their development environment to address this issue?
a. git clone followed by git checkout -b
b. git clone followed by git branch -b
c. git pull followed by git branch -b
d. git fetch followed by git switch -b
a)
git clone followed by git checkout -b
git clone makes a local copy of the specified repository. Following this, git checkout -b creates a new branch and switches to it immediately. This sequence is appropriate for starting work on a new branch, such as a hotfix branch, directly after cloning the repository.
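In practice the sequence looks like this (repository URL and branch name are hypothetical):

    git clone https://example.com/acme/data-pipeline.git   # local copy of the repository
    cd data-pipeline
    git checkout -b hotfix/transformation-bug              # create the fix branch and switch to it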
A Data Engineer is using AWS Redshift to analyze sales data from a retail company. The table, Monthly_Sales, includes columns for SaleID, ProductCategory, SaleMonth, and SaleAmount. The task is to generate a monthly sales report, where each product category (Electronics, Clothing, Furniture) is displayed as a column header and each row represents a month with the total sales for that category. Which SQL operation should be used in AWS Redshift to efficiently generate the report according to the requirement?
PIVOT
The PIVOT operation is designed to transform rows into columns, which is precisely what is required here. It enables the aggregation of SaleAmount for each ProductCategory into separate columns, with each row grouped by SaleMonth. This efficiently organizes the sales data into a clear, report-friendly format. Although GROUP BY with CASE expressions can produce the same result, it is far less concise than PIVOT.
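A hedged sketch of the Redshift query, using the columns and categories stated above:

    SELECT *
    FROM (SELECT SaleMonth, ProductCategory, SaleAmount
          FROM Monthly_Sales)
    PIVOT (SUM(SaleAmount)
           FOR ProductCategory IN ('Electronics', 'Clothing', 'Furniture'))
    ORDER BY SaleMonth;
    -- Each row is a month; Electronics, Clothing, and Furniture become columns of summed sales.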
A Data Engineer at a healthcare organization is tasked with analyzing patient satisfaction across various departments to identify areas for improvement. The patient population is diverse, with significant variations in age, treatment types, and outcomes. To ensure that the analysis is comprehensive and accounts for the diverse characteristics of the patient groups, the engineer must choose an appropriate method for sampling the data from the hospital’s patient records database. What technique should the engineer use to accurately reflect the different segments of the patient population in the analysis?
a) Random sampling
b) Systematic sampling
c) Minimax sampling
d) Stratified sampling
Stratified sampling
This technique is ideal for ensuring that all key subgroups within a population are proportionally represented in the sample. In the context of a healthcare organization analyzing patient satisfaction across diverse demographic and treatment categories, stratified sampling allows the engineer to maintain representation across these different segments, thereby providing more accurate and relevant insights.
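A hedged SQL sketch of one way to draw such a sample, assuming a patient_records table with a department column (both names hypothetical) and a fixed quota per stratum:

    SELECT *
    FROM (SELECT p.*,
                 ROW_NUMBER() OVER (PARTITION BY department ORDER BY RANDOM()) AS rn
          FROM patient_records p) sampled
    WHERE rn <= 100;  -- up to 100 randomly chosen patients from each department stratum

A proportional variant would scale each stratum's quota by its share of the overall patient population.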
A data engineering team is using Apache Spark to process a large dataset comprising user activity logs. The dataset is distributed across multiple nodes in a Spark cluster. During the processing, the team notices that some tasks are taking significantly longer to complete than others, causing delays in the overall processing time. Additionally, they observe that certain nodes in the cluster are consistently under heavier load compared to others. Which of the following is the most likely cause of this behavior?
a) The network bandwidth between the nodes is insufficient
b) The data is configured with a low number of partitions
c) There is a data skew in the input dataset
d) The cluster nodes have different hardware capacities
c)
Data skew occurs when the data is not evenly distributed among the partitions, leading to some nodes processing significantly more data than others. This results in those nodes having higher workloads and taking longer to complete tasks. Recognizing data skew is critical because it can severely affect the performance and scalability of a Spark application by causing bottlenecks at certain nodes.
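One quick diagnostic, sketched in Spark SQL and assuming the logs are registered as a table named user_activity keyed by user_id (both names hypothetical):

    -- Count rows per key; a handful of keys dominating the totals indicates skew.
    SELECT user_id, COUNT(*) AS events
    FROM user_activity
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 20;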
A Data Engineer is tasked with developing a scalable data processing solution using AWS services to analyze game participation data. The solution needs to leverage data stored in three PostgreSQL tables within an Amazon RDS instance, as depicted in the provided ERD. The games table includes fields for id, name, and time, and the players table includes fields for id and name, with the table games_players linking the players to the games they participated in. Each player is allowed to participate once in a given game.

public.games_players: game_id, player_id
public.games: id, name, time
public.players: id, name

Based on the ER diagram, which constraint on the public.games_players table is correct?
a) CONSTRAINT fk_game FOREIGN KEY (player_id) REFERENCES public.games(id).
b) CONSTRAINT fk_game FOREIGN KEY (game_id) REFERENCES public.games(id).
c) CONSTRAINT fk_game FOREIGN KEY (id) REFERENCES public.games(id).
d) CONSTRAINT fk_game FOREIGN KEY (game_id) REFERENCES public.games(name).
b) This constraint ensures that the game_id column in the public.games_players table references the id column in the public.games table. This relationship aligns with the ER diagram, where games_players serves as a linking table between games and players. The foreign key should point to the primary key of the referenced table (public.games), which is id. Therefore, the correct constraint references game_id and public.games(id), maintaining data integrity and adhering to the ER diagram's structure.
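A fuller sketch of the linking table's DDL, assuming integer keys; the composite primary key (its name is illustrative, as is fk_player) enforces the one-participation-per-game rule:

    CREATE TABLE public.games_players (
        game_id   integer NOT NULL,
        player_id integer NOT NULL,
        -- one row per (game, player): each player participates at most once per game
        CONSTRAINT pk_games_players PRIMARY KEY (game_id, player_id),
        CONSTRAINT fk_game   FOREIGN KEY (game_id)   REFERENCES public.games (id),
        CONSTRAINT fk_player FOREIGN KEY (player_id) REFERENCES public.players (id)
    );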
A product owner at a retail company intends to disable an existing data pipeline that aggregates sales data across various departments. Before proceeding, the product owner wants to understand the impact this action will have on downstream processes and reports that rely on this data. To assist in this analysis, which approach should the data engineering team implement?
a) Introduce an additional layer of data validation
b) Enhance the data transformation logic
c) Create a backup of the existing datasets
d) Implement data lineage throughout the pipeline
d)
Implementing data lineage provides a clear visualization of where data originates, how it moves through various processes, and where it is utilized. This visibility is crucial for assessing the impact of disabling a pipeline, as it helps identify all downstream processes, reports, and systems that depend on the data provided by the pipeline. Data lineage enables the team to make informed decisions by highlighting potential disruptions and dependencies.
A product owner at a financial analytics company is looking to reduce storage costs and enhance the performance of SQL queries on their large datasets of transaction records. The datasets are currently stored in a traditional row-oriented format, which has led to increased storage needs and slower query response times. To address these concerns, which approach should the data engineering team take?
a) Migrate the data to a more efficient relational database
b) Archive older data to cold storage
c) Convert the datasets to a columnar storage format
d) Implement data de-duplication across the datasets
c)
Converting the data to a columnar storage format, such as Parquet, aligns perfectly with the goals of reducing storage costs and boosting query performance. Columnar formats store data by columns rather than rows, making them ideal for analytics and complex queries as only the necessary columns need to be read and processed. This leads to faster retrieval times and significant reductions in storage space, especially when dealing with large datasets.
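As one concrete path, assuming the transactions already live in Amazon Redshift (table name, bucket, and IAM role below are hypothetical), Redshift can export directly to Parquet:

    UNLOAD ('SELECT * FROM transactions')
    TO 's3://example-bucket/transactions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-unload-role'
    FORMAT AS PARQUET;  -- writes columnar Parquet files to S3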
An insurance company is planning to launch a new product that will utilize diverse data sources, including data from transactional systems, customer emails, and weblogs. To support analytics and machine learning models that will help tailor and optimize this product, which data storage solution should the data engineering team choose?
a) Store all logs, emails, and transactional data in a relational database
b) Store all logs, emails, and transactional data in a data warehouse
c) Store all logs, emails, and transactional data in an in-memory database
d) Store all logs, emails, and transactional data in a data lake
d)
A data lake can store vast amounts of raw data in its native format, including structured, unstructured, and semi-structured data. This versatility makes data lakes ideal for supporting analytics and machine learning applications that require diverse datasets.
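As a hedged illustration of querying such raw data in place, an Amazon Athena external table over weblog JSON in S3 (bucket path and column names are assumptions):

    CREATE EXTERNAL TABLE weblogs (
        user_id    string,
        url        string,
        event_time string
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'  -- parse the raw JSON as stored
    LOCATION 's3://example-datalake/weblogs/';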
A data engineer at a digital marketing firm is tasked with integrating various data sources for advanced analytics. The company collects data from social media interactions, website logs, and customer feedback surveys. Given the nature of data originating from multiple sources, which of the following describes the data correctly?
a) Structured data
b) Semi-structured data
c) Unstructured data
d) Free-form data
b) The data collected from social media interactions, website logs, and customer feedback surveys typically includes a mix of formats like text, metadata, and possibly JSON or XML, classifying it as semi-structured. This type incorporates elements of both structured and unstructured data: it does not strictly adhere to a rigid schema, yet it still carries some organizational properties.