Intermediate Level Flashcards
Category: Real-World Troubleshooting
You are running a DBT model, but it fails with a duplicate column error. What is the most likely cause?
A. The ref() function is referencing the same table multiple times.
B. The model query selects the same column name multiple times without aliasing.
C. The unique_key configuration is missing in the model.
D. The model materialization is set to ephemeral.
Correct Answer: B. The model query selects the same column name multiple times without aliasing. 🎉
Explanation:
A duplicate column error occurs in DBT when the query tries to select multiple columns with the same name in the same result set. This can happen if you include the same column multiple times without assigning a unique alias.
Example of the Issue:
Problem Query:

```sql
SELECT id, id, name
FROM {{ ref('customers') }}
```

- This will cause a duplicate column error because the id column appears twice without a unique alias.

Fixed Query: To resolve the issue, assign a unique alias to each duplicate column:

```sql
SELECT id AS customer_id, id AS order_id, name
FROM {{ ref('customers') }}
```

- Now each id column has a unique name (customer_id, order_id), avoiding the error.
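The same error often surfaces when joining two models that share a column name and selecting both without aliases. A hedged sketch (table and column names are hypothetical):

```sql
-- Both models expose a column named "id"; selecting both without
-- aliases puts two "id" columns in the result set.
SELECT customers.id, orders.id, orders.amount
FROM {{ ref('customers') }} AS customers
JOIN {{ ref('orders') }} AS orders
  ON orders.customer_id = customers.id

-- Fix: alias the conflicting columns explicitly.
SELECT
    customers.id AS customer_id,
    orders.id    AS order_id,
    orders.amount
FROM {{ ref('customers') }} AS customers
JOIN {{ ref('orders') }} AS orders
  ON orders.customer_id = customers.id
```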
Category: Jinja and Macros
What is the purpose of the is_incremental() function in DBT?
A. To check if a table has a unique_key.
B. To determine whether the model is running in incremental mode.
C. To enable testing of unique constraints in incremental models.
D. To validate schema changes during an incremental run.
Correct Answer: B. To determine whether the model is running in incremental mode. 🎉
Explanation:
The is_incremental() function in DBT is used to check whether the model is running in incremental mode. This is especially helpful when defining logic for models that are materialized as incremental.
When is_incremental() evaluates to True, the model is running incrementally.
If False, the model is running as a full refresh or the first-time build.
Example Use Case:
An incremental model that processes only new or updated rows:
```sql
{{ config(materialized='incremental') }}

SELECT id, name, updated_at
FROM {{ source('raw_data', 'customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```
- On an initial run, the entire query is executed because is_incremental() is False.
- On subsequent runs, only rows with updated_at greater than the last processed timestamp are selected because is_incremental() is True.
Why the Other Options Are Incorrect:
A (To check if a table has a unique_key): unique_key is configured explicitly for incremental models, but is_incremental() does not check this.
C (To enable testing of unique constraints in incremental models): Testing is handled by dbt test, not is_incremental().
D (To validate schema changes during an incremental run): Schema changes are controlled by the on_schema_change configuration, not is_incremental().
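As noted for option C, uniqueness is checked by dbt test via a schema file, not by is_incremental(). A minimal sketch of such a test definition (model and column names hypothetical):

```yaml
# schema.yml: a unique test that dbt test will run against the model
models:
  - name: customers
    columns:
      - name: id
        tests:
          - unique
```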
Category: Dependency Management (ref() and DAG)
What does the Directed Acyclic Graph (DAG) in DBT represent?
A. The order in which sources, models, and tests are executed.
B. A list of all materialized tables in the project.
C. A lineage diagram of SQL queries only in the models directory.
D. The list of all seeds and snapshots in the project.
Correct Answer: A. The order in which sources, models, and tests are executed. 🎉
Explanation:
The Directed Acyclic Graph (DAG) in DBT represents the execution order of sources, models, and tests based on their dependencies. It is constructed using ref() and source() functions, which define relationships between resources in your DBT project.
Key Details About the DAG:
* Directed: Edges point from each node (e.g., a source or model) to the downstream resources that depend on it.
* Acyclic: The graph cannot have circular dependencies (e.g., Model A depends on Model B, and Model B depends on Model A).
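These edges come directly from ref() and source() calls in model SQL. For example, the ref() below creates a DAG edge from stg_customers to this model (model names are hypothetical):

```sql
-- dim_customers.sql: the ref() call makes dim_customers a
-- downstream node of stg_customers in the DAG.
SELECT id, name
FROM {{ ref('stg_customers') }}
```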
Why the Other Options Are Incorrect:
B (A list of all materialized tables in the project): The DAG includes sources, tests, and ephemeral models, not just materialized tables.
C (A lineage diagram of SQL queries only in the models directory): The DAG includes sources, seeds, and snapshots as well, not just models.
D (The list of all seeds and snapshots in the project): Seeds and snapshots are part of the DAG, but it also includes sources, models, and tests.
Category: Advanced Configurations (Hooks and Packages)
What is the purpose of pre- and post-hooks in DBT?
A. To customize the behavior of DBT models before and after their execution.
B. To enforce primary key constraints on incremental models.
C. To schedule snapshots and seeds automatically.
D. To define dependencies between models and sources in the DAG.
Correct Answer: A. To customize the behavior of DBT models before and after their execution. 🎉
Explanation:
Pre-hooks and Post-hooks in DBT allow you to execute custom SQL statements before or after a model is run. They are primarily used for customizations such as logging, auditing, or performing operations like granting permissions.
How Pre- and Post-Hooks Work:
Pre-Hook: Runs before the model’s SQL logic is executed.
Post-Hook: Runs after the model’s SQL logic is executed and materialized.
Defining Hooks in DBT:
You can configure hooks at two levels:
Per Model:

```sql
{{ config(
    materialized='table',
    pre_hook="INSERT INTO audit_log (event, model_name) VALUES ('start', 'my_model')",
    post_hook="INSERT INTO audit_log (event, model_name) VALUES ('end', 'my_model')"
) }}

SELECT id, name, created_at
FROM {{ source('raw', 'customers') }}
```

Globally (in dbt_project.yml):

```yaml
models:
  pre-hook:
    - "BEGIN TRANSACTION"
  post-hook:
    - "COMMIT"
```
Common Use Cases for Hooks:
* Auditing: Log start/end times or status of model runs in an audit table.
* Permissions: Grant table/view access to specific users after creation.
* Database-Specific Commands: Run custom SQL statements like vacuuming, indexing, or statistics collection.
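As a hedged sketch of the permissions use case: a post-hook can grant access to the freshly built relation, since {{ this }} resolves to the model's database identifier (the role name here is hypothetical):

```sql
{{ config(
    materialized='table',
    post_hook="GRANT SELECT ON {{ this }} TO reporting_role"
) }}

SELECT id, name
FROM {{ ref('stg_customers') }}
```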
Why the Other Options Are Incorrect:
B (To enforce primary key constraints on incremental models): DBT does not enforce database constraints; it manages data logic.
C (To schedule snapshots and seeds automatically): Scheduling is handled externally (e.g., via Airflow or DBT Cloud), not by hooks.
D (To define dependencies between models and sources in the DAG): Dependencies are defined using ref() and source(), not hooks.
What are two functions of a marts model in dbt?
A. Reference upstream sources using the source macro
B. Perform light transformations on the raw data set
C. Apply business logic for stakeholders
D. Reference upstream models using the ref macro
Analysis of Options:
A. Reference upstream sources using the source macro: This is incorrect. Marts models typically do not directly reference raw sources using the source macro. Instead, they rely on upstream models.
B. Perform light transformations on the raw data set: This is incorrect. Light transformations are typically done in staging models, not marts models.
C. Apply business logic for stakeholders: This is correct. Marts models are specifically designed to include business logic that aligns with stakeholder requirements, such as creating metrics or KPIs.
D. Reference upstream models using the ref macro: This is correct. Marts models use the ref macro to pull data from upstream models, including staging or intermediate models.
Correct Answers: C and D
- Apply business logic for stakeholders.
- Reference upstream models using the ref macro.
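Putting both answers together, a minimal marts model sketch (staging model and column names are hypothetical): it references upstream staging models with ref() and applies business logic such as a revenue metric.

```sql
-- fct_customer_revenue.sql: refs upstream staging models (answer D)
-- and applies business logic for stakeholders (answer C).
SELECT
    c.customer_id,
    c.customer_name,
    SUM(o.amount)     AS lifetime_revenue,  -- business metric
    COUNT(o.order_id) AS order_count
FROM {{ ref('stg_customers') }} AS c
LEFT JOIN {{ ref('stg_orders') }} AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name
```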