Advanced Level Flashcards
Category: Real-World Scenarios for Pipeline Reliability
You notice that a critical source table has not been updated as expected. Which DBT feature would you use to monitor its freshness?
A. DBT tests for null and unique constraints.
B. The dbt snapshot command to track changes.
C. The dbt source freshness command.
D. Incremental materialization to capture updates efficiently.
Correct Answer: C. The dbt source freshness command. 🎉
Explanation:
The dbt source freshness command is specifically designed to monitor the freshness of source tables. It checks whether the data in a source table has been updated recently based on a timestamp column (e.g., updated_at).
How to Monitor Freshness:
Define Freshness Criteria in schema.yml:
yaml
~~~
version: 2

sources:
  - name: raw_data
    schema: raw
    tables:
      - name: customers
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
~~~
- loaded_at_field: The timestamp column that tracks when the table was last updated.
- warn_after: Issues a warning if data is older than the specified threshold.
- error_after: Throws an error if data is stale beyond the threshold.
Run the Freshness Check:
bash
~~~
dbt source freshness
~~~
Output Example:
~~~
Source   | Table     | Max loaded_at       | Status
---------|-----------|---------------------|--------
raw_data | customers | 2024-06-16 08:00:00 | WARN
~~~
Why the Other Options Are Incorrect:
A (DBT tests for null and unique constraints): Tests validate data quality (e.g., uniqueness, not null) but do not monitor freshness.
B (The dbt snapshot command to track changes): Snapshots track historical changes to data but do not check its freshness.
D (Incremental materialization to capture updates efficiently): Incremental models optimize processing of updates but do not monitor when the source data was last updated.
Category: Incremental Logic and Schema Management
A new column, customer_status, is added to an upstream table, causing your incremental model to fail due to a schema mismatch. What configuration resolves this issue without dropping the target table?
A. Set on_schema_change='sync_all_columns' in the model configuration.
B. Add the customer_status column to the unique_key configuration.
C. Run dbt clean to clear compiled artifacts and rerun the model.
D. Add a post-hook to manually alter the table schema after the model runs.
Correct Answer: A. Set on_schema_change='sync_all_columns' in the model configuration. 🎉
Explanation:
In DBT, when a new column is added to an upstream source, incremental models may fail if the target table schema does not match the source table schema. The on_schema_change configuration tells DBT how to handle these schema changes.
Key Options for on_schema_change:
- ignore (default): DBT ignores schema changes and processes only the columns already defined in the target table. This can cause issues if queries depend on the new column.
- append_new_columns: DBT adds new columns from the source to the target table but does not remove columns. Useful for additive schema changes.
- sync_all_columns (correct answer): DBT syncs the target table's schema with the model's query, adding new columns and dropping unused ones.
How to Apply This Configuration:
Update your incremental model configuration as follows:
sql
~~~
{{ config(
    materialized='incremental',
    unique_key='id',
    on_schema_change='sync_all_columns'
) }}

SELECT id, name, customer_status, updated_at
FROM {{ source('raw_data', 'customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
~~~
- on_schema_change='sync_all_columns' ensures that the new customer_status column is added to the target table schema automatically.
Why the Other Options Are Incorrect:
B (Add the customer_status column to the unique_key configuration): The unique_key identifies rows for upserts in incremental models; it does not resolve schema mismatches.
C (Run dbt clean to clear compiled artifacts and rerun the model): dbt clean removes compiled files but does not handle schema changes.
D (Add a post-hook to manually alter the table schema): While possible, this is inefficient and does not align with DBT’s automated schema management.
Category: Advanced Configurations - Packages
What does the dbt deps command do?
A. Installs packages defined in packages.yml.
B. Updates all models in the target schema.
C. Validates the database connection before running models.
D. Removes stale compiled artifacts from the project.
Correct Answer: A. Installs packages defined in packages.yml. 🎉
Explanation:
The dbt deps command installs all DBT packages specified in the packages.yml file. These packages often include reusable macros, models, or utilities that extend the functionality of your DBT project.
How dbt deps Works:
- Add packages to the packages.yml file:
yaml
~~~
packages:
  - package: dbt-labs/dbt_utils
    version: 1.0.0
  - package: dbt-labs/codegen
    version: 0.9.0
~~~
- Run the dbt deps command from the project root.
- DBT installs the packages into the dbt_packages/ directory, making them available for use in your project.
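For example:
bash
~~~
dbt deps
~~~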
Example:
- Using the dbt_utils package, which contains macros for common SQL patterns:
- Call a Macro from the Installed Package:
sql
~~~
SELECT
    {{ dbt_utils.surrogate_key(['id', 'email']) }} AS customer_key,  -- hashed key built from id and email
    *
FROM {{ source('raw_data', 'customers') }}
~~~
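Note: in dbt_utils 1.0 and later the macro has been renamed, so if the call above raises an error with the version pinned in packages.yml, the equivalent call would be:
sql
~~~
SELECT
    {{ dbt_utils.generate_surrogate_key(['id', 'email']) }} AS customer_key,
    *
FROM {{ source('raw_data', 'customers') }}
~~~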
Why the Other Options Are Incorrect:
B (Updates all models in the target schema): This is done by the dbt run command.
C (Validates the database connection before running models): This is handled by the dbt debug command.
D (Removes stale compiled artifacts from the project): This is done by the dbt clean command.
Category: Real-World Scenarios for Pipeline Reliability
You need to add custom logic to run after every model execution in your project. What feature should you use?
A. Model-specific pre-hooks in schema.yml.
B. Global post-hooks in dbt_project.yml.
C. A custom macro executed with dbt run-operation.
D. A separate DAG defined for custom tasks.
Correct Answer: B. Global post-hooks in dbt_project.yml. 🎉
Explanation:
Global post-hooks in dbt_project.yml allow you to define SQL logic that will automatically execute after every model run across your project. This is ideal for tasks such as logging, granting permissions, or maintaining audit tables.
How to Define Global Post-Hooks:
- Edit your dbt_project.yml file to include a global post-hook:
yaml
~~~
models:
  +post-hook:
    - "INSERT INTO audit_log (model_name, run_time) VALUES ('{{ this.name }}', CURRENT_TIMESTAMP)"
    - "GRANT SELECT ON {{ this }} TO analytics_team"
~~~
- Every time a model is run, DBT will execute the SQL statements defined in the post-hook after the model finishes execution.
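Note that the INSERT above assumes an audit_log table already exists in the warehouse; DBT does not create it for you. A minimal sketch of such a table, using the name and columns from the example (adjust the types to your warehouse):
sql
~~~
-- Hypothetical audit table assumed by the post-hook example above;
-- create it once in the target schema before running dbt.
CREATE TABLE IF NOT EXISTS audit_log (
    model_name VARCHAR,
    run_time   TIMESTAMP
);
~~~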
Use Cases for Global Post-Hooks:
- Auditing: Log metadata (e.g., model name, run time, user) after execution.
- Permissions: Automatically grant or revoke database permissions on newly created tables/views.
- Database Maintenance: Run operations like vacuuming or indexing after materialization.
Why the Other Options Are Incorrect:
A (Model-specific pre-hooks in schema.yml): Pre-hooks apply to individual models and run before the model execution, not after.
C (A custom macro executed with dbt run-operation): Macros are for one-off, ad hoc tasks and don’t automatically execute after every model.
D (A separate DAG defined for custom tasks): DBT builds its DAG from model dependencies; you do not define separate DAGs for custom tasks, and the DAG is not a mechanism for attaching post-execution logic.
Category: Advanced Configurations - Snapshots
What is the primary purpose of DBT snapshots?
A. To track historical changes in source data over time.
B. To create temporary tables for transformations.
C. To update incremental models automatically.
D. To monitor data freshness in source tables.
Correct Answer: A. To track historical changes in source data over time. 🎉
Explanation:
The primary purpose of DBT snapshots is to track and capture historical changes in source data over time. This is especially useful for Slowly Changing Dimensions (SCDs) or when you need to maintain a versioned history of changes in your data.
How Snapshots Work in DBT:
Define a Snapshot: Snapshots are defined in .sql files using the snapshot macro:
sql
~~~
{% snapshot customer_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

SELECT id, name, updated_at
FROM {{ source('raw_data', 'customers') }}

{% endsnapshot %}
~~~
- unique_key: Specifies the unique identifier for each row.
- strategy='timestamp': Tracks changes using the updated_at column.
Run the Snapshot: Use the dbt snapshot command to capture data changes over time:
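bash
~~~
dbt snapshot
~~~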
Result:
DBT stores the historical versions of rows, recording:
- dbt_valid_from: When the row became valid.
- dbt_valid_to: When the row was replaced (or NULL for current rows).
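For example, a downstream model (or an ad hoc query) can read only the current version of each customer from the snapshot:
sql
~~~
SELECT id, name, updated_at
FROM {{ ref('customer_snapshot') }}
WHERE dbt_valid_to IS NULL  -- NULL marks the currently valid version of each row
~~~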
Why the Other Options Are Incorrect:
B (To create temporary tables for transformations): Temporary tables are created with ephemeral materialization, not snapshots.
C (To update incremental models automatically): Incremental models handle updates using materialized tables, not snapshots.
D (To monitor data freshness in source tables): Data freshness is checked using the dbt source freshness command.
Category: Real-World Troubleshooting
A DBT model is running much slower than expected. Upon investigation, you notice that it is processing the entire table instead of only new or updated rows. What is the most likely cause?
A. The unique_key is not defined in the model configuration.
B. The model does not include incremental logic with is_incremental().
C. The model is materialized as table instead of incremental.
D. The dbt seed command was run before executing the model.
Correct Answer: C. The model is materialized as table instead of incremental. 🎉
Explanation:
When a DBT model is materialized as table, the entire table is rebuilt during every run, regardless of whether there are new or updated rows. This causes significant performance issues, especially for large datasets.
To process only new or updated rows, the model must be materialized as incremental and include incremental logic.
How Incremental Materialization Resolves This Issue:
- Set Materialization to incremental:
sql
~~~
{{ config(materialized='incremental') }}

SELECT id, name, updated_at
FROM {{ source('raw_data', 'customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
~~~
- Initial Run: DBT builds the entire table because is_incremental() evaluates to False.
- Subsequent Runs: Only rows where updated_at is newer than the last processed value are selected because is_incremental() evaluates to True.
Why the Other Options Are Incorrect:
A (The unique_key is not defined in the model configuration): While unique_key ensures rows with the same key are updated rather than duplicated, its absence does not force full table processing.
B (The model does not include incremental logic with is_incremental()): Incremental logic (is_incremental()) is essential, but the root cause here is the wrong materialization (table instead of incremental).
D (The dbt seed command was run before executing the model): Running dbt seed does not impact how models are materialized or executed.
Category: Jinja and Macros
What is the primary use of the dbt_utils.star() macro?
A. To reference all columns in a source table, except specific excluded ones.
B. To create a surrogate key for tables with multiple unique columns.
C. To generate a list of all models in the DBT project.
D. To add metadata to models during execution.
Correct Answer: A. To reference all columns in a source table, except specific excluded ones. 🎉
Explanation:
The dbt_utils.star() macro is used to dynamically reference all columns in a table or model, with the option to exclude specific columns. This is particularly useful when you want to select most columns but omit a few without explicitly listing all column names.
How dbt_utils.star() Works:
sql
~~~
{{ dbt_utils.star(from=ref('your_model'), except=["column_to_exclude"]) }}
~~~
- Automatically selects all columns from the specified table or model.
- Excludes columns listed in the except parameter.
Why This is Useful:
- Dynamic Column Selection: No need to explicitly list all columns when the schema changes frequently.
- Exclusion Logic: Allows easy exclusion of sensitive or unnecessary columns.
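For instance, a model could select every column from the customers source except the email address (the excluded column is illustrative):
sql
~~~
SELECT
    {{ dbt_utils.star(from=source('raw_data', 'customers'), except=['email']) }}
FROM {{ source('raw_data', 'customers') }}
~~~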
Why the Other Options Are Incorrect:
B (To create a surrogate key for tables with multiple unique columns): Surrogate keys can be created using the dbt_utils.surrogate_key() macro, not star().
C (To generate a list of all models in the DBT project): Generating model lists is done through CLI commands like dbt ls, not with star().
D (To add metadata to models during execution): Metadata addition is handled in schema.yml or custom configurations, not by the star() macro.
Category: Incremental Logic and Real-World Scenarios
In an incremental model, what happens if you run the model with the --full-refresh flag?
A. Only new or updated rows are processed.
B. The entire table is rebuilt from scratch.
C. Existing rows are updated without inserting new rows.
D. Rows are appended to the table without duplicates.
Correct Answer: B. The entire table is rebuilt from scratch. 🎉
Explanation:
The --full-refresh flag forces DBT to rebuild the entire table for an incremental model. This means all rows in the model's query are reprocessed, and the target table is dropped and recreated.
Key Behavior with --full-refresh:
Without --full-refresh:
- Incremental logic (e.g., is_incremental() conditions) ensures that only new or updated rows are processed.
With --full-refresh:
- DBT ignores incremental logic and rebuilds the entire table from scratch.
- The target table is dropped and recreated before inserting all rows.
When to Use --full-refresh:
- Schema Changes: When the schema of the target table changes, and you need to rebuild the table.
- Data Fixes: To correct issues with historical data in the target table.
- Initial Run: When running an incremental model for the first time.
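For example, to rebuild a single incremental model from scratch (the model name is illustrative):
bash
~~~
dbt run --select customer_orders --full-refresh
~~~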
Why the Other Options Are Incorrect:
A (Only new or updated rows are processed): This is the default behavior of incremental models without the --full-refresh flag.
C (Existing rows are updated without inserting new rows): This describes merge functionality with unique_key, not a full refresh.
D (Rows are appended to the table without duplicates): Appending rows is incremental behavior, not full-refresh behavior.
Category: Real-World Scenarios for Deployment
You want to ensure that DBT runs only a specific set of models and their dependencies. Which CLI option would you use?
A. --full-refresh
B. --select
C. --exclude
D. --test
Correct Answer: B. --select 🎉
Explanation:
The --select CLI option allows you to specify a particular set of models and their dependencies to run in DBT. This is helpful when you want to execute only a subset of your project instead of all models.
How --select Works:
- Basic Selection: Run a single model by its name: dbt run --select model_name
- Dependencies: Run a model together with its upstream or downstream graph: dbt run --select model_name+
  - model_name+: Includes the model and all downstream models (its dependents).
  - +model_name: Includes the model and all upstream models (the sources and models feeding into it).
- Wildcards: Select all models in a folder: dbt run --select path/to/folder/*
- Multiple Models: Run several models by separating them with spaces: dbt run --select model1 model2
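Graph operators can also be combined; for example, to run a model along with everything upstream and downstream of it (the model name is illustrative):
bash
~~~
dbt run --select +customer_orders+
~~~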
Why the Other Options Are Incorrect:
A (--full-refresh): Forces incremental models to be rebuilt from scratch; it does not select which models run.
C (--exclude): Excludes specific models from a run but does not directly select the models to run.
D (--test): Not a model-selection option; tests are run with the separate dbt test command.