Basics Flashcards
What is the default materialization used by DBT if you don’t specify one?
A. Table
B. View
C. Incremental
D. Ephemeral
B. View
Explanation:
In DBT, if no materialization is specified in the config block of a model, the default behavior is to create a view.
Materialization determines how DBT creates and persists a model in the database.
View: A logical query stored in the database that gets executed when queried.
Table: A physical table stored in the database with persistent data.
Incremental: A table that updates or appends data in smaller batches.
Ephemeral: A temporary result that exists only in the context of a single query.
Key Takeaway
View is the default materialization in DBT because it is lightweight and does not require extra storage.
Question 2:
What is the purpose of the on_schema_change configuration in DBT?
A. To automatically rebuild a model if the schema of upstream tables changes.
B. To allow DBT to update or replace target table columns when the schema changes.
C. To enforce strict validation rules for schema consistency.
D. To control whether views or ephemeral models are rebuilt on schema changes.
Correct Answer: B. To allow DBT to update or replace target table columns when the schema changes. 🎉
Explanation:
The on_schema_change configuration in DBT is used in incremental models to handle changes in the schema of the target table. It determines how DBT reacts when columns in the upstream source or model change (e.g., new columns are added).
Options for on_schema_change
- ignore: DBT will ignore schema changes. This is the default behavior.
- append_new_columns: DBT will add new columns to the target table without deleting or replacing the existing ones.
- fail: DBT will throw an error if there are schema changes, stopping the execution.
Key Takeaway:
The on_schema_change setting is critical for incremental models where upstream schema changes might happen. It enables DBT to adapt to those changes without manual intervention.
Example Usage:
{{ config(
materialized=’incremental’,
on_schema_change=’append_new_columns’
) }}
SELECT *
FROM my_source_table
In this case, if new columns are added to my_source_table, DBT will add those new columns to the target table when the model runs.
What does the dbt seed command do?
A. Loads CSV files into the database as tables.
B. Generates SQL scripts for data models.
C. Runs tests on the source data to ensure data quality.
D. Creates ephemeral models from raw data files.
Correct Answer: A. Loads CSV files into the database as tables.
Explanation:
The dbt seed command is used to load CSV files into the database as tables. These tables can then be referenced like any other model in DBT.
How dbt seed Works:
- Place your CSV files in the data directory of your DBT project.
Example project structure:
my_dbt_project/
├── models/
├── data/
│ ├── customers.csv
│ ├── products.csv
- Run the dbt seed command:
dbt seed
- DBT reads the CSV files and creates database tables using the data in the CSV files.
Key Takeaway:
The dbt seed command simplifies loading static CSV data into your database, making it accessible for models and transformations.
What is the purpose of the dbt snapshot command?
A. To create ephemeral models for temporary data storage.
B. To capture and track historical changes in source data over time.
C. To test and validate the freshness of source data.
D. To rebuild tables and ensure schema consistency.
Correct Answer: B. To capture and track historical changes in source data over time.
Explanation:
The dbt snapshot command is used to create snapshots of source data that track historical changes over time. This is particularly useful for slowly changing dimensions (SCDs) or auditing data changes.
How DBT Snapshots Work:
- Snapshots capture the state of a table or query at specific points in time.
- When data in the source table changes, DBT detects those changes and records them as new rows in a snapshot table.
- This allows you to analyze historical versions of data and track changes like updates or deletions.
Key Features of Snapshots:
Versioning: Snapshots help you maintain historical records of rows when values change.
Slowly Changing Dimensions: Perfect for scenarios where you need to track data over time, such as:
Customer address changes
Product price updates
What Happens When You Run dbt Snapshot?
- Initial Run: DBT creates a snapshot table and loads the data as-is.
- Subsequent Runs:
DBT compares the current source data to the existing snapshot.
If changes are detected, DBT inserts new records with timestamps to indicate the changes.
Key Takeaway:
Use dbt snapshot to track historical changes in data, enabling auditing and slowly changing dimension (SCD) analysis.
What does the dbt test command do?
A. It checks for syntax errors in the SQL models.
B. It runs assertions on the data to ensure data quality.
C. It validates database connections before running models.
D. It rebuilds all models and refreshes test cases.
The correct answer is B. It runs assertions on the data to ensure data quality.
Explanation:
The dbt test command is used to run data quality tests on your models, sources, and seeds. These tests ensure that your data meets specific criteria, such as:
- Uniqueness: No duplicate values in a column.
- Not null: A column does not contain null values.
- Referential integrity: A foreign key relationship exists between two tables.
- Custom tests: You can write custom SQL-based tests for any business rule.
Example Usage
Adding a Test to a Model:
Define tests in your schema.yml file:
version: 2
models:
- name: customers
columns:
- name: id
tests:
- unique
- not_null
- name: email
tests:
- not_null
Run Tests:
To run all tests in your project:
dbt test
Test Output
Success: The data meets the criteria (e.g., no duplicates, no nulls).
Failure: DBT identifies rows that fail the test and outputs them for review.
Why Not C?
Option C: “It validates database connections before running models” describes the dbt debug command, which checks for database connections and project setup issues.
Key Takeaway
The dbt test command ensures data quality by validating data against specified criteria.
Use schema.yml to define built-in or custom tests.
What is the purpose of the dbt build command?
A. It runs models, tests, and snapshots in the correct order.
B. It only runs incremental models in the project.
C. It rebuilds all tables and refreshes seeds.
D. It validates SQL syntax before running models.
Correct Answer: A. It runs models, tests, and snapshots in the correct order. 🎉
Explanation:
The dbt build command is an all-in-one command introduced in recent versions of DBT. It runs the following tasks in the correct dependency order:
- Models: Executes and materializes models (tables, views, incremental, ephemeral).
- Tests: Runs data quality tests on the results of models and sources.
- Snapshots: Runs snapshots to capture historical data changes.
- Seeds: Loads CSV files into the database.
How dbt build Works
Running the dbt build command in the terminal:
dbt build
DBT determines the dependency graph of your project and ensures that tasks are executed in the correct order.
It combines the behavior of multiple commands like dbt run, dbt test, and dbt snapshot.
Key Benefits of dbt build
- Efficiency: Instead of running multiple commands (dbt run, dbt test, dbt snapshot), you can use one command to execute all tasks.
- Order Management: Ensures tasks like models and snapshots are built before running tests.
- Consistency: Simplifies the build process and ensures that the output is reliable.
Why the Other Options Are Incorrect
B: “It only runs incremental models” — Incorrect. dbt build runs all models, not just incremental ones.
C: “It rebuilds all tables and refreshes seeds” — Partially correct but incomplete. dbt build does more than just rebuild tables and refresh seeds; it also runs tests and snapshots.
D: “It validates SQL syntax before running models” — Incorrect. This describes what happens during SQL compilation or the dbt compile command.
Key Takeaway
The dbt build command is a powerful, all-in-one tool to:
Run models
Execute tests
Build snapshots
Load seeds
…in the correct dependency order.
dbt deps
What is the main purpose of the dbt deps command?
A. It compiles SQL code into runnable queries.
B. It installs packages listed in the packages.yml file.
C. It updates all DBT models in the target schema.
D. It runs incremental models and refreshes seeds.
Correct Answer: B. It installs packages listed in the packages.yml file. 🎉
Explanation:
The dbt deps command is used to install and manage DBT packages specified in your project’s packages.yml file.
What are DBT Packages?
DBT packages are reusable pieces of code that you can include in your DBT project.
They can contain models, macros, tests, and other configurations that enhance your DBT project.
Packages are typically hosted on platforms like dbt Hub or GitHub.
How to Use dbt deps
Add the package(s) to the packages.yml file:
yaml
Copy code
packages:
- package: dbt-labs/dbt_utils
version: 1.0.0
Run the dbt deps command:
bash
Copy code
dbt deps
DBT will download and install the package into the dbt_packages directory in your project.
Key Takeaway
Use dbt deps to install and update DBT packages defined in packages.yml.
Packages help you reuse code and streamline project development.
Why the Other Options Are Incorrect
A: “It compiles SQL code into runnable queries” — This is done by the dbt compile command.
C: “It updates all DBT models in the target schema” — This is achieved by the dbt run command.
D: “It runs incremental models and refreshes seeds” — These tasks are managed by dbt build or specific commands like dbt seed and dbt run.
dbt clean
What does the dbt clean command do?
A. Deletes the target directory and removes compiled artifacts.
B. Removes unused models from the database schema.
C. Drops temporary views created during DBT runs.
D. Refreshes seeds and snapshots to clean outdated data.
The correct answer is A. Deletes the target directory and removes compiled artifacts.
Explanation:
The dbt clean command is used to delete directories like target/ and dbt_packages/ where DBT stores compiled artifacts and package dependencies.
What Does dbt clean Do?
The clean command removes:
* target/ directory: Contains compiled SQL and execution artifacts.
* dbt_packages/ directory: Contains downloaded DBT packages installed using dbt deps.
This command helps to clear up space or reset the project to a clean state.
Why Not C?
Option C: “Drops temporary views created during DBT runs” — Incorrect. DBT does not automatically drop temporary views during dbt clean. Temporary views are handled as part of the database’s behavior.
Key Takeaway:
Use dbt clean to remove compiled artifacts and reset your DBT project to a clean state.
This is useful when preparing for a fresh project run or troubleshooting.
dbt compile
What does the dbt compile command do?
A. Executes models and generates tables/views in the database.
B. Validates the data source configuration.
C. Translates DBT models into raw SQL files without executing them.
D. Runs tests and validates data quality.
Correct Answer: C. Translates DBT models into raw SQL files without executing them.
Explanation:
The dbt compile command translates your DBT models (written in Jinja and SQL) into raw SQL queries that are ready to be executed against your database. However, the queries are not executed—they are simply generated and stored for inspection.
How It Works:
DBT processes all your models and macros.
It resolves all Jinja syntax, including:
* ref() and source() functions.
* Loops, conditionals, and macros.
The result is a set of pure SQL files saved in the target/compiled directory.
Key Benefits:
* Debugging: You can review the compiled SQL before running models.
* Validation: Helps identify Jinja syntax errors or broken references.
* Efficiency: Allows you to generate queries without querying the database.
Why the Other Options Are Incorrect:
A: “Executes models and generates tables/views” — This describes the dbt run command.
B: “Validates the data source configuration” — This is part of dbt debug.
D: “Runs tests and validates data quality” — This describes the dbt test command.
Key Takeaway:
Use dbt compile to generate raw SQL without executing it. This is great for debugging and validating SQL logic.
ref()
What happens if you use the ref() function in a model?
A. DBT creates a temporary table for the referenced model.
B. DBT ensures dependencies are respected and the referenced model is built first.
C. DBT caches the results of the referenced model to improve query performance.
D. DBT automatically materializes the referenced model as an incremental table.
Correct Answer: B. DBT ensures dependencies are respected and the referenced model is built first.
Explanation:
The ref() function in DBT is a critical feature that allows you to reference other models. It ensures that dependencies between models are respected and handled automatically.
When you use ref():
1. DBT determines the order in which models should be built based on their dependencies.
1. The referenced model is built first before the current model runs.
1. The ref() function dynamically resolves to the correct schema and table name based on the target environment (e.g., dev, prod).
How DBT Handles Dependencies:
* DBT builds a Directed Acyclic Graph (DAG) to determine the correct execution order for models.
* This ensures there are no circular dependencies, and all upstream models are completed before downstream models.
Why the Other Options Are Incorrect:
A: “DBT creates a temporary table” — Incorrect. The behavior of the referenced model depends on its materialization (e.g., table, view, incremental).
C: “DBT caches the results” — Incorrect. DBT does not cache results; it dynamically resolves references.
D: “DBT automatically materializes the referenced model as incremental” — Incorrect. Materialization must be explicitly defined in the referenced model.
Key Takeaway:
The ref() function is fundamental for managing model dependencies. It ensures upstream models are built first and resolves schema and table names dynamically.
dbt source freshness
What does the dbt source freshness command do?
A. Checks if source data has been updated recently.
B. Refreshes source tables to match the latest data.
C. Runs tests on source tables to ensure data quality.
D. Deletes outdated rows in source tables.
Correct Answer: A. Checks if source data has been updated recently. 🎉
Explanation:
The dbt source freshness command is used to check how fresh your source data is by comparing the most recent timestamp in the source table with an expected freshness threshold.
How It Works:
1. In your schema.yml file, you define freshness criteria for your sources:
yaml
Copy code
version: 2
sources:
- name: raw
database: analytics
schema: public
tables:
- name: customers
freshness:
error_after: { count: 12, period: hour }
warn_after: { count: 6, period: hour }
loaded_at_field: updated_at
freshness: Defines acceptable time thresholds for source updates:
* error_after: If the data is older than this threshold, the test will fail.
* warn_after: If the data is older than this threshold, it will produce a warning.
loaded_at_field: The column used to determine when the data was last updated (e.g., updated_at).
- Run the freshness test:
bash
Copy code
dbt source freshness
- DBT checks the most recent value in the loaded_at_field and compares it to the current time.
Example Output:
After running the command, DBT provides a status report:
Source freshness:
| Source | Table | Max loaded_at | Status |
|————-|————-|—————-|———|
| raw | customers | 2024-06-16 | WARN |
* If the data is older than the warn threshold but within the error threshold, DBT will report a warning.
* If the data is older than the error threshold, the test fails.
Why the Other Options Are Incorrect:
* B: “Refreshes source tables to match the latest data” — Incorrect. dbt source freshness only checks freshness; it does not refresh or modify the data.
* C: “Runs tests on source tables to ensure data quality” — Incorrect. This describes dbt test for data quality, not freshness.
* D: “Deletes outdated rows in source tables” — Incorrect. DBT does not delete rows; it only tests and builds models.
Key Takeaway:
* Use dbt source freshness to monitor source data freshness and ensure your data pipelines are processing recent updates.
* This is crucial for time-sensitive analytics and reporting.
What is the purpose of the dbt debug command?
A. Compiles models to check for SQL errors.
B. Validates the DBT project setup and database connection.
C. Runs all tests and snapshots in the project.
D. Clears the target directory and refreshes packages.
The correct answer is B. Validates the DBT project setup and database connection.
Explanation:
The dbt debug command is used to validate that your DBT project is correctly configured and that DBT can successfully connect to the database.
What dbt debug Checks:
1. Profiles.yml: Ensures your profiles.yml file is configured correctly.
1. Database Connections: Verifies that DBT can connect to your target database.
1. Environment Setup: Checks the availability of required credentials, permissions, and DBT version.
1. Project Integrity: Confirms that the DBT project structure and required files are in place.
Why the Other Options Are Incorrect:
* A: “Compiles models to check for SQL errors” — This describes the dbt compile command.
* C: “Runs all tests and snapshots in the project” — This describes the dbt test and dbt snapshot commands.
* D: “Clears the target directory and refreshes packages” — This describes the dbt clean and dbt deps commands.
* Key Takeaway:
Use dbt debug when:
* Setting up a new DBT project.
* Troubleshooting issues with database connections.
* Validating the integrity of your project environment.
Which materialization type does not create any persistent objects in the database?
A. Table
B. View
C. Incremental
D. Ephemeral
Correct Answer: D. Ephemeral 🎉
Explanation:
The ephemeral materialization in DBT does not create any persistent objects (tables or views) in the database. Instead, ephemeral models are inlined as subqueries into downstream models where they are referenced.
Advantages of Ephemeral Models:
1. No Storage Overhead: Since ephemeral models don’t create tables or views, they save storage space.
1. Efficient for Small Queries: Best used for lightweight transformations that don’t need to be persisted.
1. Inline Query Execution: The SQL logic is included directly in downstream models for better performance.
Why the Other Options Are Incorrect:
* A (Table): Creates a physical table in the database.
* B (View): Creates a logical view that persists in the database.
* C (Incremental): Creates or updates a physical table in batches.
Key Takeaway:
* Ephemeral models are inlined as subqueries and do not create any persistent database objects.
* Use them for lightweight transformations that don’t need to be stored independently.
What happens when you run dbt run?
A. DBT runs all models and creates tables/views based on their materializations.
B. DBT runs all tests defined in the project.
C. DBT compiles models into raw SQL but does not execute them.
D. DBT refreshes seed data and snapshots.
Correct Answer: A. DBT runs all models and creates tables/views based on their materializations.
Explanation:
The dbt run command executes all models in your project and creates tables, views, or other database objects based on their materializations.
Key Behaviors of dbt run:
1. Executes Models: Runs all .sql files in the models/ directory that are not disabled.
2. Materializations: DBT uses the materialization specified in each model’s configuration to determine how to build it:
* Table: Creates a physical table.
* View: Creates a logical view.
* Incremental: Updates or appends to a table in small batches.
* Ephemeral: Inlines the query into downstream models (no database object created).
Dependencies: Models are executed in the correct order based on the dependency graph (determined by ref() and source relationships).
What Happens:
1. DBT compiles the models (resolving ref() and other Jinja logic).
1. It executes the SQL queries against your target database.
1. DBT creates the appropriate database objects (tables or views) based on the specified materialization.
Why the Other Options Are Incorrect:
* B: “DBT runs all tests” — This is done by the dbt test command.
* C: “DBT compiles models into raw SQL but does not execute them” — This describes the dbt compile command.
* D: “DBT refreshes seed data and snapshots” — Seeds are refreshed using dbt seed, and snapshots are run using dbt snapshot.
Key Takeaway:
dbt run executes your models and creates tables/views in the database based on their materializations.