Real-World Scenarios for Pipeline Reliability and Deployment Flashcards

1
Q

Advanced Scenario

You are using an incremental model to update a large table daily. A new column, customer_status, was added to the upstream table, but your incremental model failed with a schema mismatch error.

Which of the following steps will resolve the issue without dropping the target table?

A. Add on_schema_change='sync_all_columns' to the model configuration.
B. Drop the target table and rerun the incremental model.
C. Use the dbt deps command to refresh package dependencies.
D. Run dbt test to ensure data integrity before rerunning the model.

A

Correct Answer: A. Add on_schema_change='sync_all_columns' to the model configuration.

Explanation:
When using incremental models in dbt, schema mismatches can occur when new columns are added or removed in the upstream source. The on_schema_change configuration allows dbt to handle schema changes gracefully without requiring you to drop or rebuild the table.

Available options for on_schema_change:

ignore (default):

  • dbt ignores schema changes and continues without modifying the target table schema.
  • This can result in missing or misaligned columns in the target table.

append_new_columns:

  • dbt adds new columns to the target table but does not remove existing ones.
  • This is ideal when upstream changes are additive.

sync_all_columns (correct for this scenario):

  • dbt synchronizes the target table schema with the upstream changes: it adds new columns and removes dropped columns.
  • This ensures full schema alignment between source and target.
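
For illustration, a minimal sketch of how this might look in an incremental model (the model, column, and upstream names here are assumptions, not taken from the scenario):

sql

{{
    config(
        materialized='incremental',
        unique_key='order_id',
        on_schema_change='sync_all_columns'
    )
}}

select
    order_id,
    customer_id,
    customer_status,  -- the column newly added upstream
    order_date
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- only process rows newer than what already exists in the target table
  where order_date > (select max(order_date) from {{ this }})
{% endif %}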

2
Q

You just finished developing your first model that directly references and transforms customers source data. Where in your file tree should this model live?

a. models/stg_customers.sql

b. models/staging/stg_customers.sql

c. models/marts/stg_customers.sql

d. models/sources/stg_customers.sql

A

The question asks where your first model that references and transforms customers source data should live in the dbt file tree. This kind of model typically belongs to the staging layer because it involves light transformations and serves as the foundation for more complex transformations.

Analysis of Options:
A. models/stg_customers.sql: This is incorrect. While this follows the naming convention for staging models (stg_), it is not organized within the proper subfolder (e.g., models/staging).

B. models/staging/stg_customers.sql: This is correct. Staging models belong in the models/staging directory according to dbt’s best practices. This folder is designed to house models that connect to raw sources and perform light transformations.

C. models/marts/stg_customers.sql: This is incorrect. Models in the marts directory are meant for final business logic transformations, not initial staging.

D. models/sources/stg_customers.sql: This is incorrect. The sources folder is not typically where models live. Instead, it houses the sources.yml file, which defines raw data sources.

Correct Answer: B. models/staging/stg_customers.sql

This placement aligns with dbt’s standard practice for organizing staging models.

3
Q

You want all models in a folder to be materialized as tables.

Where can you accomplish this?

a. In the model itself

b. In the dbt_project.yml file

c. In the sources.yml file

d. In the models.yml file

A

The question asks where you can configure all models in a folder to be materialized as tables.

Analysis of Options:
A. In the model itself: This is incorrect. While you can specify materialization for individual models in the model file itself (using {{ config(materialized='table') }}), this doesn't apply to all models in a folder.

B. In the dbt_project.yml file: This is correct. You can set folder-level configurations in the dbt_project.yml file. For example, you can define the materialization for all models in a folder by adding:

yaml

models:
  your_project_name:
    folder_name:
      materialized: table

C. In the sources.yml file: This is incorrect. The sources.yml file is used for defining raw data sources, not for configuring model materializations.

D. In the models.yml file: This is incorrect. There is no models.yml file in dbt by default. Model-specific configurations are done in dbt_project.yml or directly within the model files.

Correct Answer: B. In the dbt_project.yml file

This allows you to apply materialization settings at the folder level for all models within that folder.

4
Q

You have just built a dim_customers.sql model that relies on data from the upstream model stg_customers.

How would you reference this model in dim_customers?

a. select * from {{ ref('stg_customers.sql') }}

b. select * from {{ ref('stg_customers') }}

c. select * from {{ source(stg_customers.sql) }}

d. select * from {{ source(stg_customers) }}

A

The question asks how to reference the stg_customers model in the newly created dim_customers model. In dbt, models are referenced using the ref macro, which ensures that dependencies between models are properly handled and enables dbt to manage their execution order.

Analysis of Options:
A. select * from {{ ref('stg_customers.sql') }}: This is incorrect. When using the ref macro, you do not include the file extension (.sql). The ref function references the model name, not the file name.

B. select * from {{ ref('stg_customers') }}: This is correct. The ref macro is used with the model name (without the .sql extension). This is the standard and correct way to reference upstream models in dbt.

C. select * from {{ source(stg_customers.sql) }}: This is incorrect. The source macro is used for referencing raw sources defined in a sources.yml file, not models.

D. select * from {{ source(stg_customers) }}: This is incorrect. Again, the source macro is for referencing raw data sources, not models. The correct approach for models is the ref macro.

Correct Answer: B. select * from {{ ref('stg_customers') }}
This ensures proper referencing of the upstream stg_customers model in the dim_customers model.
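
A minimal sketch of what dim_customers.sql might look like (the column names are assumptions for illustration):

sql

with customers as (

    select * from {{ ref('stg_customers') }}

)

select
    customer_id,
    first_name,
    last_name
from customers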

5
Q

Which command will only materialize dim_customers.sql and its downstream models?

a. dbt run --select dim_customer

b. dbt run --select dim_customers

c. dbt run --select dim_customers+

d. dbt run --select +dim_customer+

A

The question is asking for the command that will materialize dim_customers.sql and its downstream models (those that depend on it).

Breakdown of the Options:
a. dbt run --select dim_customer:

This command has a typo in the model name (dim_customer instead of dim_customers). As a result, it would not work unless there is a separate model with the exact name dim_customer.

b. dbt run --select dim_customers:

This command will only materialize dim_customers.sql itself but not its downstream models.

c. dbt run --select dim_customers+:

This is correct. The + after the model name includes dim_customers.sql and all its downstream models.

d. dbt run --select +dim_customer+:

This command has the same typo as option a (dim_customer instead of dim_customers), so it is incorrect. Additionally, the + before and after would include both upstream and downstream models, which is beyond the scope of the question.

Correct Answer: c. dbt run --select dim_customers+

This command will materialize dim_customers.sql and all models that depend on it (downstream models).
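
For quick reference, the graph operators behave as follows (assuming dim_customers exists in your project):

dbt run --select dim_customers      # only dim_customers
dbt run --select dim_customers+     # dim_customers and everything downstream
dbt run --select +dim_customers     # everything upstream, then dim_customers
dbt run --select +dim_customers+    # upstream, dim_customers, and downstream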

6
Q

Which of the following is a benefit of using subdirectories in your models directory?

a. Subdirectories allow you to configure materializations at the folder level for a collection of models

b. Subdirectories allow you to include multiple dbt projects in a single project

c. Subdirectories allow you to explicitly build dependencies between models based on the naming of the folders

d. Subdirectories will automatically change the schema where a model is built based on the name of the folder the model is located in

A

The question asks for the benefit of using subdirectories in your models directory in a dbt project.

Analysis of Options:
a. Subdirectories allow you to configure materializations at the folder level for a collection of models:
- This is correct. Using subdirectories allows you to apply configurations, such as materializations, to all models within a folder in the dbt_project.yml file. For example:
yaml

models:
  project_name:
    folder_name:
      materialized: table
  • This feature enables logical grouping and consistent configuration of models in a subdirectory.

b. Subdirectories allow you to include multiple dbt projects in a single project:
- This is incorrect. dbt does not support having multiple dbt projects within a single project. Subdirectories are used for organizing models within a single project, not for combining multiple projects.

c. Subdirectories allow you to explicitly build dependencies between models based on the naming of the folders:
- This is incorrect. Dependencies between models are determined by how you reference them using the ref macro, not by folder names. Folder names are irrelevant to dbt when determining dependencies.

d. Subdirectories will automatically change the schema where a model is built based on the name of the folder the model is located in:
- This is incorrect. The schema where models are built is controlled by configurations in the dbt_project.yml file or within the model file, not automatically based on folder names.

Correct Answer: a.

Subdirectories allow you to configure materializations at the folder level for a collection of models
This is the primary benefit of using subdirectories in dbt, allowing for logical organization and consistent configuration of models.

7
Q

You are working in the dbt Cloud IDE. How do you ensure the model you have created is built in your data platform?

a. You must save the model

b. You must use the dbt run command

c. You must commit your changes to your branch

d. You must create a sql file containing a select statement

A

The question asks how to ensure that the model you created in the dbt Cloud IDE is built in your data platform.

Analysis of Options:
a. You must save the model:
Saving the model is necessary to persist your changes in the dbt Cloud IDE, but it alone does not build the model in your data platform.

b. You must use the dbt run command:
This is correct. The dbt run command compiles and executes your SQL code, materializing the model in your data platform. Without running this command, the model won’t be built.

c. You must commit your changes to your branch:
While committing changes is important for version control, it does not directly trigger the build process in your data platform.

d. You must create a SQL file containing a select statement:
Writing a SQL file is the initial step in creating a model, but this alone does not ensure that the model is built in your data platform. It needs to be executed using the dbt run command.

Correct Answer: b.
- You must use the dbt run command
- This command ensures the model is compiled and materialized in your data platform.

8
Q

What are two functions of a staging model in dbt?

a. Perform light transformations on your data set
b. Connect to upstream sources using the source macro
c. Connect to upstream models using the ref macro
d. Perform aggregations that apply business logic

A

Explanation of Each Option:
a. Perform light transformations on your data set:
This is correct. Staging models are designed to perform lightweight transformations, such as renaming columns, standardizing data types, or trimming raw data to make it ready for further processing.

b. Connect to upstream sources using the source macro:
This is correct. Staging models typically use the source macro to ingest data from raw data sources defined in sources.yml. This is a key function of staging models.

c. Connect to upstream models using the ref macro:
This is incorrect. Staging models generally do not reference other models using the ref macro; they reference raw sources instead. Referencing upstream models is more common in intermediate or marts models.

d. Perform aggregations that apply business logic:
This is incorrect. Staging models are not used for applying business logic or performing aggregations. Those responsibilities are typically handled by marts models or other layers downstream.

Correct Answers: a and b.

Staging model functions:
- Perform lightweight transformations.
- Serve as the initial layer that connects to raw (upstream) sources using the source macro.
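
A minimal sketch of a staging model doing both (the source, columns, and renames are illustrative assumptions):

sql

with source as (

    select * from {{ source('jaffle_shop', 'customers') }}

),

renamed as (

    select
        id as customer_id,      -- light transformation: rename
        first_name,
        last_name
    from source

)

select * from renamed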

9
Q

What are two functions of a marts model in dbt?

a. Reference upstream sources using the source macro
b. Perform light transformations on the raw data set
c. Apply business logic for stakeholders
d. Reference upstream models using the ref macro

A

Explanation of Each Option:
a. Reference upstream sources using the source macro:
This is incorrect. Marts models generally do not directly reference raw sources via the source macro. Instead, they depend on upstream staging or intermediate models using the ref macro.

b. Perform light transformations on the raw data set:
This is incorrect. Light transformations are typically performed in staging models, not marts models. Marts models focus on aggregations and applying business logic.

c. Apply business logic for stakeholders:
This is correct. Marts models are designed to contain business logic that aligns with stakeholder requirements, such as creating metrics, KPIs, or aggregations tailored for reporting and analytics.

d. Reference upstream models using the ref macro:
This is correct. Marts models use the ref macro to pull data from upstream staging or intermediate models. This ensures dependencies are properly managed and allows the marts layer to build upon transformations performed earlier.

Correct Answers: c and d.

Marts model functions:
- Apply business logic for analytics and stakeholder reporting.
- Reference upstream models (e.g., staging or intermediate layers) using the ref macro.
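
A minimal sketch of a marts model doing both (the grain and metrics are illustrative assumptions):

sql

with orders as (

    select * from {{ ref('stg_orders') }}

)

select
    customer_id,
    count(order_id) as number_of_orders,    -- business logic for stakeholders
    min(order_date) as first_order_date
from orders
group by customer_id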

10
Q

You are auditing your dbt project’s DAG. Where will sources appear?

a. Sources are the last node in a DAG, all the way to the right.

b. Sources are the second node in a DAG, connected to the raw data.

c. Sources are the first node in a DAG, all the way to the left.

d. Sources appear directly to the right of a model in the DAG.

A

Correct Answer: C. Sources are the first node in a DAG, all the way to the left.

Explanation of Each Option:
a. Sources are the last node in a DAG, all the way to the right:
This is incorrect. Sources are at the beginning of the data pipeline, not the end. Models, especially marts models, are typically the last nodes in the DAG.

b. Sources are the second node in a DAG, connected to the raw data:
This is incorrect. Sources represent the raw data itself, so they are the first node in the DAG, not the second.

c. Sources are the first node in a DAG, all the way to the left:
This is correct. In a dbt project’s DAG, sources represent the raw data defined in sources.yml. They are the starting point of the data pipeline and appear on the far left.

d. Sources appear directly to the right of a model in the DAG:
This is incorrect. Sources are not dependent on models—they are the origin of the data pipeline, so they appear before any models.

Key Takeaway:

Sources are always the starting point of your dbt project’s DAG, positioned at the far left, representing the raw data coming into your pipeline.

11
Q

Consider this yaml file.

models/staging/jaffle_shop/stg_jaffle_shop.yml

version: 2

sources:
  - name: tpch
    database: snowflake_sample_data
    schema: tpch_prod
    tables:
      - name: orders
      - name: customers
      - name: order_items

How will you correctly select all from customers utilizing the source macro?

a. select * from {{ source('snowflake_sample_data', 'customers') }}

b. select * from {{ source('tpch', 'customers') }}

c. select * from {{ source('tpch_prod', 'customers') }}

d. select * from {{ source('customers', 'tpch') }}

A

The source macro in dbt is used to reference data sources defined in a sources.yml file. Here’s a breakdown of how to correctly use it based on the provided YAML configuration:

Key Points from the YAML File:
- Source Name: tpch (defined in the sources section under name)
- Schema: tpch_prod (defined in the schema field)
- Tables: Includes orders, customers, and order_items. The name field of each table (e.g., customers) is what you use in the source macro.

Analysis of Options:
a. select * from {{ source('snowflake_sample_data', 'customers') }}:
This is incorrect. While snowflake_sample_data is the database, the source macro uses the source name (tpch) and the table name.

b. select * from {{ source('tpch', 'customers') }}:
This is correct, as it properly references the source name (tpch) and table name (customers).

c. select * from {{ source('tpch_prod', 'customers') }}:
This is incorrect. tpch_prod is the schema, but the source macro requires the source name and table name, not the schema.

d. select * from {{ source('customers', 'tpch') }}:
This is incorrect. The arguments are reversed, with the table name and source name in the wrong order.
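
Given the YAML above, the correct call resolves to a fully qualified name built from the database and schema fields (shown here approximately):

sql

select * from {{ source('tpch', 'customers') }}

-- compiles to roughly:
select * from snowflake_sample_data.tpch_prod.customers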

12
Q

You have a table in your data platform called raw.jaffle_shop.orders.

You are attempting to run this select statement to preview the 'orders' source:

select * from {{ source('raw', 'orders') }}

This statement is not running. Examine this YAML configuration:

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: orders
      - name: customers

What is the problem?

a. The source name is jaffle_shop, and should replace 'raw' in the source macro.

b. The database and the schema field should be swapped.

c. The schema name and database name should match.

d. The source macro should also include the database name in addition to the schema name.

A

Explanation:
The source macro references a source defined in a sources.yml file, which includes a name, database, schema, and the tables. When you call the source macro, you use the source name and the table name as defined in the YAML.

Analyzing the YAML Configuration:
- Source Name: jaffle_shop (defined under name)
- Database: raw (defined under database)
- Schema: jaffle_shop (defined under schema)
- Tables: Includes orders and customers.

The correct source macro should use:
sql

select * from {{ source('jaffle_shop', 'orders') }}

This matches the source name (jaffle_shop) and table name (orders) defined in the YAML.

Why the query isn't running:
The statement passes the schema name (raw) as the first argument to the source macro, but that argument must be the source name defined in the YAML (jaffle_shop). Because no source named raw exists in the project, dbt cannot resolve the reference. The database and schema fields in the YAML already match the table's location at raw.jaffle_shop.orders.

Analysis of Options:
a. The source name is jaffle_shop, and should replace 'raw' in the source macro:
This is correct. The source macro should reference the source name (jaffle_shop) from the YAML file, not the schema (raw).

b. The database and the schema field should be swapped:
This is incorrect. The YAML correctly specifies the database as raw and the schema as jaffle_shop, matching the table's location at raw.jaffle_shop.orders. There is no need to swap them.

c. The schema name and database name should match:
This is incorrect. The database and schema names do not need to match. In fact, it's common for the schema (jaffle_shop) to differ from the database (raw).

d. The source macro should also include the database name in addition to the schema name:
This is incorrect. The source macro only requires the source name and table name. The database and schema are defined in the YAML and are not specified in the macro.
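
With the YAML above, the corrected statement and roughly what it compiles to:

sql

select * from {{ source('jaffle_shop', 'orders') }}

-- compiles to roughly:
select * from raw.jaffle_shop.orders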

13
Q

You have written the following source file, but are encountering a syntax error:

models/staging/jaffle_shop/stg_jaffle_shop.yml

version: 2

sources:
  - name: analytics
    database: raw
    schema: analytics
    tables:
        - name: agg_customer_orders__all_time
      - name: agg_regions_segments

What went wrong?

a. The names of tables should come before the name of the source.

b. The line identifying agg_customer_orders_all_time is indented one tab too far.

c. The database and schema names should always match.

d. The "tables" key should be indented to match the indentation of agg_customer_orders__all_time.

A

Explanation:
In dbt YAML files, proper indentation is critical for defining sources and their tables. The error in the YAML file lies in the misalignment of the - name line for the agg_customer_orders__all_time table. It is indented incorrectly, which causes a syntax error.

Correct YAML Structure:
The tables key should define a list, with each table entry starting with - name at the same indentation level. Here’s how it should look:

yaml

sources:
  - name: analytics
    database: raw
    schema: analytics
    tables:
      - name: agg_customer_orders__all_time
      - name: agg_regions_segments

Why Option b is Correct:
The agg_customer_orders__all_time table’s line is indented one tab too far, making it invalid YAML. Fixing the indentation so that it aligns with the other table entries resolves the error.

Analysis of Other Options:

a. The names of tables should come before the name of the source:
This is incorrect. In dbt, the source (name: analytics) is defined first, followed by the tables key, which lists all tables under that source. The order here is correct.

c. The database and schema names should always match:
This is incorrect. The database (raw) and schema (analytics) can be different, and there’s no requirement for them to match.

d. The "tables" key should be indented to match the indentation of agg_customer_orders__all_time:
This is incorrect. The tables key is at the correct level of indentation. The problem lies with the agg_customer_orders__all_time entry being over-indented.

14
Q

What is the purpose of declaring a source in dbt?

a. Tells dbt the pre-existing data to query from the data platform.

b. Tells dbt what data to copy from the data platform into dbt.

c. Tells dbt what data should be stored in tables and not views.

d. Tells dbt where to store your transformation results.

A

Explanation:
In dbt, a source is used to define and reference pre-existing data that resides in your data platform. These are typically raw datasets or external tables that dbt does not create or manage but uses as inputs for transformations.

Purpose of Declaring a Source:
* It maps raw data in the data warehouse so that dbt can query it using the source macro.
* It serves as a reference point for raw data tables and allows dbt to document and test these inputs.

Analysis of Options:
a. Tells dbt the pre-existing data to query from the data platform:
This is correct. Declaring a source in dbt tells it where to find the raw data in the data platform, allowing you to query and transform it in downstream models.

b. Tells dbt what data to copy from the data platform into dbt:
This is incorrect. dbt does not copy raw data into dbt. Instead, it queries and transforms it where it resides in the data platform.

c. Tells dbt what data should be stored in tables and not views:
This is incorrect. Source declarations are unrelated to materialization (tables vs. views). Materialization settings are applied to dbt models, not sources.

d. Tells dbt where to store your transformation results:
This is incorrect. Source declarations specify where to find raw data, not where transformation results are stored. Transformation results are controlled by model configurations.

16
Q

Which is not a feature of testing in dbt?

a. Identify errors in my data pipelines when they occur.

b. Create trust in the data products we share with stakeholders.

c. Determine if the sql we wrote has done what we intended.

d. Enable stakeholders to self-serve and understand how a model was built

A

Correct Answer:
d. Enable stakeholders to self-serve and understand how a model was built

Explanation of Each Option:
a. Identify errors in my data pipelines when they occur:
This is a feature of dbt testing. By using tests (e.g., schema tests or data quality tests), dbt helps identify errors in your data pipelines, such as missing values, duplicates, or constraint violations.

b. Create trust in the data products we share with stakeholders:
This is a feature of dbt testing. When tests validate the accuracy and quality of your data, they help ensure stakeholders can trust the data products.

c. Determine if the SQL we wrote has done what we intended:
This is a feature of dbt testing. Tests help confirm that transformations produce the expected results and meet the intended logic.

d. Enable stakeholders to self-serve and understand how a model was built:
This is not a feature of testing. While dbt’s documentation and lineage graphs provide transparency into how models are built, this functionality is separate from testing. Testing focuses on ensuring data correctness and integrity.

17
Q

Consider the YAML configuration below. Assume this is the only YAML file in the models directory.

sources:
  - name: salesforce
    tables:
      - name: accounts
        columns:
          - name: account_id
            tests:
              - unique

What tests will be run in your project?

a. A unique test will be run on the accounts source table.

b. A unique test will be run on the salesforce source table.

c. A unique test will be run on the salesforce model

d. A unique test will be run on the accounts model

A

Correct Answer:
a. A unique test will be run on the accounts source table.

Explanation:
The YAML configuration defines a source named salesforce with a table called accounts. A unique test is specified on the account_id column within this table.

Key details:

This YAML applies tests to the accounts source table, not a dbt model.
Since sources are explicitly defined and include the test, dbt will validate the uniqueness of the account_id column in the source table, not in any downstream models.

Analysis of Options:
a. A unique test will be run on the accounts source table:
This is correct. The test is applied to the accounts table in the salesforce source as defined in the YAML.

b. A unique test will be run on the salesforce source table:
This is incorrect. The test is applied to the accounts table, not the entire salesforce source.

c. A unique test will be run on the salesforce model:
This is incorrect. The YAML file applies the test to a source table, not a dbt model.

d. A unique test will be run on the accounts model:
This is incorrect. The test is on the accounts table within the source, not on any dbt model.

18
Q

When a test is run, what is happening under the hood in dbt?

a. dbt will compile python to run against models or sources

b. dbt will compile SQL to run against models or sources

c. dbt will enforce a constraint on the column of the models or sources

d. dbt will scan the SQL of your data models to ensure the data materializes correctly

A

Correct Answer:
b. dbt will compile SQL to run against models or sources

Explanation:
When a test is run in dbt, it compiles a SQL query that executes against the database to check the condition specified by the test (e.g., uniqueness, not null, relationships). The test’s logic is translated into SQL, executed, and the results indicate whether the test passes or fails.

What Happens Under the Hood:
1. dbt reads the test configuration (e.g., unique, not_null).
2. It generates SQL queries for these tests (e.g., a query to check for duplicate rows for a unique test).
3. The compiled SQL is executed against the database.
4. The results are evaluated, and any failures are reported.

Analysis of Options:
a. dbt will compile Python to run against models or sources:
This is incorrect. dbt is a SQL-first tool, and tests are implemented by generating SQL, not Python.

b. dbt will compile SQL to run against models or sources:
This is correct. Tests in dbt are translated into SQL queries that validate conditions on models or sources.

c. dbt will enforce a constraint on the column of the models or sources:
This is incorrect. dbt tests do not enforce database constraints. They validate data conditions by querying the database but do not modify or enforce rules on the database itself.

d. dbt will scan the SQL of your data models to ensure the data materializes correctly:
This is incorrect. dbt does not validate the correctness of your SQL or check if data materializes correctly during testing. It focuses on validating data properties (e.g., uniqueness or relationships).
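
For example, a unique test on account_id compiles to SQL of roughly this shape (the exact query and the fully qualified table name vary by dbt version and source configuration):

sql

select
    account_id as unique_field,
    count(*) as n_records
from raw.salesforce.accounts   -- assumed location of the source table
where account_id is not null
group by account_id
having count(*) > 1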

19
Q

On Monday, you are working in development and run dbt build. Your entire project materializes and tests successfully. You only have accepted_values tests on your sources and models.

On Tuesday, you log back in and run dbt run and your models all run. You then run dbt test and find that 5 tests failed.

What is most likely the reason for the tests failing?

a. The SQL in your models resulted in a compilation error.

b. The column name in one of your sources changed.

c. A new value was introduced on a column you were testing.

d. A column you were testing now has a duplicate.

A

Correct Answer:
C. A new value was introduced on a column you were testing.

Explanation:
The scenario describes a situation where the project built successfully on Monday (including all tests passing), but on Tuesday, tests for accepted values failed. This points to the most likely cause being a new, unexpected value appearing in a column that is being tested with an accepted values test. This test checks that only specific, predefined values are present in a column.

Why Other Options Are Less Likely:
a. The SQL in your models resulted in a compilation error:
This is unlikely. Compilation errors would prevent the models from running at all, but the prompt states that the models ran successfully, indicating no compilation issues.

b. The column name in one of your sources changed:
This is unlikely. A change in a column name in a source would cause an error during the run phase because dbt would be unable to find the column. The fact that models ran successfully indicates that column names remained intact.

c. A new value was introduced on a column you were testing:
This is likely. An accepted values test ensures that only specific values are allowed in a column. If a new, unexpected value was introduced in the source data, this would cause the test to fail.

d. A column you were testing now has a duplicate:
This is unlikely. A duplicate value would cause a failure in a unique test, but the scenario specifies that you only have accepted values tests.
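
For reference, an accepted_values test is configured like this (the model, column, and value list are illustrative assumptions); a failure like the one described means the column started containing a value outside the configured list:

yaml

models:
  - name: stg_orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']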

20
Q

What is the dbt best practice for testing your primary keys?

a. Apply a unique test to your primary keys.

b. Apply a not_null test to your primary keys.

c. Apply a unique and not_null test to your primary keys.

d. Apply a relationships test to your primary keys to ensure referential integrity to a foreign key.

A

Correct Answer:
c. Apply a unique and not_null test to your primary keys.

Explanation:
A primary key must uniquely and completely identify each row in a table. To ensure this, it must meet two conditions:
* Uniqueness: Each primary key value must be distinct.
* Not Null: A primary key cannot have null values, as nulls would prevent rows from being uniquely identified.

In dbt, best practice is to apply:
* A unique test to ensure no duplicate values exist.
* A not_null test to ensure every row has a value in the primary key column.

Analysis of Options:
a. Apply a unique test to your primary keys:
This is partially correct. Uniqueness is necessary but not sufficient. You must also ensure the column is not null.

b. Apply a not_null test to your primary keys:
This is partially correct. While a primary key cannot be null, this alone does not guarantee uniqueness.

c. Apply a unique and not_null test to your primary keys:
This is correct. Both tests are necessary to enforce the properties of a primary key.

d. Apply a relationships test to your primary keys to ensure referential integrity to a foreign key:
This is incorrect. While relationship tests validate referential integrity, they are applied to foreign keys, not primary keys. The focus for primary keys is ensuring uniqueness and absence of nulls.
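
A typical configuration, using an illustrative model and key column:

yaml

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null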

21
Q

Assume your project only has models and sources, with tests configured on models and sources (i.e., there are no snapshots or seeds; these are beyond the scope of this quiz).

How does the dbt build command work?

a. dbt build will first test your sources, then materialize all your models, and then test all your models.

b. dbt build will first test your sources, then materialize and test each model in DAG order.

c. dbt build will first materialize your models, then test your sources, and then test your models.

A

Correct Answer:
b. dbt build will first test your sources, then materialize and test each model in DAG order.

Explanation:
The dbt build command is a comprehensive command that performs the following steps in sequence:
* Tests sources: It runs any configured tests on your sources (e.g., schema tests such as unique or not_null).
* Materializes models: It builds each model by executing its SQL and materializing it in the database, following the dependency order in the DAG (Directed Acyclic Graph).
* Tests models: It applies tests to each model after it is materialized, also following the DAG order.
* This ensures the pipeline runs logically, verifying raw data quality (source tests) before building and testing transformed models.

Analysis of Options:
a. dbt build will first test your sources, then materialize all your models, and then test all your models:
This is incorrect. While this describes a linear process, dbt build does not materialize all models at once. It materializes and tests models in DAG order, ensuring upstream dependencies are built and tested before downstream models.

b. dbt build will first test your sources, then materialize and test each model in DAG order:
This is correct. The dbt build command adheres to the DAG structure, materializing and testing models in the correct sequence.

c. dbt build will first materialize your models, then test your sources, and then test your models:
This is incorrect. dbt build tests sources before materializing models, not after.

22
Q

Which command will run tests only on sources?

a. dbt test --select source:*
b. dbt test --select _sources.yml
c. dbt test --select staging/
d. dbt source test

A

Correct Answer:
a. dbt test --select source:*

Explanation:
The dbt test command is used to execute tests, and the --select argument allows you to specify what to test. To run tests only on sources, you use the source:* selector, which targets all sources defined in your project.

Analysis of Options:
a. dbt test --select source:*:
This is correct. The source:* selector specifically targets all sources in the project for testing.

b. dbt test --select _sources.yml:
This is incorrect. There is no valid selector for _sources.yml. You cannot directly reference a YAML file in the --select argument.

c. dbt test --select staging/:
This is incorrect. The staging/ selector targets models in the staging directory, not sources.

d. dbt source test:
This is incorrect. There is no dbt source test command. Source testing is handled using dbt test with the appropriate selector.
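
A few related selector examples (the jaffle_shop source name is reused from earlier cards for illustration):

dbt test --select source:*                    # tests on all sources
dbt test --select source:jaffle_shop          # tests on one source
dbt test --select source:jaffle_shop.orders   # tests on one source table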

23
Q

You are documenting the columns in your models. Several models have the same column. What documentation type guarantees these columns have a shared description?

a. Single-line string description

b. Multi-line string description

c. Doc block

d. Read-me file

A

Correct Answer: C. Doc block

Explanation:
In dbt, doc blocks are reusable descriptions that can be applied to multiple columns or models. By defining a shared description in a doc block, you ensure consistency across columns or models that share the same concept, avoiding duplication and discrepancies in documentation.

How a Doc Block Works:
Define the doc block in a Markdown (.md) file using the {% docs %} and {% enddocs %} Jinja tags:

md

{% docs column_description %}
This column contains the unique identifier for each customer.
{% enddocs %}

Reference the doc block in your model or column documentation using the doc() function:

yaml

columns:
  - name: customer_id
    description: "{{ doc('column_description') }}"

By using the doc() function, the same description is shared across all references to that doc block.

Analysis of Options:
a. Single-line string description:
This is incorrect. Single-line descriptions are specific to individual columns or models and do not guarantee shared descriptions.

b. Multi-line string description:
This is incorrect. While multi-line descriptions allow more detail, they are also specific to individual columns or models.

c. Doc block:
This is correct. Doc blocks are reusable and ensure consistency in descriptions across multiple columns or models.

d. Read-me file:
This is incorrect. A read-me file provides general project-level documentation and is not tied to individual columns or models.

24
Q

You have added descriptions to your dbt models.

How can you publish these descriptions in the documentation site to be viewed?

a. dbt docs serve

b. dbt docs generate

c. dbt compile

d. dbt docs

A

Correct Answer: B. dbt docs generate

Explanation:
In dbt, the dbt docs generate command builds the documentation site by compiling all the descriptions, metadata, and lineage defined in your project into a set of static files. This step is necessary before you can view your updated documentation.

Once the documentation is generated, you can use dbt docs serve to host it locally and view it in a browser.

Workflow for Publishing Descriptions:
Run dbt docs generate to build the documentation site.
Run dbt docs serve to view the documentation locally.

Analysis of Options:
a. dbt docs serve:
This is partially correct. dbt docs serve starts a local web server to view the documentation, but it requires dbt docs generate to be run first to generate the content.

b. dbt docs generate:
This is correct. This command compiles and builds the documentation site, making the updated descriptions available for viewing.

c. dbt compile:
This is incorrect. dbt compile translates model SQL into executable SQL, but it does not generate or publish documentation.

d. dbt docs:
This is incorrect. There is no dbt docs command by itself. The proper commands are dbt docs generate and dbt docs serve.

25
Q

Which file type are single-line and multi-line descriptions written in?

a. .sql

b. .yml

c. .md

d. .py

A

Correct Answer: B. .yml

Explanation:
Single-line and multi-line descriptions in dbt are written in .yml files. These files are used for configuration and documentation, including defining sources, models, columns, and their associated descriptions.

Example of descriptions in a .yml file:

yaml

models:
  - name: customers
    description: "This model contains customer data."  # single-line description
    columns:
      - name: customer_id
        description: |
          This is the unique identifier for each customer.
          It is used as a primary key across tables.
26
Q

What aspect of the generated lineage can help you understand your data flow?

a. The selector can help narrow the scope of the shown lineage, which allows you to see how your specific model is used upstream and/or downstream

b. Descriptions of the models listed in the lineage graph help you see exactly what each model does

c. Column-level descriptions persist between models with the same column.

d. Determine bottlenecks in job runs with model timing.

A

Correct Answer: A. The selector can help narrow the scope of the shown lineage, which allows you to see how your specific model is used upstream and/or downstream.

Explanation:
The lineage graph in dbt's generated documentation visualizes the dependencies and flow of data between models, sources, and other elements in the project. The selector feature allows you to narrow the scope of the graph to focus on a specific model or source, making it easier to understand how data moves upstream (inputs) and downstream (outputs) relative to the selected element.

Analysis of Options:
a. The selector can help narrow the scope of the shown lineage, which allows you to see how your specific model is used upstream and/or downstream:
This is correct. The selector lets you focus on a single model or source, clarifying how it interacts with other elements in the data pipeline. This helps in understanding data flow and dependencies.

b. Descriptions of the models listed in the lineage graph help you see exactly what each model does:
This is partially correct, but descriptions, while helpful, are not a feature directly tied to understanding the data flow in the lineage graph. The selector is the primary tool for narrowing scope and exploring flow.

c. Column-level descriptions persist between models with the same column:
This is incorrect. Column-level descriptions are useful for documentation but do not affect the understanding of lineage or data flow in the graph.

d. Determine bottlenecks in job runs with model timing:
This is incorrect. Model timing and performance analysis are not visualized in the lineage graph. They are better explored through job logs and dbt Cloud analytics.
27
Q

How do you indicate a deployment environment as a production environment?

a. Add a flag to the job execution settings.

b. Select production environment in the deployment environment settings.

c. Select production environment in the development environment settings.

d. Enter a service account's credentials in the deployment environment settings.

A

Correct Answer: b. Select production environment in the deployment environment settings.

Explanation:
In dbt, deployment environments are configured to indicate whether an environment is for production, development, or testing purposes. To mark an environment as a production environment, you explicitly configure it in the deployment environment settings by selecting the production environment option. This setting distinguishes production from non-production environments, allowing dbt to manage them differently (e.g., enforcing stricter validation, logs, and access controls).

Analysis of Options:
a. Add a flag to the job execution settings:
This is incorrect. While job execution settings control how jobs run, they do not determine whether an environment is designated as production.

b. Select production environment in the deployment environment settings:
This is correct. The deployment environment settings provide the configuration to specify that an environment is a production environment.

c. Select production environment in the development environment settings:
This is incorrect. Development environment settings are for individual developers and are separate from deployment environment configurations.

d. Enter a service account's credentials in the deployment environment settings:
This is incorrect. Service account credentials are used to manage access to the data platform, but they do not define an environment as production.
28
Q

What is the function of a job in dbt Cloud?

a. the execution of dbt commands in development environments

b. the execution of dbt commands in deployment environments

c. configure the settings for where code should be run in production

d. investigate the data assets in your production environment

A

Correct Answer: B. the execution of dbt commands in deployment environments

Explanation:
In dbt Cloud, a job is a predefined set of dbt commands that are executed in a deployment environment (such as production or staging). Jobs automate the execution of tasks like building models, running tests, and refreshing documentation. They ensure consistency and repeatability in your workflows.

Key functions of a job:
* Automate the execution of dbt commands (dbt run, dbt test, etc.).
* Schedule commands to run at specific intervals.
* Manage execution in deployment environments, typically staging or production, separate from development environments.

Analysis of Options:
a. the execution of dbt commands in development environments:
This is incorrect. Jobs are typically used in deployment environments, not development environments. Development commands are usually run manually by developers.

b. the execution of dbt commands in deployment environments:
This is correct. Jobs are specifically designed to execute dbt commands in deployment environments, such as production or staging.

c. configure the settings for where code should be run in production:
This is incorrect. While jobs execute code, configuring where code runs (e.g., production settings) is part of environment setup, not the job itself.

d. investigate the data assets in your production environment:
This is incorrect. Investigating data assets is done through dbt's documentation or data platform tools, not via jobs.
29
Q

Which one of the following is true about running your dbt project in production?

a. Running dbt in production should use a different repository than I used in development

b. Running dbt in production requires the re-configuration of sources to refer to production data

c. Running dbt in production should use a different database schema than I used in development

d. Running dbt in production only supports the dbt run command - it does not support dbt test

A

Correct Answer: c. Running dbt in production should use a different database schema than I used in development

Explanation:
When running your dbt project in production, it is a best practice to use a different database schema for production data to ensure isolation from development or staging environments. This prevents conflicts or accidental overwrites and maintains data integrity across environments.

Analysis of Options:
a. Running dbt in production should use a different repository than I used in development:
This is incorrect. The same repository is typically used for development and production. Changes are promoted through branches (e.g., from development to production) within the same repository.

b. Running dbt in production requires the re-configuration of sources to refer to production data:
This is incorrect. dbt supports dynamic environments, where source configurations (like schemas or databases) are managed through environment variables. This avoids manual re-configuration when switching between development and production.

c. Running dbt in production should use a different database schema than I used in development:
This is correct. Using separate schemas (or even databases) for production and development ensures that each environment operates independently, reducing the risk of unintentional changes to production data.

d. Running dbt in production only supports the dbt run command - it does not support dbt test:
This is incorrect. dbt tests are fully supported in production and are often critical to validate data quality before promoting results to downstream systems.
30
Q

When does a dbt job run?

a. Every morning at 9 am, in the time zone of your choosing

b. At whatever cadence you set up for the job

c. When developing in the IDE and the command "dbt run" is entered in the command bar

d. Only when you open a pull request

A

Correct Answer: B. At whatever cadence you set up for the job

Explanation:
In dbt, a job runs on a schedule that you configure in dbt Cloud. This allows you to automate workflows, such as building models, running tests, and refreshing documentation, at a cadence that suits your needs (e.g., hourly, daily, or weekly). You can also trigger jobs manually or via API integrations if needed.

Analysis of Options:
a. Every morning at 9 am, in the time zone of your choosing:
This is incorrect. While you could schedule a job to run at 9 am, the timing is not fixed and depends on the specific schedule you configure.

b. At whatever cadence you set up for the job:
This is correct. Jobs run based on the schedule you configure in dbt Cloud, allowing for flexible execution times and intervals.

c. When developing in the IDE and the command "dbt run" is entered in the command bar:
This is incorrect. While dbt run can trigger a model build in development, this is not related to jobs, which are designed for deployment environments.

d. Only when you open a pull request:
This is incorrect. Jobs are not automatically tied to pull requests, although they can be triggered through CI/CD workflows that integrate with pull requests.
31