January 7, 2025

Why Should You Codify Your Best Practices in dbt?

By Ayush Tandon

Discovering a new technology is just the beginning; real competence comes from realizing its full potential. Since you found your way to this blog, you are probably already familiar with dbt Cloud. Now it’s time to take the next step and learn how to enhance your projects by codifying best practices.

In this blog, we will explore the importance of codifying best practices in dbt and provide you with practical guidance on how to do so. The blog will primarily focus on the two key components of implementing best practices: the “why” and the “how.”

Why Should You Codify Your Best Practices in dbt?

Any project that wants to grow and succeed in the real world must always follow best practices. A simple project structure that other developers and stakeholders easily understand ensures scalability. When developers follow best practices, it simplifies debugging and frees up more time to tackle complex business challenges.

Meta-Testing to Confirm Best Practices in dbt

Although there are numerous best practices to adhere to when dealing with dbt, the following areas are the main focus of this blog:

  • Structuring the dbt project

  • Manifesting with dbt Project Evaluator

  • Linting and formatting your SQL code

As you become more accustomed to these processes, you will become an expert at codifying dbt projects using best practices that make them scalable, well-structured, and tailored to your business’s analytical needs. Let’s explore each in more detail!

Structuring the dbt Project

The most important aspect of any dbt project is its structural design, which organizes project files and code in a way that supports scalability for large data warehouses. In this section, we will discuss how to codify best practices for dbt through structured approaches and walk through an example using customer and order data, where we have created multiple folders that add clarity and meaning to the analytics being performed.

Before diving into the structure, it’s important to note that models are the fundamental building blocks of dbt, where developers spend most of their time in the dbt environment. Data Analytics Engineers create models to transform data without it ever leaving the warehouse. With that in mind, there are several ways to organize your models effectively, and we will focus on how to structure data related to customers and orders.

Folder structuring for best practices

The highlighted sections, staging, intermediate, and marts, represent folders created to provide functional meaning to raw data and distinguish between different layers of models that depend on one another. 

Staging

  1. Folder and File Structure

    1. Subdirectories in the staging folder should always be named according to their source systems. Source systems often share similar properties across different tables, making it easier to manage them together in smaller sets.

    2. Model files within these subdirectories should follow the naming convention stg_<source_name>__<entity>.sql. The double underscore helps clearly identify the entity being defined from the source, especially when the source name contains multiple underscores (see the folder layout sketch below).

    3. Source and model YAML files are created separately to simplify the addition of new models or source entity configurations. These files are prefixed with an underscore for quick identification.

An example illustration of staging folder
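
To make the conventions above concrete, here is a minimal sketch of how a staging layout could look; jaffle_shop and stripe are hypothetical source names used purely for illustration:

    models/
    └── staging/
        ├── jaffle_shop/
        │   ├── _jaffle_shop__sources.yml
        │   ├── _jaffle_shop__models.yml
        │   ├── stg_jaffle_shop__customers.sql
        │   └── stg_jaffle_shop__orders.sql
        └── stripe/
            ├── _stripe__sources.yml
            ├── _stripe__models.yml
            └── stg_stripe__payments.sql
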
  2. Models
    The following generic example illustrates two CTEs (Common Table Expressions): one that pulls data from the source and another that applies transformations.

    1. Staging models should be the only place where the source macro is referenced, establishing a 1:1 relationship between each staging model and its corresponding source.

    2. These staging models should involve lightweight transformations, such as renaming columns, type casting, basic computations, or conditional logic (like CASE statements) to categorize rows.

    3. Since staging models are not final artifacts, materializing them as views is the best approach to ensure that the underlying data remains up to date.

    4. It is recommended to use the dbt-codegen package to automatically generate all the source YAML files and staging models, which can speed up the development process for large projects.

An example illustrating CTEs inside the staging model
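
As a hedged illustration of this pattern, a staging model for a hypothetical orders table in a jaffle_shop source (table and column names are assumptions) might look like:

    -- models/staging/jaffle_shop/stg_jaffle_shop__orders.sql
    with

    source as (
        -- staging is the only layer that references the source() macro
        select * from {{ source('jaffle_shop', 'orders') }}
    ),

    renamed as (
        -- lightweight transformations only: renaming, casting, simple CASE logic
        select
            id as order_id,
            user_id as customer_id,
            cast(order_date as date) as ordered_at,
            case
                when status in ('returned', 'return_pending') then 'returned'
                else status
            end as order_status
        from source
    )

    select * from renamed

If you adopt the dbt-codegen package mentioned above, its generate_source and generate_base_model macros can scaffold this boilerplate, for example: dbt run-operation generate_base_model --args '{"source_name": "jaffle_shop", "table_name": "orders"}'.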

Intermediate

  1. Folder and File Names

    1. The intermediate layer is built on top of the staging layer, and, as the name suggests, it holds the data needed to build the final models or reports.

    2. Unlike the staging subfolders, it is recommended to use business or functional names for the subdirectories under the intermediate layer to align with specific business areas.

    3. While it can be challenging to assign meaningful names to intermediate model files due to the complexity of joins and aggregations involved, best practices suggest naming the models with a format like int_<entity>_<verb>.sql. For example, int_orders_count.sql is located in a subfolder called finance.

An example of naming an intermediate subdirectory and model file
  2. Models
    The example below illustrates that intermediate models do not need to be physically present in the target database; they can be used at runtime by defining their materialization as ephemeral.

    1. Since intermediate models are not intended for final output to dashboards, they should not be exposed to end users in the main production schema.

    2. These models can be used to break complex operations into smaller chunks, helping to avoid heavy computations in the final marts models.

    3. By keeping complex operations isolated from the final output, these models simplify the project and enhance clarity in the project structure.

An example of an intermediate model that is not exposed to end users
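
A minimal sketch of such a model, reusing the hypothetical int_orders_count.sql example from the previous section (column names are assumptions), might look like:

    -- models/intermediate/finance/int_orders_count.sql
    {{ config(materialized='ephemeral') }}

    -- never built in the warehouse; dbt inlines it as a CTE wherever it is ref()'d
    select
        customer_id,
        count(order_id) as order_count
    from {{ ref('stg_jaffle_shop__orders') }}
    group by customer_id
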

Marts

  1. Folder and Filenames

    1. It is essential to group the marts based on business or functional areas.

    2. These final models are exposed to end users, so naming them according to their purpose is important; for example, orders.sql in the example below.

An example of a marts folder structure
  2. Models
    Let’s consider the example of the orders model, which serves as the final output that feeds data for dashboards and further analysis.

    1. It is crucial to start by materializing the final marts model as a view, which ensures that the data in the model is always up to date.

    2. If querying the data in views takes more time than expected, it would then be wise to materialize it as a table.

    3. If even materializing it as a table takes a long time due to the large amount of data, it makes more sense to materialize the model as incremental. This reduces the build time of the table and processes only new and updated data, based on the chosen incremental strategy.

    4. It is considered good practice to keep the final models denormalized so that analysts can run queries and get the desired results without having to join them with other result sets (see the sketch below).

An example of a denormalized orders model
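
Pulling the points above together, a hedged sketch of a denormalized, incrementally materialized orders mart (model and column names are assumptions carried over from the earlier sketches) could look like:

    -- models/marts/finance/orders.sql
    {{ config(
        materialized='incremental',
        unique_key='order_id'
    ) }}

    select
        o.order_id,
        o.customer_id,
        c.customer_name,
        o.ordered_at,
        o.order_status,
        oc.order_count as customer_lifetime_orders
    from {{ ref('stg_jaffle_shop__orders') }} as o
    left join {{ ref('stg_jaffle_shop__customers') }} as c
        on o.customer_id = c.customer_id
    left join {{ ref('int_orders_count') }} as oc
        on o.customer_id = oc.customer_id

    {% if is_incremental() %}
    -- on incremental runs, only pick up rows newer than what is already in the table
    where o.ordered_at > (select max(ordered_at) from {{ this }})
    {% endif %}

Starting with materialized='view' and switching to 'table' or 'incremental' only when query or build times demand it, as described above, keeps the change to a single line of configuration.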

Remainder of the Project Structure

  1. Config Per Folder 

    1. So far, we have discussed models, and it’s important to consider where to place the configurations you want to provide for those models, such as materializations, tests, and descriptions. In an ideal scenario, it is best to have source.yml and model.yml files in each folder, making it easier for developers to manage them.

    2. While defining configuration files per folder is the most standard approach, there is no harm in maintaining a separate YAML configuration file for each model. The choice ultimately depends on how you prefer to manage the project, as individual config files can sometimes make it easier for developers to locate the settings related to each model.

An example of maintaining configs per folder
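
For instance, a per-folder models YAML for the staging models sketched earlier might look like this (the file name, descriptions, and tests are illustrative only):

    # models/staging/jaffle_shop/_jaffle_shop__models.yml
    version: 2

    models:
      - name: stg_jaffle_shop__orders
        description: "One record per order, lightly cleaned from the source system."
        columns:
          - name: order_id
            description: "Primary key of the staging orders model."
            tests:
              - not_null
              - unique
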
  2. Cascade Configs

    1. It would be wise to use dbt_project.yml to define high-level configurations for your models rather than duplicating similar configuration properties for each individual model.

An example of cascading configurations based on folder structures
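
A sketch of such cascading configuration in dbt_project.yml, assuming the layer folders used in this post and a hypothetical project name of my_dbt_project, might look like:

    # dbt_project.yml (excerpt)
    models:
      my_dbt_project:
        staging:
          +materialized: view
        intermediate:
          +materialized: ephemeral
        marts:
          +materialized: table

Any model inside these folders inherits the folder-level materialization unless it overrides it in its own config block.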
  3. Custom Generic Tests

    1. Previously, the only place to define your custom generic tests was in the macros/ directory, as they function similarly to macros. Now, we have another option for adding generic tests in the tests/generic/ directory.

    2. All custom generic tests should include at least one of the following standard arguments: model and column_name.

    3. You may also include additional arguments in your tests, which must be specified in the test signatures.

An example of custom generic test named is_valid
An example of an application of custom generic test in dbt
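
The actual is_valid test from the screenshots is not reproduced here, but a custom generic test of the same shape, accepting the standard model and column_name arguments plus one extra argument declared in its signature, might be sketched as:

    -- tests/generic/is_valid.sql
    {% test is_valid(model, column_name, allowed_values) %}

    -- the test fails if this query returns any rows
    select {{ column_name }}
    from {{ model }}
    where {{ column_name }} not in (
        {% for value in allowed_values %}
        '{{ value }}'{% if not loop.last %},{% endif %}
        {% endfor %}
    )

    {% endtest %}

Applied in a model YAML file, the extra argument is passed alongside the test name:

    columns:
      - name: order_status
        tests:
          - is_valid:
              allowed_values: ['placed', 'shipped', 'completed', 'returned']
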
  4. Materializations – Based on the key metrics and functional requirements outlined below, developers can determine the most suitable materialization method.

    |               | View                                                       | Table                                       | Incremental                               |
    |---------------|------------------------------------------------------------|---------------------------------------------|-------------------------------------------|
    | Time of Build | Fastest, always stores logic                               | Slowest, acts upon complete data            | Faster than table, acts upon part of data |
    | Cost of Query | Higher than table, always runs the logic to bring the data | Lower, denormalized data stays in warehouse | Lower, denormalized data stays in warehouse |
    | Freshness     | Always up to date                                          | Based on latest run                         | Based on latest run                       |

Manifesting with dbt Project Evaluator

As the name suggests, the dbt project evaluator checks if your project is well-organized, scalable, and adheres to best practices for maintaining high-quality data pipelines. The dbt project evaluator package highlights areas where your project does not meet the standards set by dbt Labs.

Below is a breakdown of the areas where the dbt project evaluator validates your project:

  1. Modeling:

    1. Direct Join to Source: Models should not reference both a source and another model. Keep a 1:1 relationship between each source and its staging model so that minor transformations on the source data are done only once and stay consistent across the project.

    2. Downstream Models Dependent on Source: Downstream models (marts or intermediate) should not directly depend on source nodes. Staging models are the atomic units of data modeling and hold the transformed source data, so downstream models should reference staging models instead.

    3. Duplicate Sources: Multiple source nodes should not be assigned to the same database object in the warehouse to maintain accurate data lineage.

    4. Hard-Coded References: Hard-coded table references in models should be avoided; replace them with ref() for models and source() for raw data (a brief sketch appears after this list of checks). This ensures that dbt can generate the correct DAG and dependencies, and that the current model selects upstream models from the same environment you are working in.

    5. Model Fanout: By default, if a model has more than three downstream models, consider refactoring. Though the value can be overridden, it may indicate that some transformations should move to the BI layer or that common business logic should be centralized upstream.

    6. Multiple Sources Joined: A single model should not reference multiple sources. Create separate staging models for each source, which can then be joined in downstream models.

    7. Rejoining of Upstream Concepts: Avoid redundant rejoining of models in a DAG. Consolidate SQL logic into one model to simplify the DAG and reduce build time.

    8. Root Models: Models without parents in the DAG might directly reference raw data. To maintain lineage and execution order, replace raw references with ref() or source() functions.

    9. Source Fanout: A source should only feed into a single staging model. Other models should reference the cleaned data from the staging model rather than the raw source.

    10. Staging Models Dependent on Downstream Models: Staging models should only depend on sources. If a staging model references an intermediate or marts model, it may be misnamed or need lineage corrections.

    11. Staging Models Dependent on Other Staging Models: Avoid dependencies between staging models. Review the model types or rename them to reflect their true meaning (e.g., intermediate or marts).

    12. Unused Sources: Ensure all defined sources in YML files are being used in the project.

  2. Testing: 

    1. Missing Primary Key Tests: It is recommended to add not_null and unique tests to the primary key column of each model.

    2. Test Coverage: It evaluates the whole project for test coverage and ensures every model in your project has tests applied.  

  3. Documentation: 

    1. Document coverage: It calculates the percentage of models that have descriptions attached to them.

    2. Undocumented Models: Each model should have a description so that its purpose is clear to other developers and stakeholders.

    3. Undocumented Source Tables: Each source table name should have a description.

    4. Undocumented Sources: Each source should have a description to add clarity to the project.

  4. Structure: 

    1. Model Naming Conventions: While querying the data warehouse, it is important to follow the naming conventions that provide meaning to the model type and help the users query the data easily. 

    2. Model Directories: All model files should be placed under the directories or subdirectories that match their model type. This ensures repository organization and makes it easy to locate information about the models.

    3. Test Directories: Each subdirectory in the models/ folder should have a single .yml file that holds the tests and documentation for all the models in that subfolder. 

  5. Performance: 

    1. Chained View Dependencies: Avoid having a distance between parent and child greater than 4 (by default, but it’s configurable) where the objects are of type view or ephemeral. Instead, create upstream models as tables or incremental models to improve compilation time.

    2. Exposure Parents Materializations: Exposures should not depend on raw sources, but on highly performant models. This means that the models on which exposures depend should be materialized as either table or incremental.

  6. Governance: 

    1. Public models without contracts: Public models are designed to be consumed by any downstream application. Hence, adding contracts ensures that the data types, columns, and other constraints always remain consistent.

    2. Undocumented Public Models: This check will ensure that each model configured to be accessed as public has a description at the model level and descriptions on each of its columns.

    3. Exposures dependent on private models: Models that downstream tools read through an exposure should be public, trusted, and have contracts enforced against them.

All the checks we discussed validate your project and insert any violations into corresponding fact tables. If a fact table holds rows, best practices are being violated somewhere in the project.
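
As a brief sketch of the hard-coded references check called out in the modeling section above, the fix is simply to swap literal table names for dbt’s source() and ref() functions (the names below are hypothetical):

    -- flagged: hard-coded reference to a warehouse object
    select * from raw.jaffle_shop.orders

    -- preferred: lets dbt build the correct DAG and resolve the right environment
    select * from {{ source('jaffle_shop', 'orders') }}

    -- and for upstream models
    select * from {{ ref('stg_jaffle_shop__orders') }}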

Example Project

Let’s dive into a real-life example project and evaluate it. Adding the dbt project evaluator is as simple as adding another package to your dbt project. We have added the package details under packages.yml and run dbt deps to install the package.

Illustration to add dbt evaluator package
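
A hedged sketch of that packages.yml entry is shown below; the version range is illustrative only, so check the dbt package hub for the release that matches your dbt version:

    # packages.yml
    packages:
      - package: dbt-labs/dbt_project_evaluator
        version: [">=0.8.0", "<1.0.0"]
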

Now, to run the evaluation, execute dbt build --select package:dbt_project_evaluator. This populates the fact tables corresponding to the violated best practices and runs tests on them, as shown below. By default, the test severity is set to warn. The details of the violations can be verified from the fact tables generated in the target data warehouse.

An example that shows all the test warnings displayed by the dbt project evaluator

To view the violations behind is_empty_fct_undocumented_sources_, the corresponding fact table fct_undocumented_sources can be queried as shown below.

The fct_undocumented_sources table in dbt
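
The query itself is a simple select against the fact table; for example, previewing it in the dbt Cloud IDE might look like:

    -- lists every source that is missing a description
    select *
    from {{ ref('fct_undocumented_sources') }}
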

Once all these sources have descriptions in the YAML file, the records will no longer be present in this table on the next validation. The package also lets you create a custom seed file to define exceptions that the dbt project evaluator should ignore. Let’s look at an example of adding exceptions to the seed file.

We have again run the command dbt build --select package:dbt_project_evaluator to evaluate the dbt project and found a few warnings, as shown below.

Each is_empty test that appears as a warning means that the corresponding table or view is not empty and holds records that violate the dbt project evaluator’s default expectations. To override the default expectations, follow the approach below:

  1. Verify the records in each corresponding table/view

fct_model_naming_conventions
fct_rejoining_of_upstream_concepts
  2. Create a custom seed file named dbt_project_evaluator_exceptions.csv under the seeds folder and add the exceptions. fct_name, column_name, id_to_exclude, and comment are the only fields that need to be populated for each view or table that throws a warning.

Added a custom seed file for exception scenarios
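
A hedged sketch of such a seed file, using the four fields named above, is shown below; the column_name and id_to_exclude values are hypothetical and should be matched to the columns and rows of the fact tables in your own warehouse:

    fct_name,column_name,id_to_exclude,comment
    fct_model_naming_conventions,model,report_finance_summary,Legacy name kept for a downstream tool
    fct_rejoining_of_upstream_concepts,child,int_orders_count,Intentional rejoin reviewed by the team
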
  3. Place the configuration in the dbt_project.yml file to override the default expectations.

Once the above steps are followed, we need to run dbt seed to load the exception seed file, followed by dbt build --select package:dbt_project_evaluator to evaluate the project.

As shown below, the next run for the dbt project evaluator respects the exceptions added and does not throw the warnings.

dbt project evaluator run with no warnings

You may refer to the dbt project evaluator documentation for more details on all rules and customizations and to learn how to disable the checks for your specific use case.

Lint and Format Your SQL Code

dbt allows you to enhance your development experience by leveraging the ability to lint and format your SQL code using popular linters and formatters like SQLFluff and sqlfmt.

In the dbt Cloud IDE, you can easily check, fix, and format SQL files to keep your code clean and consistent. Use SQLFluff to lint (check for errors) and auto-fix, and sqlfmt to format the code. Each has its own rules for linting and formatting, and you can tweak these rules to make them more flexible or to improve how errors and style issues are detected.

By default, the IDE will use sqlfmt to format your SQL code. But if you create a .sqlfluff file in the root of your dbt project, the IDE will automatically switch to using SQLFluff rules instead.

Before clicking the format button on IDE
After clicking the format button on IDE

The logs in the screenshot above clearly show whether sqlfmt or SQLFluff is being used to format the SQL code. To use SQLFluff for linting and formatting in the IDE, you need to add a .sqlfluff file along with your customizations (if any). The example below shows how, once the .sqlfluff file and its custom rules are added, the Format button in the IDE changes to a Lint/Fix dropdown.

Lint/Fix button enabled on adding .sqlfluff file in the root directory
Example for .sqlfluff representation with custom rules
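
The exact rules used in the screenshot are not reproduced here, but a minimal .sqlfluff for a dbt project, assuming a Snowflake warehouse and a preference for lowercase keywords, might look like:

    [sqlfluff]
    dialect = snowflake
    max_line_length = 120

    [sqlfluff:rules:capitalisation.keywords]
    capitalisation_policy = lower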

Using the config button at the bottom right (shown in the screenshot above), you can select your preferred format.

An example to fix the sql code style using SQLFluff

The example above is inspired by dbt’s style guide. Other recommended style guides include Matt Mazur’s SQL Style Guide and GitLab’s Style Guide. Once the preferred style guide is selected, the next step is to incorporate linting into your workflow so SQLFluff enforces adherence to the chosen guide.

Closing

It is always recommended to follow best practices while working on a dbt project. It ensures that the projects can be easily scaled for large enterprises by accommodating various business use cases. It is essential for success to adopt a ‘dbt best practices’ mindset while developing data models for downstream applications. 

These insights are the result of extensive experimentation and real-world case studies supported by collaborative efforts of the broader dbt community. Together, we learn from our experiences and readily share best practices, whether related to project structuring, utilizing dbt packages to streamline project workflows, or implementing linting and formatting to maintain consistency across the project. 

Looking for more real-world advice and help with your dbt projects?

The dbt experts at phData can help!
As dbt Labs 2024 Partner of the Year, phData is uniquely positioned to quicken your dbt success story. Learn more by exploring our dbt success stories!

FAQs

  1. For simpler logic and the need to always query up-to-date data, use views for materialization.

  2. Tables are better for fast querying with large datasets when complexity is involved, while views are quicker to set up but slower when queried with complex logic.

  3. If updates to the data happen frequently and re-creating the table with historical records takes too long, opt for incremental materialization to only insert new and updated records.

  4. If certain models don’t need to be physically stored in the data warehouse and are built only to reduce complexity for the final model, consider using ephemeral materialization.

  5. If multiple models rely on the same logic, materializing it into a table can improve overall performance.

The dbt project evaluator package lets you discover the gray areas of your project that might need a fix to adhere to best practices. It gives you the flexibility to disable checks and add custom exceptions to the default rules. In a nutshell, it allows your project to stay organized and performant, and ensures a readable code structure for a better development experience.
