October 24, 2024

dbt Labs’ Coalesce 2024 Recap

By Bruno Souza de Lima

Welcome to the dbt Labs’ Coalesce 2024 recap! I’m Bruno from phData—you might have seen me on LinkedIn posting about dbt. Although I’ve been very engaged with dbt and the community for a while, this was my first time attending Coalesce in person, and I’m so excited to break down all the biggest announcements from the conference! 

In this blog, I’ll share my experience attending, highlight some exciting awards, and unpack all the major updates from the event.

Overall Impression

It was my first time at Coalesce in person (and my first time in the US, by the way), and it was a totally different feeling than watching it online. The venue was great—a Vegas hotel with an entire floor dedicated exclusively to the conference. 

There was a large room for vendor booths where you could find all kinds of swag—from coconut water to 3D-printed hero figures of yourself and even magic tricks. There were several rooms for speaking sessions covering customer stories, dbt for practitioners, dbt for enterprises, deeper dives into announcements, AI, and much more. 

The only downside was that sometimes I wanted to watch more than one session simultaneously, but at least you can watch them on-demand. The sponsor parties were also excellent, and the Coalesce after-party exceeded my expectations!

But what I liked most was meeting in person a ton of extraordinary people I had previously only known online. I met with people from phData, dbt Labs, other companies, vendors, practitioners, and even friendly competitors from all around the world. 

I love remote work and wouldn’t change it, but it’s a great feeling to connect with people in person; it’s a different kind of communication—even though I’m terrible at recognizing people from their profile pictures.

Awards

This was the perfect Coalesce for me to attend in person because, first of all, phData won the Partner of the Year Award Overall for the second year in a row!

Secondly, I was one of the individuals who won the dbt Community Award (also for the second time in a row)!

After the community awards announcements, I had the incredible opportunity to join a fireside chat about the community with two other great community award winners, where we discussed the dbt community and our personal journeys and experiences.

Dakota Kelley played a key role in helping us win the Partner of the Year Award and contributed to my own recognition with the community award once again. I also had the privilege of sharing the stage with Dakota during a speaking session titled “Advanced Pipelines in dbt Cloud,” which was a great opportunity to highlight their impact.

And last but certainly not least, what made all these awards even more special was that just days before Coalesce, I got engaged to my girlfriend, Danielly, and that’s a great award for me!

Biggest Announcements at dbt Labs’ Coalesce 2024

Thanks for reading up to here! This is the part you came here for—it’s time to talk about the dbt announcements.

One dbt

The main focus of the whole conference was, without a doubt, around the concept of One dbt.

According to the keynotes and presentations, dbt Labs is now moving toward a path where dbt is no longer just the transformation tool it used to be. dbt is becoming one framework that helps everyone—from analysts to decision-makers, at companies of all sizes, whether it’s one person or thousands, and regardless of the platform they use. Whether it’s dbt Core or Cloud, it’s just one dbt.

To accomplish this, dbt has been integrating features that cover what they introduced as the data control plane—features like governance, orchestration, semantics, and catalog—in addition to the transformation part of the pipeline, of course.

Also, dbt is becoming more flexible and cross-platform, more collaborative, and more trustworthy than ever. Those were the words they used to announce the new features coming to dbt Cloud!

Flexible & Cross-Platform

For this category, the announcements are related to interoperability and adding support for more platforms and tools. This makes dbt more flexible, allowing you to work with the tools that fit you best. There were two big, closely related announcements:

Iceberg Table Support: dbt Cloud now supports Apache Iceberg. Why is this huge? Iceberg is an open-source table format for analytics, and the industry is adopting it as a standard. Customers want to be able to use the same format across platforms, and vendors are listening. And now you can use Iceberg with dbt! You just need to add the table_format configuration to your model. Additionally, you can add the external_volume and base_location_subpath configurations to specify where dbt will write the Iceberg table’s metadata and files. You can read more about the Iceberg configs here.

{{
    config(
        materialized="table",
        table_format="iceberg",
        external_volume="s3_iceberg_snow"
    )
}}

select * from {{ ref('raw_orders') }}

The Iceberg support makes it possible for the second announcement to come true:

Cross-Platform dbt Mesh: With dbt Mesh, you could already share models across different projects on the same data platform. Now, you’ll be able to share models across different data platforms!

That’s the magic of using Iceberg: you can use the same table format on different platforms to reference models across platforms! 

To be able to do it, you need to:

  • Integrate both platforms with the same Iceberg catalog.

  • Configure the upstream model to be public and to write to your Iceberg catalog (see the sketch below).

It will start with Snowflake, Databricks, Redshift, and Athena, but soon support for more platforms will be added.
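
To make this concrete, here’s a minimal sketch of what the upstream model’s YAML might look like (the model name and external volume are hypothetical; check the dbt Mesh and Iceberg docs for the exact options on your platform):

models:
  - name: fct_orders            # hypothetical upstream model shared across platforms
    access: public              # makes it referenceable from downstream projects
    config:
      materialized: table
      table_format: iceberg             # write the model as an Iceberg table
      external_volume: s3_iceberg_snow  # volume registered in the shared Iceberg catalog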

And here are some other announcements for the flexibility and cross-platform category. dbt is adding support for more platforms and semantic layer connections:

  • New Integrations: dbt is welcoming AWS Athena (GA) and Teradata (Preview) to the family!

  • BI Tool Integration: A new dbt Semantic Layer connection to Power BI is coming soon!

Lastly, dbt Labs announced a cost optimization tool integrated into dbt Cloud (coming in 2025), where you will be able to monitor your costs and get recommendations to reduce them.

Collaborative

The collaborative category of announcements aims to allow more people with different skills and backgrounds to work together in dbt.

Again, for the category, we have two major announcements:

Visual Low-Code Editor: An intuitive, visual drag-and-drop interface (currently in private beta) that allows you to create dbt models without writing SQL. Additionally, you can switch between code and visuals as you please.

For me, this is awesome for a few reasons:

  1. People who don’t write SQL can write dbt models.

  2. You can write your dbt model with SQL as you are used to, but you can explain your model visually for less technical folks and make it much more intuitive.

  3. You can more easily check the output of some parts of your transformation, which is great for debugging and explaining the code to someone else.

  4. You can spot problems in your code that you would not catch as easily by looking at the SQL, like orphan CTEs. These CTEs are not referenced anywhere else in the code, so they would appear as isolated boxes with no connections in your visualization (see the example after this list).
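
For illustration, here’s a small, hypothetical example of an orphan CTE: the customers CTE is defined but never referenced, so the visual editor would render it as a disconnected box.

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (  -- orphan CTE: defined here but never used below
    select * from {{ ref('stg_customers') }}
)

select * from orders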

So, people now have three ways of creating dbt models and interacting with dbt Cloud, fitting personas from the more technical to the less technical: the dbt Cloud CLI, the dbt Cloud IDE, and the upcoming visual editor. Multiple ways of working in One dbt.

The other announcement for this category was not that new, but it came with some new additions.

dbt Copilot: Your AI assistant (currently in beta)! Copilot was already announced as an AI tool for generating tests, documentation, and semantic models. But at Coalesce 2024, some more features were added:

  • It can now answer data questions, so even those not used to SQL can get useful information from the data.

  • Plus, you can now bring your own OpenAI API key to Copilot!

In summary, these two announcements make it much easier for less technical users to work with dbt and make the work of technical people more accessible and faster.

Trustworthy

The last category covers announcements that help you understand your data better and get a complete view of your project, increasing your trust in your data.

The major announcements for this category were:

Advanced CI with Compare Changes: This feature gives you detailed insights into how your changes impact your data before deploying to production. You can see what will be added, removed, and changed in your tables, so you don’t get any bad surprises in production. Advanced CI also makes it a lot easier for reviewers to review pull requests.

Auto-Exposures with Tableau: Automatically populate your dbt DAG with downstream exposures in Tableau (Power BI support coming soon). 

This has already been announced, but some new features are coming. Soon, you will be able to see more information in the exposure node, access the Tableau dashboard from dbt Explorer, and embed Data Health Tiles (the next announcement).

Data Health Tiles: Embed health signals like data quality and freshness within any dashboard, giving stakeholders confidence in the data. For example, you can add these dbt Health Tiles to your Tableau Dashboard.

A lot of exciting new stuff is coming to dbt Cloud, and in case you have already forgotten all the announcements, check out this image that summarizes everything.

dbt Core v1.9

Besides the One dbt announcements, some very interesting new features are coming in dbt Core v1.9, which is in beta at the time of writing. Feel free to test it out and give feedback to dbt Labs!

New Snapshots Configuration

Snapshots in dbt are going through some changes. They started in dbt as resources defined in a YML file, then moved to SQL files inside a Jinja snapshot block, and now they’re returning to their YML origins.

These changes are not coming out of nowhere. There’s been a lot of discussion about snapshots over the last few years, and dbt Labs brought the community heavily into this discussion, which I greatly appreciate. They admit that they sometimes get things wrong and will do their best to fix these problems with the community’s feedback.

So, here’s what’s new for Snapshots:

YML instead of SQL: Snapshots can now be configured in YML like sources are. So, for example, instead of creating an orders_snapshot.sql file to snapshot your source table for orders, you would create a YML file like this:

snapshots/orders_snapshot.yml

snapshots:
  - name: orders_snapshot
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      database: analytics
      unique_key: id
      strategy: timestamp
      updated_at: updated_at

Here, you need to define the name, the relation, and the same configurations as before. Then you might ask, “OK, but in the SQL file I could write some custom transformations for the snapshot. What now?” The answer is: you still can!

By default, this snapshot YML file assumes you want to select everything from the relation, like a select * from source. That’s the desired behavior for most snapshots. If you need to change it for any reason, you can define an ephemeral model with your custom logic, like this:

models/ephemeral_orders.sql

{{ config(materialized='ephemeral') }}

select *
from {{ source('jaffle_shop', 'orders') }}
where some_condition

Then, reference this ephemeral model in the snapshot’s relation:

snapshots/orders_snapshot.yml

snapshots:
  - name: orders_snapshot
    relation: ref('ephemeral_orders')
    config:
      schema: snapshots
      database: analytics
      unique_key: id
      strategy: timestamp
      updated_at: updated_at

target_schema is now optional: This was an old complaint from the community, and dbt Labs fixed it. Before 1.9, all snapshots were written into the same schema no matter the environment, which made development harder. No more being forced into a single snapshot location for both prod and dev; now you can keep separate snapshots for each environment, or keep one for all environments. It’s your choice!

When target_schema is omitted, dbt will follow the rules defined by the generate_schema_name and generate_database_name macros.

Note in the example file that you can set custom schemas and databases as in any other resource. 
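
For reference, this is roughly what dbt’s built-in generate_schema_name logic does, which is the behavior snapshots will now follow when target_schema is omitted (a sketch of the default macro; your project may already override it):

{% macro generate_schema_name(custom_schema_name, node) -%}
    {#- Default behavior: use the target schema, optionally suffixed with
        the custom schema configured on the resource -#}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}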

Meta column names are customizable: Another old complaint from the community. When your snapshot is materialized, it creates metadata columns such as dbt_valid_from and dbt_valid_to, and we couldn’t change their names. What people usually did was create a view on top of the Snapshot and rename the columns.

Now, you can rename them in the Snapshot YML file using the snapshot_meta_column_names configuration.

snapshots/orders_snapshot.yml

snapshots:
  - name: orders_snapshot
    relation: ref('ephemeral_orders')
    config:
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      snapshot_meta_column_names:
        dbt_valid_from: start_date
        dbt_valid_to: end_date

New Incremental Strategy: Micro-Batching

As for the second major dbt Core v1.9 announcement, dbt Labs launched a new incremental strategy, called micro-batching, based on the experimental strategy insert_by_period.

Before v1.9, if you wanted to run a dbt incremental model, you were only able to run the whole incremental period in one single query. For example, if your incremental model processes one week of data, this whole week would be run in one query.

With micro-batching, you can break this query into multiple smaller batch queries. So you could process your week as multiple daily batches with one single model, just by setting the configs. Let me show you what it would look like:

{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='_loaded_at',
        batch_size='day',
        lookback=7,
        begin='2024-01-01',
        full_refresh=false
    )
}}

...

You need to:

  • Set your incremental_strategy as microbatch

  • Define an event_time. According to the docs, event_time is the column indicating “at what time did the row occur.” It is required for your microbatch model and for any direct parents that should be filtered (see the sketch after this list).

  • Define the batch_size. The batch_size is the granularity of your batches. For our example, we want to break one week into 7 days, so we choose day. It can be hour, day, month, or year.

  • Define begin. begin is the start point of the incremental model for initial and full-refresh runs.

  • Define lookback (optional): lookback is the number of batches before the current timestamp that dbt will also process, which helps capture late-arriving records.
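
Since direct parents also need an event_time to be filtered per batch, here’s a hedged sketch of how that configuration might look on an upstream source (the source and column names mirror the earlier examples and are illustrative):

sources:
  - name: jaffle_shop
    tables:
      - name: orders
        config:
          event_time: _loaded_at  # lets dbt filter this parent down to each batch's time window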

Also, micro-batching makes it easy for you to run backfills. You just need to pass the interval you want to backfill via the command line. For example:

dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"

Another cool thing is that if you have some failed batches, you can use dbt retry to rerun only those and not all the data.
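
A minimal sketch of that workflow, assuming the same backfill run as above:

dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"  # suppose some batches fail
dbt retry                                                              # reruns only the failed batches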

--sample (coming up)

Lastly, I just wanted to comment on a feature I’m very excited about. It is not available yet, but it will be in the future.

This is the sample flag. A lot of developers I know try to avoid high costs in development by running only samples of their tables, and they do it in different ways: they might add sampling to the SQL code with Jinja if/else blocks, override built-in macros, or do something else.
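
For context, here’s a hedged sketch of one common manual workaround (not the upcoming flag): a Jinja condition that limits rows only when running against a dev target. The target name and row count are illustrative.

select *
from {{ ref('stg_orders') }}
{% if target.name == 'dev' %}
limit 1000  -- sample only in development to keep warehouse costs down
{% endif %}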

Fortunately, dbt Labs is working on integrating this sampling capability into dbt Core natively, and I can’t wait to use it!

If you want to see everything new in dbt Core v1.9, check out this page.

Closing Thoughts

Coalesce 2024 was an incredible event, filled with major awards, exciting announcements, and an amazing network of people. It’s hard to pick a favorite moment! I’m eager to dive into all the new features and excited to see where dbt Labs is heading in the future. I hope they continue to strengthen the community and give us all a strong voice moving forward.

And if your organization is looking to make the most of dbt, phData is ready to assist. As dbt’s Partner of the Year, we have the expertise to ensure your dbt setup is optimized and powerful, driving your organization forward. If you are still in doubt, check out our whitepaper on accelerating and scaling dbt for enterprise.
