Our models (typically) fit into three main categories: staging, marts, base/intermediate. For more detail about why we use this structure, check out this discourse post. The file and naming structures are as follows:
├── dbt_project.yml
└── models
├── marts
| └── core
| ├── intermediate
| | ├── intermediate.yml
| | ├── customers__unioned.sql
| | ├── customers__grouped.sql
| └── core.yml
| └── core.docs
| └── dim_customers.sql
| └── fct_orders.sql
└── staging
└── stripe
├── base
| ├── base__stripe_invoices.sql
├── src_stripe.yml
├── src_stripe.docs
├── stg_stripe.yml
├── stg_stripe__customers.sql
└── stg_stripe__invoices.sql
- All objects should be plural, such as:
stg_stripe__invoices
- Base tables are prefixed with
base__
, such as:base__<source>_<object>
- Intermediate tables should end with a past tense verb indicating the action performed on the object, such as:
customers__unioned
- Marts are categorized between fact (immutable, verbs) and dimensions (mutable, nouns) with a prefix that indicates either, such as:
fct_orders
ordim_customers
- Model-specific attributes (like sort/dist keys) should be specified in the model.
- If a particular configuration applies to all models in a directory, it should be specified in the
dbt_project.yml
file. - In-model configurations should be specified like this:
{{
config(
materialized = 'table',
sort = 'id',
dist = 'id'
)
}}
- Marts should always be configured as tables
- Only
stg_
models (orbase_
models if your project requires them) should select fromsource
s. - All other models should only select from other models.
- Every subdirectory should contain a
.yml
file, in which each model in the subdirectory is tested. For staging folders, the naming structure should besrc_sourcename.yml
. For other folders, the structure should befoldername.yml
(examplecore.yml
). - At a minimum, unique and not_null tests should be applied to the primary key of each model.
- Schema, table and column names should be in
snake_case
. - Use names based on the business terminology, rather than the source terminology.
- Each model should have a primary key.
- The primary key of a model should be named
<object>_id
, e.g.account_id
– this makes it easier to know whatid
is being referenced in downstream joined models. - For base/staging models, fields should be ordered in categories, where identifiers are first and timestamps are at the end.
- Timestamp columns should be named
<event>_at
, e.g.created_at
, and should be in UTC. If a different timezone is being used, this should be indicated with a suffix, e.gcreated_at_pt
. - Booleans should be prefixed with
is_
orhas_
. - Price/revenue fields should be in decimal currency (e.g.
19.99
for $19.99; many app databases store prices as integers in cents). If non-decimal currency is used, indicate this with suffix, e.g.price_in_cents
. - Avoid reserved words as column names
- Consistency is key! Use the same field names across models where possible, e.g. a key to the
customers
table should be namedcustomer_id
rather thanuser_id
.
For more information about why we use so many CTEs, check out this discourse post.
- All
{{ ref('...') }}
statements should be placed in CTEs at the top of the file - Where performance permits, CTEs should perform a single, logical unit of work.
- CTE names should be as verbose as needed to convey what they do
- CTEs with confusing or noteable logic should be commented
- CTEs that are duplicated across models should be pulled out into their own models
- create a
final
or similar CTE that you select from as your last line of code. This makes it easier to debug code within a model (without having to comment out code!) - CTEs should be formatted like this:
with
events as (
...
),
-- CTE comments go here
filtered_events as (
...
)
select * from filtered_events
-
Use trailing commas
-
Indents should be four spaces (except for predicates, which should line up with the
where
keyword) -
Lines of SQL should be no longer than 80 characters
-
Field names and function names should all be lowercase
-
The
as
keyword should be used when aliasing a field or table -
Fields should be stated before aggregates / window functions
-
Aggregations should be executed as early as possible before joining to another table.
-
Ordering and grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see this rant for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design.
-
Specify join keys - do not use
using
. Certain warehouses have inconsistencies inusing
results (specifically Snowflake). -
Prefer
union all
tounion
* -
Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers".
-
If joining two or more tables, always prefix your column names with the table alias. If only selecting from one table, prefixes are not needed.
-
Be explicit about your join (i.e. write
inner join
instead ofjoin
).left joins
are normally the most useful,right joins
often indicate that you should change which table you selectfrom
and which one youjoin
to. -
DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE. NEWLINES ARE CHEAP, BRAIN TIME IS EXPENSIVE
with
my_data as (
select * from {{ ref('my_data') }}
),
some_cte as (
select * from {{ ref('some_cte') }}
),
some_cte_agg as (
select
id,
sum(field_4) as total_field_4,
max(field_5) as max_field_5
from some_cte
group by 1
),
final as (
select [distinct]
my_data.field_1,
my_data.field_2,
my_data.field_3,
-- use line breaks to visually separate calculations into blocks
case
when my_data.cancellation_date is null
and my_data.expiration_date is not null
then expiration_date
when my_data.cancellation_date is null
then my_data.start_date + 7
else my_data.cancellation_date
end as cancellation_date,
some_cte_agg.total_field_4,
some_cte_agg.max_field_5
from my_data
left join some_cte_agg
on my_data.id = some_cte_agg.id
where my_data.field_1 = 'abc'
and (
my_data.field_2 = 'def' or
my_data.field_2 = 'ghi'
)
having count(*) > 1
)
select * from final
- Your join should list the "left" table first (i.e. the table you are selecting
from
):
select
trips.*,
drivers.rating as driver_rating,
riders.rating as rider_rating
from trips
left join users as drivers
on trips.driver_id = drivers.user_id
left join users as riders
on trips.rider_id = riders.user_id
- Indents should be two spaces
- List items should be indented
- Use a new line to separate list items that are dictionaries where appropriate
- Lines of YAML should be no longer than 80 characters.
version: 2
models:
- name: events
columns:
- name: event_id
description: This is a unique identifier for the event
tests:
- unique
- not_null
- name: event_time
description: "When the event occurred in UTC (eg. 2018-01-01 12:00:00)"
tests:
- not_null
- name: user_id
description: The ID of the user who recorded the event
tests:
- not_null
- relationships:
to: ref('users')
field: id
- When using Jinja delimiters, use spaces on the inside of your delimiter, like
{{ this }}
instead of{{this}}
- Use newlines to visually indicate logical blocks of Jinja