Inline model documentation + tests #6853

jakebiesinger · 2023-02-03T17:03:33Z

The high-level requirement in my mind is "let me specify all the informational bits about a model inside the model itself". For me, this would be:

- model descriptions (docs blocks)
- column descriptions
- model + column tests

This topic seems rather perennial: it keeps popping up again and again! Here's a quick citation list:

- The discussion "docs+tests directly in a model" #5093, which covers several approaches
- I wrote a series of PRs 4 years ago to allow {% docs %} blocks to be specified in .sql files #1042 and then to auto-attach them to models + columns based on the doc block name #1043
- The feature request for the same thing in "Parse docs blocks when they appear in .sql files" #979, with lots of community traction + interactions
- Several scripts + external packages that work around this shortcoming
  - @vergenzt's internal parsing script cited above
  - @HarlanH's .sql parser --> .md generator https://gist.github.com/HarlanH/277006989774372515c7130e63809315
  - "Feature: Add tests to dbt docs UI and references" #1051, requesting that tests be specifiable inline (a side point on the main FR there)
  - A dbt Discourse post on doing exactly this https://discourse.getdbt.com/t/here-is-a-way-to-write-dbt-docs-as-sql-comment/1658 and the associated Python package https://github.com/anelendata/dbt_docstring (again, it auto-generates yml files from sql files)
  - dbt-codegen https://github.com/dbt-labs/dbt-codegen, a macro intended to be placed in model .sql files, which dumps yml to stdout so you can manually copy-paste it into schema.yml files

It seems clear that the community wants this feature in some form. Why not bring it into dbt-core and give the people what they want? 😸

Over the years, the DBT team has seemed to like this idea, but has raised concerns:

- The model file might get too big (dozens of columns in a model might be confusing, especially if all the docs are inlined)
- Concerns around availability of macros + properties within these blocks, due to the separate lifecycle DBT is running under the hood for documentation
- Desire for it to be clean + sufficiently explicit to not feel too magical / confusing
- Compatibility with existing systems, and having Yet Another Way to specify docs + schema

So far, we've seen proposals for syntax including:

- expanding the config jinja call to allow docs + descriptions + tests to be specified inline (so a single massive call to config would put all the things in place)
- Allow docs blocks in sql files, with possible support to auto-attach them to models (since they appear in the model sql file anyway)
- Custom "javadoc"-style annotations, possibly surrounded by /** */ blocks (which can help with IDE confusion around jinja blocks)

I am happy to (again) put in some elbow grease to get this done. Is there appetite for receiving it on the DBT team side?

jtcohen6 · 2023-02-06T11:43:07Z

Maintainer

@jakebiesinger This was very cool to read. Thank you for kicking it off as a discussion, for having done your homework, and for sticking around through the years to see this idea through :)

There are two challenges here:

1. Finding a syntax that we can live with, given that we're ultimately talking about inlining structured data (currently expected as yaml) inside another language (Jinja-SQL, now also Python!)
2. Finding the capacity/priority to get this shipped. No question, this would be a nontrivial improvement to UX & ergonomics (as you note, it's been requested & upvoted by dozens of community members), and it would also require a nontrivial lift. Over the past few years, it's been tough for us to prioritize ergonomic improvements, relative to long-overdue needs around performance and foundational APIs.

It sounds like you'd be open to diving in and working on this, if we can set you up with a clear set of foundations to work in. Our parsing logic is still quite thorny (though better than it used to be); I couldn't promise a walk in the park.

I'm getting ahead of myself, though. First thing we'd need is an agreeable answer to (1).

Finding a syntax

Let's check each of the options:

> expanding the config jinja call to allow docs + descriptions + tests to be specified inline (so a single massive call to config would put all the things in place)

I'm skeptical of this for two reasons:

- config has special rules around hierarchical inheritance / merging, which make sense for most of its current options, and make a lot less sense for the "non-config" properties (description, tests, columns).
- I don't think defining structured data in Jinja and passing it into the {{ config() }} macro is a good user experience, and it also leads to all sorts of confusion with the implicit differences between the information available at "parse time" versus "runtime." (If you've never had to dig into the docs on pre/post hook rendering and double-curly nesting, that's probably a good thing. See "Expand guidance on late-rendering for hooks", docs.getdbt.com#2818.)

> Allow docs blocks in sql files, with possible support to auto-attach them to models (since they appear in the model sql file anyway)

This specific idea (#979) got quite a bit of traction. As you say, I'm not sure how we'd do the "auto-attaching," except via some implicit magic with the naming convention (docs block named model_name__column_name, or named column_name and defined in a model file named model_name).
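
To make the "implicit magic" concrete, here is a minimal sketch of what a naming-convention resolver might look like. The function name `resolve_doc_target` and the exact rules are hypothetical, chosen only to illustrate the two conventions mentioned above; nothing here is dbt's actual behavior.

```python
from typing import Optional, Tuple


def resolve_doc_target(block_name: str, file_model: str) -> Tuple[str, Optional[str]]:
    """Map a docs block name to (model_name, column_name or None).

    Hypothetical convention (an assumption, not dbt behavior):
      - "<model>__<column>" documents a column of <model>
      - a block named exactly like the file's model documents the model
      - any other bare name is treated as a column of the file's model
    """
    if "__" in block_name:
        # Split on the first double underscore only.
        model, column = block_name.split("__", 1)
        return model, column
    if block_name == file_model:
        return file_model, None
    return file_model, block_name


print(resolve_doc_target("orders__order_id", "orders"))
print(resolve_doc_target("order_id", "orders"))
print(resolve_doc_target("orders", "orders"))
```

Even this toy version shows why the convention is fragile: a model whose own name contains a double underscore (say `stg_stripe__payments`) would have a block named `stg_stripe__payments__id` resolved to model `stg_stripe` and column `payments__id`.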

> Custom "javadoc"-style annotations, possibly surrounded by /** */ blocks (which can help with IDE confusion around jinja blocks)

This would require us to do SQL parsing, which isn't something dbt has done to date. While the answer is definitely not "never," I'm quite wary of taking an accidental half-step into this particular quagmire.

It would be much more tenable for Python models, where we've already toyed with the idea of supporting model-level descriptions via docstring:

def model(dbt, session):
    """
    This is my awesome model!
    """
    df = ...
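
Part of why this is more tenable for Python is that the standard library can already read a docstring without running the model. The following is a rough sketch using the `ast` module; the helper name `model_description` is hypothetical, and this is not how dbt-core is known to do it.

```python
import ast
from typing import Optional


def model_description(source: str) -> Optional[str]:
    """Return the docstring of the top-level `model` function, if any.

    The source is only parsed, never executed.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "model":
            # get_docstring strips the indentation and surrounding blank lines.
            return ast.get_docstring(node)
    return None


example = '''
def model(dbt, session):
    """
    This is my awesome model!
    """
    df = None
    return df
'''

print(model_description(example))  # prints: This is my awesome model!
```

SQL models have no equivalent of this: there is no standard-library parser that would reliably pick a `/** */` annotation out of Jinja-SQL, which is the quagmire mentioned above.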

What about YFM?

Here's an idea, which you have every right to hate: What if we did this as yaml front matter (YFM)? After scanning through the older linked issues, I didn't see this come up; I may have missed it, or I may be proposing something that everyone else has tacitly agreed would be terrible. For whatever reason, I don't hate the idea, at least not at first glance.

Idea being, these remain two separate & independent sets of information, which just happen to live in the same file. The yaml in the front matter can include any of the key-value pairs that could also be passed to the models: list in a yaml file. It would be parsed at the same time as the rest of the model file, but using our yaml parser, instead of the ModelParser.
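
As a rough illustration of "two independent sets of information in one file", the sketch below separates the front matter from the model body using only string handling. The function name `split_front_matter` is hypothetical, and the rule that the file must begin with a `---` line is an assumption borrowed from how front matter usually works elsewhere.

```python
from typing import Tuple

DELIMITER = "---"


def split_front_matter(text: str) -> Tuple[str, str]:
    """Split a model file into (front_matter_yaml, body).

    Returns an empty front matter string when the file does not start
    with a '---' line, or when the closing '---' line is missing.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != DELIMITER:
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == DELIMITER:
            # Everything between the delimiters is yaml; the rest is the model.
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text


sample = "---\ndescription: my model\n---\n\nselect 1\n"
front_matter, body = split_front_matter(sample)
print(repr(front_matter))
print(repr(body))
```

The first string would then go to a yaml parser and the second to the existing model parser, so neither parser needs to know about the other's syntax.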

The front matter would either take priority over, or be mutually exclusive with, defining yaml properties for this model in a separate yaml file. If the model also defines {{ config() }} (Jinja-SQL) or dbt.config() (Python), it would take precedence over, or be mutually exclusive with, the config defined in the front matter.

---
description: This is my model's description. It can reference a {{ doc('block') }} if it wants to.
config:
  materialized: incremental
  unique_key: id
columns:
  - name: first_col
    tests:
      - unique
      - not_null
---

select ...

---
columns:
  - name: first_col
    tests:
      - unique
      - not_null
---

def model(dbt, session):
    ...
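
For the "takes priority over" reading of the precedence rules (as opposed to the "mutually exclusive" reading), the resolution order could be sketched as follows. The function `resolve_config` and the flat key-by-key override are assumptions made for illustration; as noted earlier, real config merging in dbt has hierarchical rules that this ignores.

```python
from typing import Any, Dict, Optional


def resolve_config(
    yaml_file: Optional[Dict[str, Any]] = None,
    front_matter: Optional[Dict[str, Any]] = None,
    inline_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge config sources key by key; later sources win.

    Assumed order, lowest to highest priority:
    separate yaml file < front matter < {{ config() }} / dbt.config().
    """
    merged: Dict[str, Any] = {}
    for source in (yaml_file, front_matter, inline_config):
        merged.update(source or {})
    return merged


resolved = resolve_config(
    yaml_file={"materialized": "view", "schema": "staging"},
    front_matter={"materialized": "incremental", "unique_key": "id"},
    inline_config={"materialized": "table"},
)
print(resolved)
```

Under the mutually exclusive reading, the same function would instead raise an error whenever more than one source is provided for the same model.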

> The model file might get too big (dozens of columns in a model might be confusing, especially if all the docs are inlined)

This approach definitely doesn't help with the sheer model size, but yaml does offer a few line-saving options if you're willing to get a bit JSON-y:

---
columns:
  - {name: first_col, tests: [not_null, unique]}
  - {name: second_col, tests: [not_null], description: ...}
  - ...
---

select ...

To be clear, what I'm not proposing here is the ability to use Jinja or yaml to dynamically create/populate the front matter contents. I do think that remains, rightly, the domain of code-generation plugins/packages. (Our work on refactoring dbt-core's foundational APIs could eventually unlock even more-capable code generation, in the form of more-direct node generation.)

17 replies

joellabes Feb 22, 2023
Collaborator

Jinja would only be available in the SQL body

There will be some things that {{ config() }} can do, which YFM can't: accepting macros as inputs, and late-rendering hooks. Those are known limitations of yaml configs, and we can solve for them over time; it's out of scope for this work.

Getting Jinja in the YFM probably looks something like this, in pursuit of a day where {{ config() }} and its assorted baggage isn't necessary

jakebiesinger Feb 28, 2023
Author

Just checking in here... I've made a lot of progress but have had to scrap my approach 2-3 times. Getting this to work for a full parse is pretty trivial, but partial parsing is a beast. The PR has been updated with latest progress. I definitely don't understand enough about how this is supposed to work, so a lot of what I'm doing feels like shooting in the dark.

Current status:

- Seems to work 100% for full reparse
- In partial-parsing mode, the saved manifest doesn't seem to be retaining the changes from YFM

There are a lot of things I've tried so far, but I wonder if it would be helpful at this point to connect with someone who understands this code, just so I could get some pointers... say 15 minutes? @joellabes or @jtcohen6, is something like that possible? Or if there's a good Slack channel where I can ask my questions and explain my approach, I'd be all for it. I'm certain I'm doing some silly things that another set of eyes would clear up quickly.

joellabes Feb 28, 2023
Collaborator

Sorry to hear it's been a mission! I'm not a Python dev, so I won't be much help with your implementation woes. The best first step is probably to open a draft PR containing your code and do a self-review, indicating all the places you'd like feedback and the sorts of issues you're running into, so that someone from the Core team can give more directed help.

(Since you asked, there's also a #dbt-core-development channel in the Slack)

jtcohen6 Feb 28, 2023
Maintainer

@jakebiesinger I'd also be open to grabbing some time (~30 min) to chat! I'll send you a DM in the dbt Community Slack; the next few weeks are busy, but we'll see if we can find / make something work.

@gshank has offered to share some wisdom on partial parsing. As Joel recommended, if you could push up a draft PR and call out the problem spots in the code, she'll be able to take a look before we chat. It's possible that the codepaths here are tangled enough, that our best bet will be to agree on a reasonable stopping point, and for someone on our side to carry it over the finish line. (Even if that's the way we choose to go, we'd never have gotten this far if you hadn't first taken the initiative!)

FWIW - this discussion has triggered some valuable internal conversation on our team around configs vs. properties — what's defined where, should these really (finally) just be the same thing. The question for us to answer is: If anything could be passed into {{ config() }} — including documentation and tests, the original prompt for this issue — is there still value in pursuing the yaml frontmatter approach? I think yes! That's the discussion we had above, before we moved into implementation, and I still stand by it. I think YFM could offer a better user experience than the code-as-config story we currently tell with Jinja macros today, even if the latter will still offer (might always offer) greater flexibility for more-advanced use cases.

jakebiesinger Mar 22, 2023
Author

@jtcohen6 @joellabes thanks for the pointers and discussion! Next time, I'll start by asking around in Slack. The codebase has definitely gotten more complex since I last touched it 4-5 years ago. I also somehow missed the blurb about your core-development Slack channel over in your CONTRIBUTING page.

gshank · 2023-03-01T23:30:08Z

Maintainer

@jakebiesinger I took a look at your pull request, and I admire your enthusiasm for diving into some hairy code. However, the approach of creating two different file objects was not the way to go. Once I started thinking about it, it was hard to let it sit, so I borrowed some of your code from yaml_helpers.py and created a draft pull request which calls the model schema parser from the model parser: #7100.

There's still a fair amount of work to do to make sure that edge cases are handled, including config defined in both the SQL files and a schema file, etc., but I think as a proof of concept it looks pretty reasonable.

Thanks for providing the inspiration :) -- we'll be sure to put your name on as a co-contributor.

1 reply

jakebiesinger Mar 22, 2023
Author

@gshank thanks for taking this over. I'm just coming back into this community + code after a long hiatus. I probably should have started asking over in the #dbt-core-development Slack channel or talked through approaches, etc. There's only so much you can do with limited time, and I'm just grateful the work is getting some traction.

raphael-boegel-pcg · 2024-03-06T09:50:10Z

For proper code highlighting, your IDE probably requires a plugin.
And if a plugin is required anyway, maybe the whole thing could be solved by your IDE showing both the model and its documentation next to each other. Wouldn't that be a great feature for the dbt Cloud IDE to start with?

0 replies
