
Restructure data dictionary docs and add a class for etl_groups #2336

Draft
wants to merge 6 commits into base: main
Conversation

aesharpe
Member

@aesharpe aesharpe commented Feb 27, 2023

PR Overview

This PR adds layers to the Data Dictionary documentation (and metadata) so that it's easier for users to see what data are available to them. The best way to view the changes from this PR is to pull this branch, run the docs build, and view the Data Dictionaries page.

We've had a series of conversations about how we structure, present, and view the final data products:

In #2289, Jesse suggested that we organize the PUDL tables by suffix. In looking at what kind of categorization we already had, I noticed the etl_group flag. This is part of the class Resource definition and spans the following categories:

    "eia860",
    "eia861",
    "eia923",
    "entity_eia",
    "epacems",
    "ferc1",
    "ferc1_disabled",
    "ferc714",
    "glue",
    "outputs",
    "static_ferc1",
    "static_eia",
    "static_eia_disabled",
    "eia_bulk_elec",
    "static_pudl",

Previously, this was just a string Literal defined in the type hints. I turned it into a class of its own, ETLGroup, plus a dictionary in constants, ETL_GROUPS, so that it would mirror the clear scope and definition of other, similar categories like contributors or licenses. Once I added definitions for all of the etl_groups, I pulled them into the metadata and used them to organize the flow of the Data Dictionaries page. Instead of landing on a page with all the tables and all the columns in alphabetical order, you land on a page that shows a list of tables by etl_group, with a definition of each etl_group. Clicking on a specific table brings you to the page with that table's column definitions.
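The pattern described here -- a constants dictionary plus a small metadata class with a `from_id` constructor, mirroring contributors and licenses -- might look roughly like this sketch. The real PUDL classes are Pydantic models, and the entries and field names below are abridged and illustrative, not the actual ETL_GROUPS content.

```python
from dataclasses import dataclass

# Hypothetical constants dictionary, mirroring the style of the contributors /
# licenses constants. Real PUDL entries and fields may differ.
ETL_GROUPS = {
    "eia860": {
        "title": "EIA 860",
        "description": "Tables derived from the EIA Form 860.",
    },
    "glue": {
        "title": "Glue Tables",
        "description": "Tables connecting datasets to one another.",
    },
}


@dataclass
class ETLGroup:
    """An ETL group category, analogous to the Contributor/License classes."""

    name: str
    title: str
    description: str

    @classmethod
    def from_id(cls, group_id: str) -> "ETLGroup":
        """Construct an ETLGroup from its key in the ETL_GROUPS constant."""
        return cls(name=group_id, **ETL_GROUPS[group_id])


group = ETLGroup.from_id("eia860")
print(group.title)  # EIA 860
```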

My goal with these changes was to improve the user experience while maintaining searchability. I wanted users to be able to Ctrl-F through the page with all of the column names if they wanted to see where else certain types of data pop up, but I also wanted them to be able to find something specific without having to wade through tons of text and tables.

Ideally I would like to have a similar constant/class structure to include definitions for the SQL databases themselves. This way we can pull an even higher-level description of what's what into the metadata and Datasette. However, I acknowledge that we can only make so many changes before the big Dagster integration.

Ultimately, I could also see us creating three-ish levels of categorization here: 1) by database, 2) by data source (this is already in the metadata), and 3) by table type (the etl_group metadata I just made).

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

…p types not just one long alphabetical list of tables. Took out the table fields for now, want to add them back in later
…d forth between version with columns and version with just table descriptions
… Add etl_group description to the docs, fix the harvest_test.py tests so that etl_groups are accurately represented. Add etl_group to the docstrings in the classes module.
Comment on lines +230 to +232
"outputs": {
"title": "Output Tables",
"description": "Blah",
Member Author

I left this as blah because we still haven't really incorporated this yet and there's a lot there. Also eventually, maybe output is too big a category and we'll do something like eia_output or something like that....idk. But @cmgosnell if you have any edits to any of the descriptions here lmk.

Member

I think we should explain here that these tables are not in the database but we use some of the same schema / typing information for dataframes that we derive from database tables.

In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster.

…nt pudl_db_fields from cleanup_rsts so it gets removed
@zaneselvans added the docs and metadata labels Mar 2, 2023
@zaneselvans
Member

You don't have to pull the branch and build the docs to see the new output. The documentation for PRs is automatically built by Read the Docs, and you can view it by clicking on the "details" link in the checks.

@zaneselvans left a comment

One design issue I see here is that you're introducing coupling between the Package, ETLGroup, and Resource classes. Package should know how to produce RST representing a Package. ETLGroup should know how to produce RST representing an ETLGroup, and Resource should know how to output RST representing a Resource.

These output routines can be nested and have some parameters, but keep the logic for each class within that class.

# Iterates over ETLGroups within package, calling ETLGroup.to_rst()
package.to_rst(by="etl_group")

# Iterates over Resources within package, calling Resource.to_rst()
package.to_rst(by="resource")

And rather than passing in a random Jinja template to Package, it could know which template to use internally, based on which kind of RST output it's generating (by ETLGroup vs. by Resource)
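The delegation pattern being described -- each class rendering only itself, with Package dispatching on the `by` argument -- could be sketched like this. Class and method names are illustrative stand-ins, not the actual PUDL API, and the real methods render via Jinja templates rather than f-strings.

```python
class Resource:
    """A single table; knows how to render only itself as RST."""

    def __init__(self, name: str):
        self.name = name

    def to_rst(self) -> str:
        return f"{self.name}\n{'-' * len(self.name)}\n\n"


class ETLGroup:
    """A group of tables; renders its header, then delegates to each Resource."""

    def __init__(self, title: str, resources: list):
        self.title = title
        self.resources = resources

    def to_rst(self) -> str:
        header = f"{self.title}\n{'=' * len(self.title)}\n\n"
        return header + "".join(r.to_rst() for r in self.resources)


class Package:
    """Dispatches on `by`, but never builds group- or resource-level RST itself."""

    def __init__(self, groups: list):
        self.groups = groups

    def to_rst(self, by: str = "resource") -> str:
        if by == "etl_group":
            # Iterates over ETLGroups, calling ETLGroup.to_rst()
            return "".join(g.to_rst() for g in self.groups)
        # Iterates over Resources, calling Resource.to_rst()
        return "".join(r.to_rst() for g in self.groups for r in g.resources)
```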

I'm not entirely sure what about these changes is breaking the ETL, but my guess is it has to do with the places where you've put an ETLGroup class in a dictionary of parameters, rather than a dictionary that can be used to construct ETLGroup.

@@ -5,7 +5,7 @@ Data Dictionaries

.. toctree::
:caption: Data Processed & Cleaned by PUDL
:maxdepth: 1
:maxdepth: 2
Member

Can we maintain a direct link to pudl_db_fields in the Data Dictionaries section of the navigation sidebar? Right now it seems like the only way to reach it is by navigating through the ETL Group page. If someone knows they want to go straight to Ctrl-F, it should be easy to get there.

@@ -1,11 +1,24 @@
:orphan:
Member

Would just removing this get it back in the navigation sidebar?

Comment on lines -1 to -8
-------------------------------------------------------------------------------
{{ resource.name }}
-------------------------------------------------------------------------------

{{ resource.description | wordwrap(78) if resource.description else
'No table description available.' }}
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__

Member

The removal of this part of the resource output means that the Resource.to_rst() is no longer a representation of a Resource as RST. This seems bad organizationally, and couples Resource output with the Package class unnecessarily and confusingly.

Instead I think you could make output of the resource-level description text + resource schema conditional on a parameter, keeping all of the resource output logic inside the Resource class, but still allowing other code to get at only the portion of the available output that it wants based on what arguments it passes in.

Comment on lines +14 to +20
{{ resource.name }}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

{{ resource.description | wordwrap(78) if resource.description else
'No table description available.' }}
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__

Member

Resource-specific output probably does not belong in the package class or RST template.

Comment on lines +9 to +19
{% for group, resources in package.get_resource_by_etl_group().items() %}
{{ package.get_etl_group()[group].title }}
-------------------------------------------------------------------------------

{{ package.get_etl_group()[group].description }}

{% for resource in resources %}
- :ref:`{{ resource.name }}`

{% endfor %}
{% endfor %}
Member

If you want to be able to output all of the ETLGroups as RST, it seems like creating an ETLGroup.to_rst() method would be cleaner. It could encapsulate all of the logic of producing the RST for the ETLGroup -- whatever text belongs at the top of the ETLGroup, plus iterating through all of the Resources that are part of the ETLGroup and having each of them output itself with Resource.to_rst().

"eia860": {
"title": "EIA 860",
"description": """Tables derrived from the EIA Form 860. See our
:doc:`../data_sources/eia860` page for more information.""",
Member

Embedding hard-coded relative paths to particular RST documents within the string descriptions is going to be fragile. It means there's only one location in the documentation that this description can be used to refer to that other page. If it's an absolute document path instead it would work from anywhere (so long as we don't change the structure of the documentation files), but ideally this should probably be a separate piece of metadata that's attached to the data source, and each ETLGroup class could have an attribute that stores that information separate from the text description, and it could be integrated into the Jinja template used to output the RST.

But since the etl group organization is probably going to be impacted or replaced by the use of Dagster to organize the ETL, I'm not sure it makes sense to spend time on that.

Comment on lines +1818 to +1827
def get_resource_by_etl_group(self) -> dict[str, list[Resource]]:
"""blah."""
resource_dict = {}
for etl_group in ETL_GROUPS.keys():
resource_dict[etl_group] = [
resource
for resource in self.resources
if resource.etl_group == ETLGroup.from_id(etl_group)
]
return resource_dict
Member

This logic seems like it belongs in the ETLGroup not the Package. Then a Package could look up all its own ETL Groups, each ETL Group could look up its constituent resources.
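The inversion being suggested -- each group filtering a package's resources by its own identity, and the package looking up its own distinct groups -- could look like this sketch. All names here are illustrative stand-ins for the PUDL classes, with etl_group kept as a plain string for brevity.

```python
class Resource:
    """Minimal stand-in for a table resource with an etl_group tag."""

    def __init__(self, name: str, etl_group: str):
        self.name = name
        self.etl_group = etl_group


class ETLGroup:
    """Knows how to find its own constituent resources in a package."""

    def __init__(self, name: str):
        self.name = name

    def get_resources(self, package) -> list:
        """Return the resources in `package` that belong to this ETL group."""
        return [r for r in package.resources if r.etl_group == self.name]


class Package:
    """Knows only which groups it contains, not which resources are in each."""

    def __init__(self, resources: list):
        self.resources = resources

    def get_etl_groups(self) -> list:
        """Look up the distinct ETL groups represented among the resources."""
        return [ETLGroup(n) for n in sorted({r.etl_group for r in self.resources})]
```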

@@ -35,6 +35,7 @@ def _assert_frame_equal(a: pd.DataFrame, b: pd.DataFrame, **kwargs: Any) -> None
],
"primary_key": ["i", "j"],
},
"etl_group": ETLGroup.from_id("static_pudl"),
Member

I think you may want a dictionary representation of the class here, not an instantiated class itself.

@@ -338,6 +347,7 @@ def test_resource_with_only_key_fields_harvests() -> None:
d["schema"]["fields"] = [
{"name": name, "type": FIELD_DTYPES[name]} for name in d["schema"]["fields"]
]
d["etl_group"] = ETLGroup.from_id(d["etl_group"])
Member

And again here, do you really want the instantiated Pydantic model? Or a dictionary that can be used to instantiate it when unpacked?
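The distinction being raised -- keep the raw dict in the fixture and let the model's constructor do the coercion, rather than pre-instantiating the nested model -- can be sketched like this. The coercing constructor below stands in for what a Pydantic validator would do; the class and field names are illustrative.

```python
class ETLGroup:
    def __init__(self, name: str):
        self.name = name


class Resource:
    def __init__(self, name: str, etl_group):
        self.name = name
        # Coerce a raw string id into an ETLGroup, the way a Pydantic
        # validator would when the Resource is built from a plain dict.
        self.etl_group = (
            ETLGroup(etl_group) if isinstance(etl_group, str) else etl_group
        )


# The fixture keeps plain data; coercion happens inside the constructor.
d = {"name": "my_table", "etl_group": "static_pudl"}
resource = Resource(**d)
```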

for name in included_sources:
source = DataSource.from_id(name)
source_resources = [res for res in package.resources if res.etl_group == name]
source_resources = [
res for res in package.resources if res.etl_group == ETLGroup.from_id(name)
Member

Here you're comparing two instances of ETLGroup using ==. Are you sure that does what you want? How is equality defined for these classes? Do you really want to check whether the ETLGroup names (which are strings) are the same?
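The equality question is worth spelling out: both dataclasses and Pydantic models compare field by field by default, so two ETLGroup instances with the same name but different descriptions are not equal. A minimal illustration, using a dataclass as a stand-in for the real model:

```python
from dataclasses import dataclass


@dataclass
class ETLGroup:
    name: str
    description: str = ""


# Equal only when *every* field matches...
assert ETLGroup("eia860") == ETLGroup("eia860")
assert ETLGroup("eia860") != ETLGroup("eia860", "Tables from EIA Form 860")

# ...so comparing names directly is the unambiguous check:
assert ETLGroup("eia860", "x").name == ETLGroup("eia860", "y").name
```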

@aesharpe
Member Author

aesharpe commented Mar 2, 2023

Thanks for the comments @zaneselvans!

In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster.

My intention is to create a sub-resource table metadata grouping structure that we can build off when we implement the Dagster branch. Which aspects of this PR do you think are worth investing time in between now and then? I could see us changing the name from etl_group to something else down the road if that type of categorization becomes obsolete.

@zaneselvans
Member

I agree that with hundreds of tables across all of the databases we're producing now, and the prospect of ~100 tables even just in the main PUDL DB once the Dagster refactor is done, some kind of categorization and grouping of the tables is definitely required for users to navigate the available data. I think what those categories / groupings should look like was the topic of this discussion.

I think those conceptual / organizational groupings should probably be pretty distinct from the software / data pipeline, and using our ad-hoc ETL Groups for that purpose here entangles structural ETL stuff with documentation / semantic categories in a way that I'm not sure is helpful. The ETL groups were really just added as a way to address groups of tables that needed to be treated similarly in the old ETL. It was a little hack.

I think there are probably multiple dimensions that we're going to want to use to organize the new universe of tables:

  • What stage of the data pipeline is this data from? Is it freshly extracted? Cleaned? Normalized? Denormalized? Derived outputs?
  • Where did the data come from? Is it EIA-860? EIA-923? FERC-1? A combination of different sources?

I suspect that we're going to end up with a large number of tables that aren't meant for direct use by most kinds of users: tables from early in the pipeline, highly normalized tables, skinny tables that contain imputed values or particular analytical outputs, maybe crosswalks that allow joins between tables that were hard to work with like FERC1+EIA. These are the building blocks that can be assembled by us or others comfortable with databases and Python to build new things, or to track and debug existing processes. We probably won't want to expose all of these tables in the documentation (or at least, not all in the same part of the documentation).

Then we'll also probably have a smaller number of wide tables with tons of denormalized data and imputed / estimated / aggregated values in them, that most people use for analysis, selecting the subset of columns they actually need for whatever purpose they have in mind.

So I think a useful thing in terms of producing table-level documentation would probably be to allow the generation of documentation based on an arbitrary group of tables (maybe a ResourceGroup?) that is conceptually separated from the data pipeline, and can be customized to produce different docs outputs on different pages, not necessarily showing all of the tables in the database.

Datasette Templates

I think getting some kind of database-level descriptions integrated into the top level https://data.catalyst.coop page with a custom Datasette Jinja template would be helpful for directing people to the right outputs (especially with the XBRL vs. DBF vs. PUDL versions of FERC 1 data)

I think getting the existing table-level descriptions onto the individual database pages on Datasette would be helpful, e.g. on https://data.catalyst.coop/pudl have a table description under the table name, and before the list of columns. I think this would mean a custom Datasette Jinja template.

Base automatically changed from dev to main January 5, 2024 04:14