Restructure data dictionary docs and add a class for etl_groups #2336

aesharpe · 2023-02-27T21:25:37Z

PR Overview

This PR adds layers to the Data Dictionary documentation (and metadata) so that it's easier for users to see what data are available to them. The best way to view the changes from this PR is to pull this branch, run the docs build, and view the Data Dictionaries page.

We've had a series of conversations about how we structure, present, and view the final data products:

In #2289, Jesse suggested that we organize the PUDL tables by suffix. In looking at what kind of categorization we already had, I noticed the etl_group flag. This is part of the class Resource definition and spans the following categories:

    "eia860",
    "eia861",
    "eia923",
    "entity_eia",
    "epacems",
    "ferc1",
    "ferc1_disabled",
    "ferc714",
    "glue",
    "outputs",
    "static_ferc1",
    "static_eia",
    "static_eia_disabled",
    "eia_bulk_elec",
    "static_pudl",

Previously, this was just a string Literal that was defined in the type hints. I turned it into a class of its own, class ETLGroup and a dictionary in constants ETL_GROUPS so that it would mirror the clear scope and definition of other, similar categories like contributors or licenses. Once I added definitions to all of the etl_groups, I pulled them into the metadata, using them to organize the flow of the Data Dictionaries Page. Instead of landing on a page with all the tables and all the columns in alphabetical order, you land on a page that shows you a list of tables by etl_group with a definition of each etl_group. When you click on a specific table it brings you to the page with the table and column definitions.

My goal with these changes was to improve user experience while maintaining searchability / abundance. I wanted users to be able to ctr F through the page with allllll the column names if they wanted to see where else certain types of data pop up, but I also wanted them to be able to find something specific without having to wade through tons of text and tables.

Ideally I would like to have a similar constant/class structure to include definitions for the SQL dbs themselves. This way we can pull and even higher-level description of what's what into the metadata and Datasette. However, I acknowledge that we can only make so many changes before the big Dagster integration.

Ultimately, I could also see us creating three-ish levels of categorization here: 1) by database, 2) by data source (this is already in the metadata) 3) by table type (etl_group metadata I just made).

PR Checklist

Merge the most recent version of the branch you are merging into (probably dev).
All CI checks are passing. Run tests locally to debug failures
Make sure you've included good docstrings.
For major data coverage & analysis changes, run data validation tests
Include unit tests for new functions and classes.
Defensive data quality/sanity checks in analyses & data processing functions.
Update the release notes and reference reference the PR and related issues.
Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

…p types not just one long alphabetical list of tables. Took out the table fields for now, want to add them back in later

…d forth between version with columns and version with just table descriptions

… Add etl_group description to the docs, fix the harvest_test.py tests so that etl_groups are accurately represented. Add etl_group to the docstrings in the classes module.

…rat for Package.to_rst()

aesharpe · 2023-02-27T21:29:51Z

src/pudl/metadata/constants.py

+    "outputs": {
+        "title": "Output Tables",
+        "description": "Blah",


I left this as blah because we still haven't really incorporated this yet and there's a lot there. Also eventually, maybe output is too big a category and we'll do something like eia_output or something like that....idk. But @cmgosnell if you have any edits to any of the descriptions here lmk.

I think we should explain here that these tables are not in the database but we use some of the same schema / typing information for dataframes that we derive from database tables.

In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster.

…nt pudl_db_fields from cleanup_rsts so it gets removed

zaneselvans · 2023-03-02T17:26:26Z

You don't have to pull the branch and build the docs to see the new output. The documentation for PRs is automatically build by Read The Docs, and you can go view them by clicking on the "details" link in the checks.

zaneselvans

One design issue I see here is that you're introducing coupling between the Package, ETLGroup, and Resource classes. Package should know how to produce RST representing a Package. ETLGroup should know how to produce RST representing an ETLGroup, and Resource should know how to output RST representing a Resource.

These output routines can be nested and have some parameters, but keep the logic for each class within that class.

# Iterates over ETLGroups within package, calling ETLGroup.to_rst()
package.to_rst(by="etl_group")

# Iterates over Resources within package, calling Resource.to_rst()
package.to_rst(by="resource")

And rather than passing in a random Jinja template to Package, it could know which template to use internally, based on which kind of RST output it's generating (by ETLGroup vs. by Resource)

I'm not entirely sure what about these changes is breaking the ETL, but my guess is it has to do with the places where you've put an ETLGroup class in a dictionary of parameters, rather than a dictionary that can be used to construct ETLGroup.

zaneselvans · 2023-03-02T17:37:20Z

docs/data_dictionaries/index.rst

@@ -5,7 +5,7 @@ Data Dictionaries

 .. toctree::
   :caption: Data Processed & Cleaned by PUDL
-   :maxdepth: 1
+   :maxdepth: 2


Can we maintain a direct link to pudl_db_fields in the Data Dictionaries section of the navigation sidebar? right now it seems like the only way to reach it is by navigating through the ETL Group page. If someone knows they want to go do Ctrl-F it should be easy to get there.

zaneselvans · 2023-03-02T17:40:05Z

docs/templates/package.rst.jinja

@@ -1,11 +1,24 @@
+:orphan:


Would just removing this get it back in the navigation sidebar?

zaneselvans · 2023-03-02T17:45:47Z

docs/templates/resource.rst.jinja

-------------------------------------------------------------------------------
-{{ resource.name }}
-------------------------------------------------------------------------------
-
-{{ resource.description | wordwrap(78) if resource.description else
-'No table description available.' }}
-`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__
-


The removal of this part of the resource output means that the Resource.to_rst() is no longer a representation of a Resource as RST. This seems bad organizationally, and couples Resource output with the Package class unnecessarily and confusingly.

Instead I think you could make output of the resource level description text + resource schema conditional with a a parameter, keeping all of the resource output logic inside the resource class, but still allowing other code to get at only the portion of that available output that they want based on what arguments they pass in.

zaneselvans · 2023-03-02T17:46:53Z

docs/templates/package.rst.jinja

+{{ resource.name }}
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+{{ resource.description | wordwrap(78) if resource.description else
+'No table description available.' }}
+`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__
+


Resource-specific output probably does not belong in the package class or RST template.

zaneselvans · 2023-03-02T17:53:01Z

docs/templates/etl_group.rst.jinja

+{% for group, resources in package.get_resource_by_etl_group().items() %}
+{{ package.get_etl_group()[group].title }}
+-------------------------------------------------------------------------------
+
+{{ package.get_etl_group()[group].description }}
+
+{% for resource in resources %}
+- :ref:`{{ resource.name }}`
+
+{% endfor %}
+{% endfor %}


If you want to be able out output all of the ETLGroups as RST, it seems like creating an ETLGroup.to_rst() method would be cleaner. It could encapsulate all of the logic of producing the RST for the ETLGroup -- whatever text belongs at the top of the ETLGroup + iterating through all of the Resources that are part of the ETLGroup and having each of them output themselves with Resource.to_rst().

zaneselvans · 2023-03-02T18:39:48Z

src/pudl/metadata/constants.py

+    "eia860": {
+        "title": "EIA 860",
+        "description": """Tables derrived from the EIA Form 860. See our
+:doc:`../data_sources/eia860` page for more information.""",


Embedding hard-coded relative paths to particular RST documents within the string descriptions is going to be fragile. It means there's only one location in the documentation that this description can be used to refer to that other page. If it's an absolute document path instead it would work from anywhere (so long as we don't change the structure of the documentation files), but ideally this should probably be a separate piece of metadata that's attached to the data source, and each ETLGroup class could have an attribute that stores that information separate from the text description, and it could be integrated into the Jinja template used to output the RST.

But since the etl group organization is probably going to impacted or replaced by the use of Dagster to organize the ETL, I'm not sure it makes sense to spend time on that.

zaneselvans · 2023-03-02T18:41:47Z

src/pudl/metadata/classes.py

+    def get_resource_by_etl_group(self) -> dict[str, list[Resource]]:
+        """blah."""
+        resource_dict = {}
+        for etl_group in ETL_GROUPS.keys():
+            resource_dict[etl_group] = [
+                resource
+                for resource in self.resources
+                if resource.etl_group == ETLGroup.from_id(etl_group)
+            ]
+        return resource_dict


This logic seems like it belongs in the ETLGroup not the Package. Then a Package could look up all its own ETL Groups, each ETL Group could look up its constituent resources.

zaneselvans · 2023-03-02T18:43:22Z

test/unit/harvest_test.py

@@ -35,6 +35,7 @@ def _assert_frame_equal(a: pd.DataFrame, b: pd.DataFrame, **kwargs: Any) -> None
        ],
        "primary_key": ["i", "j"],
    },
+    "etl_group": ETLGroup.from_id("static_pudl"),


I think you may want a dictionary representation of the class here, not an instantiated class itself.

zaneselvans · 2023-03-02T18:47:18Z

test/unit/harvest_test.py

@@ -338,6 +347,7 @@ def test_resource_with_only_key_fields_harvests() -> None:
    d["schema"]["fields"] = [
        {"name": name, "type": FIELD_DTYPES[name]} for name in d["schema"]["fields"]
    ]
+    d["etl_group"] = ETLGroup.from_id(d["etl_group"])


And again here, do you really want the instantiated Pydantic model? Or a dictionary that can be used to instantiate it when unpacked?

zaneselvans · 2023-03-02T18:50:50Z

docs/conf.py

    for name in included_sources:
        source = DataSource.from_id(name)
-        source_resources = [res for res in package.resources if res.etl_group == name]
+        source_resources = [
+            res for res in package.resources if res.etl_group == ETLGroup.from_id(name)


Here you're comparing two instances of ETLGroup using ==. Are you sure that does what you want? How is equality defined for these classes? Do you really want to check whether the ETLGroup names (which are strings) are the same?

aesharpe · 2023-03-02T19:53:57Z

Thanks for the comments @zaneselvans!

In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster.

My intention is to create a sub-resource table metadata grouping structure that we can build off when we implement the Dagster branch. Which aspects of this PR do you think are worth investing time in between now and then? I could see us changing the name from etl_group to something else down the road if that type of categorization becomes obsolete.

zaneselvans · 2023-03-02T20:48:41Z

I agree that with hundreds of tables across all of the databases we're producing now, and the prospect of ~100 tables even just in the main PUDL DB once the Dagster refactor is done, some kind of categorization and grouping of the tables is definitely required for users to navigate the available data. I think what those categories / groupings should look like was the topic of this discussion.

I think those conceptual / organizational groupings should probably be pretty distinct from the software / data pipeline, and using our ad-hoc ETL Groups for that purpose here entangles structural ETL stuff with documentation / semantic categories in a way that I'm not sure is helpful. The ETL groups were really just added as a way to address groups of tables that needed to be treated similarly in the old ETL. It was a little hack.

I think there are probably multiple dimensions that we're going to want to use to organize the new universe of tables:

What stage of the data pipeline is this data from? Is it freshly extracted? Cleaned? Normalized? Denormalized? Derived outputs?
Where did the data come from? Is it EIA-860? EIA-923? FERC-1? A combination of different sources?

I suspect that we're going to end up with a large number of tables that aren't meant for direct use by most kinds of users: tables from early in the pipeline, highly normalized tables, skinny tables that contain imputed values or particular analytical outputs, maybe crosswalks that allow joins between tables that were hard to work with like FERC1+EIA. These are the building blocks that can be assembled by us or others comfortable with databases and Python to build new things, or track and debug existing processes. We probably won't want to expose all of these tables in the documentation (or at least, not all in the same part of the documentation)

Then we'll also probably have a smaller number of wide tables with tons of denormalized data and imputed / estimated / aggregated values in them, that most people use for analysis, selecting the subset of columns they actually need for whatever purpose they have in mind.

So I think a useful thing in terms of producing table-level documentation would probably be to allow the generation of documentation based on an arbitrary group of tables (maybe a ResourceGroup?) that is conceptually separated from the data pipeline, and can be customized to produce different docs outputs on different pages, not necessarily showing all of the tables in the database.

Datasette Templates

I think getting some kind of database-level descriptions integrated into the top level https://data.catalyst.coop page with a custom Datasette Jinja template would be helpful for directing people to the right outputs (especially with the XBRL vs. DBF vs. PUDL versions of FERC 1 data)

I think getting the existing table-level descriptions onto the individual database pages on Datasette would be helpful, e.g. on https://data.catalyst.coop/pudl have a table description under the table name, and before the list of columns. I think this would mean a custom Datasette Jinja template.

aesharpe added 5 commits February 22, 2023 22:57

Update data dictionary docs page so that there are groups of etl_grou…

26aa741

…p types not just one long alphabetical list of tables. Took out the table fields for now, want to add them back in later

Add pudl_db_fields.rst file and links that allow you to click back an…

ed394ed

…d forth between version with columns and version with just table descriptions

Make ETLGroup class and add some descriptions to ETL_GROUPS constant.…

01a670d

… Add etl_group description to the docs, fix the harvest_test.py tests so that etl_groups are accurately represented. Add etl_group to the docstrings in the classes module.

Rename test.rst.jinja to etl_group.rst.jinja and remove add_fields pa…

868a67a

…rat for Package.to_rst()

Add more descriptions to etl_groups constants

79e07b7

aesharpe requested review from zaneselvans and cmgosnell February 27, 2023 21:25

aesharpe self-assigned this Feb 27, 2023

aesharpe requested a review from bendnorman February 27, 2023 21:26

aesharpe commented Feb 27, 2023

View reviewed changes

Add in etl_group.rst.jinja jinja template that I forgot. Also uncomme…

295886a

…nt pudl_db_fields from cleanup_rsts so it gets removed

zaneselvans added docs Documentation for users and contributors. metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. labels Mar 2, 2023

zaneselvans requested changes Mar 2, 2023

View reviewed changes

Base automatically changed from dev to main January 5, 2024 04:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restructure data dictionary docs and add a class for etl_groups #2336

Restructure data dictionary docs and add a class for etl_groups #2336

aesharpe commented Feb 27, 2023 •

edited

Loading

aesharpe Feb 27, 2023

zaneselvans Mar 2, 2023

zaneselvans commented Mar 2, 2023

zaneselvans left a comment

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

zaneselvans Mar 2, 2023

aesharpe commented Mar 2, 2023 •

edited

Loading

zaneselvans commented Mar 2, 2023

Restructure data dictionary docs and add a class for etl_groups #2336

Are you sure you want to change the base?

Restructure data dictionary docs and add a class for etl_groups #2336

Conversation

aesharpe commented Feb 27, 2023 • edited Loading

PR Overview

PR Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zaneselvans commented Mar 2, 2023

zaneselvans left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aesharpe commented Mar 2, 2023 • edited Loading

zaneselvans commented Mar 2, 2023

Datasette Templates

aesharpe commented Feb 27, 2023 •

edited

Loading

aesharpe commented Mar 2, 2023 •

edited

Loading