
Tweak metadata and pyarrow schema methods to work for all tables #3222

Merged 4 commits into main on Jan 9, 2024

Conversation

@zaneselvans (Member) commented Jan 8, 2024

Overview

What problem/need does this address?

Previously, we were unable to use the Resource.to_pyarrow() method to generate PyArrow schemas for all of our tables, and had historically only used it for the billion-row EPA CEMS table. It turns out that the issues preventing the schemas from being valid were minor and easy to fix:

  • Some tables and fields were missing a description field, which we were assuming would be present and trying to write into the Parquet metadata.
  • Some tables and fields don't have a primary_key but we were assuming one always existed, and trying to write that into the Parquet metadata as well.

What did you change?

  • I added a few missing titles / descriptions and also added a conditional that uses an empty string if the description is None.
  • I added a conditional that only writes the primary_key to the Parquet metadata if it is not None (see the sketch after this list).
  • Added a unit test ensuring that all Resources can generate their PyArrow schemas.
  • Added a unit test ensuring that all defined fields can actually be instantiated (we had some bad ones with typos that were never used).
  • Removed the unfinished and problematic entity_types_eia table (and its vestigial renamed form, core_eia__codes_entity_types).
  • Added some currently xfail tests for: unused fields, fields without descriptions, and resources without descriptions.
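
The shape of the schema-level fix referenced above is sketched below. This is a minimal sketch based on the PR description and diff, not the exact PUDL implementation: the attribute names (self.schema.fields, self.schema.primary_key) and the comma-joined key encoding are assumptions.

import pyarrow as pa

def to_pyarrow(self) -> pa.Schema:
    """Sketch of Resource.to_pyarrow() with the None-handling fixes."""
    fields = [field.to_pyarrow() for field in self.schema.fields]
    metadata = {
        # Arrow metadata values must be strings, so fall back to "".
        "description": self.description if self.description is not None else "",
    }
    # Only record a primary key if the table actually has one.
    if self.schema.primary_key is not None:
        metadata["primary_key"] = ",".join(self.schema.primary_key)
    return pa.schema(fields, metadata=metadata)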

Part of #3102
Follow-up in #3224

Testing

I iterated through every Resource we generate, attempted to create a PyArrow schema from each one, and then used that schema to write the table out to a Parquet file.
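
In sketch form (the empty-table shortcut and output path here are illustrative assumptions; the real check wrote actual table data):

import pyarrow as pa
import pyarrow.parquet as pq

# PUDL_RESOURCES maps resource names to Resource objects, as in the
# unit tests quoted later in this thread.
for name, resource in PUDL_RESOURCES.items():
    schema = resource.to_pyarrow()  # must not raise for any table
    # An empty table is enough to prove the schema is Parquet-writable:
    table = pa.Table.from_pylist([], schema=schema)
    pq.write_table(table, f"{name}.parquet")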

@zaneselvans added labels on Jan 8, 2024: output (Exporting data from PUDL into other platforms or interchange formats), metadata (Anything having to do with the content, formatting, or storage of metadata; mostly datapackages), parquet (Issues related to the Apache Parquet file format which we use for long tables), data-types (Dtype conversions, standardization and implications of data types)
@zaneselvans self-assigned this on Jan 8, 2024
@zaneselvans linked an issue on Jan 8, 2024 that may be closed by this pull request
@e-belfer (Member) left a comment


I understand the primary key not always existing, but do we not want to enforce all columns and tables to have a description? That seems to be something worth enforcing, even though enforcing it at the site of writing it to Parquet doesn't seem ideal. Can we use pydantic to require all fields/tables to have a description from jump?
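
A minimal sketch of what that Pydantic requirement could look like, assuming a simplified stand-in for pudl.metadata.classes.Field and Pydantic v2:

from pydantic import BaseModel, field_validator

class Field(BaseModel):
    name: str
    description: str  # no default: a missing description fails validation

    @field_validator("description")
    @classmethod
    def _description_nonempty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("description must be a non-empty string")
        return v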

@@ -624,7 +624,9 @@ def to_pyarrow(self) -> pa.Field:
     name=self.name,
     type=self.to_pyarrow_dtype(),
     nullable=(not self.constraints.required),
-    metadata={"description": self.description},
+    metadata={
+        "description": self.description if self.description is not None else ""
+    },
Member

Is there any reason we would not want to force all of our columns to have descriptions?

Member Author

Ideally, no. We want them all to have descriptions.

Practically, this is a bit of work. Many of the columns that didn't have descriptions are extremely generic, with notes about the need to disambiguate them across the many different tables they show up in, since they don't always mean the same thing. So at minimum they need table-specific descriptions, and they may need to be renamed so that there's closer to a 1:1 mapping between column name and description.

And then there's the EIA-861, which has dozens of tables and hundreds of columns with no descriptions.

We can certainly use Pydantic to force them all to have descriptions (and we probably should), but since we don't actually know what all of these columns are, there's probably a couple of days' worth of work to get everything filled in with clear and accurate metadata.

Member

I'm ok to split this off into its own issue if we're able to catalog all the columns that don't currently have descriptions.

Member Author

I've created #3224 enumerating:

  • resources without descriptions
  • unused fields
  • fields without descriptions

- Excise non-working core_eia__codes_entity_types table.
- Fix a couple of typos in field metadata
- Add unit test to identify unused fields
- Add unit test to check for descriptions in all fields & resources
- Add unit test that ensures all resources can generate PyArrow schemas
@@ -137,7 +137,7 @@ def data_dictionary_metadata_to_rst(app):
     """Export data dictionary metadata to RST for inclusion in the documentation."""
     # Create an RST Data Dictionary for the PUDL DB:
     print("Exporting PUDL DB data dictionary metadata to RST.")
-    skip_names = ["datasets", "accumulated_depreciation_ferc1", "entity_types_eia"]
+    skip_names = ["datasets", "accumulated_depreciation_ferc1"]
Member Author

The never-quite-finished entity_types_eia table (now core_eia__codes_entity_types) was lingering and creating weird special cases, so I excised it.

src/pudl/metadata/codes.py (resolved)
Comment on lines -1731 to +1740
"descripion": "The name of the EIA operator utility.",
"description": "The name of the EIA operator utility.",
},
"operator_state": {
"type": "string",
"description": "The state where the operator utility is located.",
},
"operator_utility_id_eia": {
"type": "integer",
"descrption": "The EIA utility Identification number for the operator utility.",
"description": "The EIA utility Identification number for the operator utility.",
Member Author

Typos which were not causing errors because these fields aren't actually used anywhere, and so are never instantiated.
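
A sketch of the instantiation test mentioned in the overview, which is what surfaces typos like these (it assumes FIELD_METADATA maps field names to their attributes, as in the unused-fields test below, and that constructing Field triggers validation):

import pytest

@pytest.mark.parametrize("field_name", sorted(FIELD_METADATA.keys()))
def test_fields_are_instantiable(field_name: str):
    """Constructing the Field forces validation, so a key like
    'descripion' fails immediately instead of lying dormant."""
    _ = Field(name=field_name, **FIELD_METADATA[field_name])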

Comment on lines +59 to +60
"title": "PUDL Utility-Plant Associations",
"description": "Associations between PUDL utility IDs and PUDL plant IDs. This table is read in from a spreadsheet stored in the PUDL repository: src/pudl/package_data/glue/pudl_id_mapping.xlsx",
Member Author

These are a few tables that I added titles and descriptions to before realizing there were dozens, and then implemented the hacky workaround.

test/unit/metadata_test.py (resolved)
Comment on lines +47 to +50
@pytest.mark.parametrize("resource_name", sorted(PUDL_RESOURCES.keys()))
def test_pyarrow_schemas(resource_name: str):
    """Verify that we can produce pyarrow schemas for all defined Resources."""
    _ = PUDL_RESOURCES[resource_name].to_pyarrow()
Member Author

This was the main new test I wanted to add -- just so we catch any changes that happen to invalidate our PyArrow schemas (especially since we aren't actually using the schemas yet, so breakage wouldn't show up in the builds).

test/unit/metadata_test.py (resolved)
Comment on lines +59 to +70
@pytest.mark.xfail(reason="Need to purge unused fields. See issue #3224")
def test_defined_fields_are_used():
    """Check that all fields which are defined are actually used."""
    used_fields = set()
    for resource in PUDL_RESOURCES.values():
        used_fields |= {f.name for f in resource.schema.fields}
    defined_fields = set(FIELD_METADATA.keys())
    unused_fields = sorted(defined_fields - used_fields)
    if len(unused_fields) > 0:
        raise AssertionError(
            f"Found {len(unused_fields)} unused fields: {unused_fields}"
        )
Member Author

I don't think there are good reasons to have unused fields lying around. I suspect that the ones we've got are just leftovers where we renamed a column and never purged the old definition.
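
The companion xfail checks for missing descriptions mentioned in the overview (but not quoted here) follow the same pattern as the test above. A sketch of the field-level version:

@pytest.mark.xfail(reason="Many fields lack descriptions. See issue #3224")
def test_fields_have_descriptions():
    """Check that every defined field carries a description."""
    undescribed = sorted(
        name
        for name, attrs in FIELD_METADATA.items()
        if not attrs.get("description")
    )
    if len(undescribed) > 0:
        raise AssertionError(
            f"Found {len(undescribed)} fields without descriptions: {undescribed}"
        )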

src/pudl/metadata/resources/eia.py (resolved)
test/unit/metadata_test.py (resolved)
@bendnorman (Member) left a comment

Yay! I'm glad this was an easy fix.

src/pudl/metadata/codes.py (resolved)
test/unit/metadata_test.py (resolved)
@zaneselvans marked this pull request as ready for review January 9, 2024 20:21
@zaneselvans enabled auto-merge (squash) January 9, 2024 20:40
@zaneselvans merged commit d62dd5c into main on Jan 9, 2024
11 checks passed
@zaneselvans deleted the pyarrow-schema-fixes branch January 9, 2024 22:16
Development

Successfully merging this pull request may close these issues.

Output PUDL as Parquet as well as SQLite