Tweak metadata and pyarrow schema methods to work for all tables #3222
Conversation
I understand the primary key not always existing, but do we not want to enforce all columns and tables to have a description? That seems to be something worth enforcing, even though enforcing it at the site of writing it to Parquet doesn't seem ideal. Can we use pydantic to require all fields/tables to have a description from jump?
```diff
@@ -624,7 +624,9 @@ def to_pyarrow(self) -> pa.Field:
             name=self.name,
             type=self.to_pyarrow_dtype(),
             nullable=(not self.constraints.required),
-            metadata={"description": self.description},
+            metadata={
+                "description": self.description if self.description is not None else ""
```
Is there any reason we would not want to force all of our columns to have descriptions?
Ideally, no. We want them all to have descriptions.
Practically, this is a bit of work. In the case of the columns, many of the ones that didn't have descriptions were extremely generic, and had notes about the need to disambiguate them between the many different tables they show up in, since they don't always mean the same thing. At a minimum they need table-specific descriptions, and they may need to be renamed so that there's closer to a 1:1 mapping between column name and description.
And then there's the EIA-861, which has dozens of tables and hundreds of columns with no descriptions.
We can certainly use Pydantic to force them all to have descriptions (and we probably should), but there's probably a couple of days' worth of work to get everything filled in with clear and accurate metadata, since we don't actually know what all of these columns are.
I'm ok to split this off into its own issue if we're able to catalog all the columns that don't currently have descriptions.
I've created #3224 enumerating:
- resources without descriptions
- unused fields
- fields without descriptions
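The "fields without descriptions" bucket can be enumerated mechanically. A sketch of such a check, assuming a mapping of field names to metadata dicts (the real structure of `FIELD_METADATA` is assumed here, not verified):

```python
def fields_without_descriptions(field_metadata: dict[str, dict]) -> list[str]:
    """Return the names of fields whose description is missing or blank."""
    return sorted(
        name
        for name, meta in field_metadata.items()
        if not (meta.get("description") or "").strip()
    )
```

A field defined as `{"type": "integer"}` with no description key, or with a blank description, would be flagged.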
- Excise non-working core_eia__codes_entity_types table.
- Fix a couple of typos in field metadata.
- Add unit test to identify unused fields.
- Add unit test to check for descriptions in all fields & resources.
- Add unit test that ensures all resources can generate PyArrow schemas.
```diff
@@ -137,7 +137,7 @@ def data_dictionary_metadata_to_rst(app):
     """Export data dictionary metadata to RST for inclusion in the documentation."""
     # Create an RST Data Dictionary for the PUDL DB:
     print("Exporting PUDL DB data dictionary metadata to RST.")
-    skip_names = ["datasets", "accumulated_depreciation_ferc1", "entity_types_eia"]
```
The never-quite-finished `entity_types_eia` table (now `core_eia__codes_entity_types`) was malingering and creating weird special cases, so I excised it.
```diff
-        "descripion": "The name of the EIA operator utility.",
+        "description": "The name of the EIA operator utility.",
     },
     "operator_state": {
         "type": "string",
         "description": "The state where the operator utility is located.",
     },
     "operator_utility_id_eia": {
         "type": "integer",
-        "descrption": "The EIA utility Identification number for the operator utility.",
+        "description": "The EIA utility Identification number for the operator utility.",
```
Typos which were not causing errors because these fields aren't actually used anywhere, and so are never instantiated.
```python
"title": "PUDL Utility-Plant Associations",
"description": "Associations between PUDL utility IDs and PUDL plant IDs. This table is read in from a spreadsheet stored in the PUDL repository: src/pudl/package_data/glue/pudl_id_mapping.xlsx",
```
A few tables that I added titles and descriptions to before realizing there were dozens more, and implementing the hacky workaround instead.
```python
@pytest.mark.parametrize("resource_name", sorted(PUDL_RESOURCES.keys()))
def test_pyarrow_schemas(resource_name: str):
    """Verify that we can produce pyarrow schemas for all defined Resources."""
    _ = PUDL_RESOURCES[resource_name].to_pyarrow()
```
This was the main new test I wanted to add -- just so we catch any changes which happen to invalidate our PyArrow schemas (especially since we aren't actually using these schemas yet, so breakage wouldn't otherwise show up in the builds).
```python
@pytest.mark.xfail(reason="Need to purge unused fields. See issue #3224")
def test_defined_fields_are_used():
    """Check that all fields which are defined are actually used."""
    used_fields = set()
    for resource in PUDL_RESOURCES.values():
        used_fields |= {f.name for f in resource.schema.fields}
    defined_fields = set(FIELD_METADATA.keys())
    unused_fields = sorted(defined_fields - used_fields)
    if len(unused_fields) > 0:
        raise AssertionError(
            f"Found {len(unused_fields)} unused fields: {unused_fields}"
        )
```
I don't think there are good reasons to have unused fields laying around. I suspect that the ones we've got are just leftovers where we renamed a column and never purged the old definition.
Yay! I'm glad this was an easy fix.
Overview
What problem/need does this address?
Previously, we were unable to use the `Resource.to_pyarrow()` method to generate PyArrow schemas for all of our tables, and had historically only used it for the billion-row EPA CEMS table. It turns out that the issues preventing the schemas from being valid were minor and easy to fix:

- Some fields lack a `description`, which we were assuming would always be present and trying to write into the Parquet metadata.
- Some tables lack a `primary_key`, but we were assuming one always existed, and trying to write that into the Parquet metadata as well.

What did you change?

- Only write the field `description` into the Parquet metadata if it is not `None`.
- Only write the `primary_key` into the Parquet metadata if it is not `None`.
- Excised the never-quite-finished `entity_types_eia` table (and its renamed vestigial form, `core_eia__codes_entity_types`).
- Added `xfail` tests for: unused fields, fields without descriptions, and resources without descriptions.

Part of #3102
Followup in #3224
Testing
I iterated through every `Resource` we generate, attempted to create a PyArrow schema from it, and then attempted to use that schema to write the table out to a Parquet file successfully.

To-do list