
Tweak metadata and pyarrow schema methods to work for all tables #3222

Merged 4 commits into main on Jan 9, 2024

Conversation

@zaneselvans (Member) commented Jan 8, 2024

Overview

What problem/need does this address?

Previously, we were unable to use the Resource.to_pyarrow() method to generate PyArrow schemas for all of our tables, and had historically only used it for the billion-row EPA CEMS table. It turns out that the issues preventing the schemas from being valid were minor and easy to fix:

  • Some tables and fields were missing a description field, which we were assuming would be present and trying to write into the Parquet metadata.
  • Some tables and fields don't have a primary_key but we were assuming one always existed, and trying to write that into the Parquet metadata as well.

What did you change?

  • I added a few missing titles / descriptions and also added a conditional that uses an empty string if the description is None.
  • I added a conditional that only writes the primary_key to the Parquet metadata if it is not None (see the sketch after this list).
  • Added a unit test ensuring that all Resources can generate their PyArrow schemas.
  • Added a unit test ensuring that all defined fields can actually be instantiated (we had some bad ones with typos that were never used).
  • Removed the unfinished and problematic entity_types_eia table (and its vestigial renamed form, core_eia__codes_entity_types).
  • Added some currently xfail tests for: unused fields, fields without descriptions, and resources without descriptions.
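
The shape of the schema-level fix referenced above is sketched below. This is a minimal sketch based on the PR description and diff, not the exact PUDL implementation: the attribute names (self.schema.fields, self.schema.primary_key) and the comma-joined key encoding are assumptions.

import pyarrow as pa

def to_pyarrow(self) -> pa.Schema:
    """Sketch of Resource.to_pyarrow() with the None-handling fixes."""
    fields = [field.to_pyarrow() for field in self.schema.fields]
    metadata = {
        # Arrow metadata values must be strings, so fall back to "".
        "description": self.description if self.description is not None else "",
    }
    # Only record a primary key if the table actually has one.
    if self.schema.primary_key is not None:
        metadata["primary_key"] = ",".join(self.schema.primary_key)
    return pa.schema(fields, metadata=metadata)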

Part of #3102
Follow-up in #3224

Testing

I iterated through every Resource we generate, attempted to create a PyArrow schema from each one, and then used that schema to write the table out to a Parquet file.
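
In sketch form (the empty-table shortcut and output path here are illustrative assumptions; the real check wrote actual table data):

import pyarrow as pa
import pyarrow.parquet as pq

# PUDL_RESOURCES maps resource names to Resource objects, as in the
# unit tests quoted later in this thread.
for name, resource in PUDL_RESOURCES.items():
    schema = resource.to_pyarrow()  # must not raise for any table
    # An empty table is enough to prove the schema is Parquet-writable:
    table = pa.Table.from_pylist([], schema=schema)
    pq.write_table(table, f"{name}.parquet")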

@zaneselvans added labels on Jan 8, 2024: output (Exporting data from PUDL into other platforms or interchange formats), metadata (Anything having to do with the content, formatting, or storage of metadata; mostly datapackages), parquet (Issues related to the Apache Parquet file format which we use for long tables), data-types (Dtype conversions, standardization and implications of data types)
@zaneselvans self-assigned this on Jan 8, 2024
@zaneselvans linked an issue on Jan 8, 2024 that may be closed by this pull request
@e-belfer (Member) left a comment


I understand the primary key not always existing, but do we not want to enforce all columns and tables to have a description? That seems to be something worth enforcing, even though enforcing it at the site of writing it to Parquet doesn't seem ideal. Can we use pydantic to require all fields/tables to have a description from jump?
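
A minimal sketch of what that Pydantic requirement could look like, assuming a simplified stand-in for pudl.metadata.classes.Field and Pydantic v2:

from pydantic import BaseModel, field_validator

class Field(BaseModel):
    name: str
    description: str  # no default: a missing description fails validation

    @field_validator("description")
    @classmethod
    def _description_nonempty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("description must be a non-empty string")
        return v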

@@ -624,7 +624,9 @@ def to_pyarrow(self) -> pa.Field:
     name=self.name,
     type=self.to_pyarrow_dtype(),
     nullable=(not self.constraints.required),
-    metadata={"description": self.description},
+    metadata={
+        "description": self.description if self.description is not None else ""
+    },
Member

Is there any reason we would not want to force all of our columns to have descriptions?

Member Author

Ideally, no. We want them all to have descriptions.

Practically, this is a bit of work. Many of the columns that didn't have descriptions are extremely generic, with notes about the need to disambiguate them across the many different tables they show up in, since they don't always mean the same thing. So at minimum they need table-specific descriptions, and they may need to be renamed so that there's closer to a 1:1 mapping between column name and description.

And then there's the EIA-861, which has dozens of tables and hundreds of columns with no descriptions.

We can certainly use Pydantic to force them all to have descriptions (and we probably should), but since we don't actually know what all of these columns are, there's probably a couple of days' worth of work to get everything filled in with clear and accurate metadata.

Member

I'm ok to split this off into its own issue if we're able to catalog all the columns that don't currently have descriptions.

Member Author

I've created #3224 enumerating:

  • resources without descriptions
  • unused fields
  • fields without descriptions

- Excise non-working core_eia__codes_entity_types table.
- Fix a couple of typos in field metadata
- Add unit test to identify unused fields
- Add unit test to check for descriptions in all fields & resources
- Add unit test that ensures all resources can generate PyArrow schemas
@@ -137,7 +137,7 @@ def data_dictionary_metadata_to_rst(app):
     """Export data dictionary metadata to RST for inclusion in the documentation."""
     # Create an RST Data Dictionary for the PUDL DB:
     print("Exporting PUDL DB data dictionary metadata to RST.")
-    skip_names = ["datasets", "accumulated_depreciation_ferc1", "entity_types_eia"]
+    skip_names = ["datasets", "accumulated_depreciation_ferc1"]
Member Author

The never-quite-finished entity_types_eia table (now core_eia__codes_entity_types) was lingering and creating weird special cases, so I excised it.

src/pudl/metadata/codes.py (resolved)
Comment on lines -1731 to +1740
"descripion": "The name of the EIA operator utility.",
"description": "The name of the EIA operator utility.",
},
"operator_state": {
"type": "string",
"description": "The state where the operator utility is located.",
},
"operator_utility_id_eia": {
"type": "integer",
"descrption": "The EIA utility Identification number for the operator utility.",
"description": "The EIA utility Identification number for the operator utility.",
Member Author

Typos which were not causing errors because these fields aren't actually used anywhere, and so are never instantiated.
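
A sketch of the instantiation test mentioned in the overview, which is what surfaces typos like these (it assumes FIELD_METADATA maps field names to their attributes, as in the unused-fields test below, and that constructing Field triggers validation):

import pytest

@pytest.mark.parametrize("field_name", sorted(FIELD_METADATA.keys()))
def test_fields_are_instantiable(field_name: str):
    """Constructing the Field forces validation, so a key like
    'descripion' fails immediately instead of lying dormant."""
    _ = Field(name=field_name, **FIELD_METADATA[field_name])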

Comment on lines +59 to +60
"title": "PUDL Utility-Plant Associations",
"description": "Associations between PUDL utility IDs and PUDL plant IDs. This table is read in from a spreadsheet stored in the PUDL repository: src/pudl/package_data/glue/pudl_id_mapping.xlsx",
Member Author

These are a few tables that I added titles and descriptions to before realizing there were dozens, and then implemented the hacky workaround.

test/unit/metadata_test.py (resolved)
Comment on lines +47 to +50
@pytest.mark.parametrize("resource_name", sorted(PUDL_RESOURCES.keys()))
def test_pyarrow_schemas(resource_name: str):
    """Verify that we can produce pyarrow schemas for all defined Resources."""
    _ = PUDL_RESOURCES[resource_name].to_pyarrow()
Member Author

This was the main new test I wanted to add -- just so we catch any changes that happen to invalidate our PyArrow schemas (especially since we aren't actually using the schemas yet, so breakage wouldn't show up in the builds).

test/unit/metadata_test.py (resolved)
Comment on lines +59 to +70
@pytest.mark.xfail(reason="Need to purge unused fields. See issue #3224")
def test_defined_fields_are_used():
    """Check that all fields which are defined are actually used."""
    used_fields = set()
    for resource in PUDL_RESOURCES.values():
        used_fields |= {f.name for f in resource.schema.fields}
    defined_fields = set(FIELD_METADATA.keys())
    unused_fields = sorted(defined_fields - used_fields)
    if len(unused_fields) > 0:
        raise AssertionError(
            f"Found {len(unused_fields)} unused fields: {unused_fields}"
        )
Member Author

I don't think there are good reasons to have unused fields lying around. I suspect that the ones we've got are just leftovers where we renamed a column and never purged the old definition.
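
The companion xfail checks for missing descriptions mentioned in the overview (but not quoted here) follow the same pattern as the test above. A sketch of the field-level version:

@pytest.mark.xfail(reason="Many fields lack descriptions. See issue #3224")
def test_fields_have_descriptions():
    """Check that every defined field carries a description."""
    undescribed = sorted(
        name
        for name, attrs in FIELD_METADATA.items()
        if not attrs.get("description")
    )
    if len(undescribed) > 0:
        raise AssertionError(
            f"Found {len(undescribed)} fields without descriptions: {undescribed}"
        )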

src/pudl/metadata/resources/eia.py (resolved)
test/unit/metadata_test.py (resolved)
@bendnorman (Member) left a comment

Yay! I'm glad this was an easy fix.

src/pudl/metadata/codes.py (resolved)
test/unit/metadata_test.py (resolved)
@zaneselvans marked this pull request as ready for review January 9, 2024 20:21
@zaneselvans enabled auto-merge (squash) January 9, 2024 20:40
@zaneselvans merged commit d62dd5c into main on Jan 9, 2024
11 checks passed
@zaneselvans deleted the pyarrow-schema-fixes branch January 9, 2024 22:16
Development

Successfully merging this pull request may close these issues.

Output PUDL as Parquet as well as SQLite