Refactor PUDL to use Pydantic v2 #3051

Merged
merged 31 commits into dev from pydantic-v2
Nov 28, 2023

Conversation

zaneselvans
Member

@zaneselvans zaneselvans commented Nov 15, 2023

PR Overview

Refactor PUDL to use Pydantic v2

  • Fix some unit test failures that were incidental -- due to differences in the way Pydantic v2 converts types when multiple basic types are acceptable (v2 will generally preserve the types, if they match any of the allowable types)
  • Create the specified PUDL_INPUT and PUDL_OUTPUT directories in our docs build environments. The new PotentialDirectoryPath requires that the path either be a directory, or be a path where you could immediately create a new directory (without needing to create parent directories).
  • Replace model.dict() with model.model_dump() across the board.
  • Suppress warnings coming from Pydantic v1 syntax in other packages.
  • Switch to explicitly using @classmethod decorators for pydantic validators where appropriate.
  • Swapped a lot of root_validator for model_validator
  • Swapped a lot of validator for field_validator and in some cases turned them into model_validator(mode="after") if they were accessing many attributes and not mutating the class.
  • Tried to use more immediately legible names for the parameters being validated in many of the validators (rather than v or value or values -- since most of our validators only check a single attribute).
  • Switched a bunch of constrained type factories like constr or conint for Annotated[] types.
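
The v1 to v2 substitutions listed above can be sketched on a toy model. This is a minimal illustration, assuming Pydantic v2 is installed; `DatasetSettings`, `Year`, and `NonEmptyStr` are hypothetical names and not PUDL classes.

```python
from typing import Annotated

from pydantic import BaseModel, Field, field_validator, model_validator

# v1: conint(ge=1900) / constr(min_length=1); v2: Annotated[] with Field constraints.
Year = Annotated[int, Field(ge=1900)]
NonEmptyStr = Annotated[str, Field(min_length=1)]


class DatasetSettings(BaseModel):
    """Hypothetical settings model, not one of PUDL's real classes."""

    name: NonEmptyStr
    start_year: Year
    end_year: Year

    # v1: @validator("name"); v2: @field_validator plus an explicit @classmethod.
    @field_validator("name")
    @classmethod
    def lowercase_name(cls, name: str) -> str:
        return name.lower()

    # v1: @root_validator; v2: @model_validator. In "after" mode the validator
    # is an instance method that reads already-validated attributes.
    @model_validator(mode="after")
    def check_year_order(self):
        if self.end_year < self.start_year:
            raise ValueError("end_year must not precede start_year")
        return self


settings = DatasetSettings(name="EIA860", start_year=2001, end_year=2022)
# v1: settings.dict(); v2: settings.model_dump()
print(settings.model_dump())
```

Naming the validator argument after the field (`name` rather than `v`) follows the legibility point made in the list.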

Other deprecations and warnings:

  • Replaced deprecated df.applymap() with df.map()
  • Remove the zombie Zenodo sandbox DOIs which got wiped.
  • Suppress the 50,000 warnings we're getting about inconsistent date formats in the EIA Bulk Electricity data.
  • Don't use deprecated pd.unique() on types it will stop working with soon.
  • Wrap inputs to pd.ExcelFile() with BytesIO as indicated by deprecation warnings.
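
A rough sketch of the pandas changes above, assuming a recent pandas release; the data is made up for illustration, and the `hasattr` fallback is only there so the snippet also runs on pandas versions that predate `DataFrame.map()`.

```python
import pandas as pd

df = pd.DataFrame({"a": [" x ", "y "], "b": ["z", " w"]})

# DataFrame.map() is the element-wise replacement for the deprecated
# DataFrame.applymap(); fall back to applymap() on older pandas.
elementwise = df.map if hasattr(df, "map") else df.applymap
stripped = elementwise(str.strip)

# Pass pd.unique() a Series (or ndarray/Index) rather than a plain list,
# assuming that is the kind of input the deprecation warning is about.
years = pd.unique(pd.Series([2020, 2021, 2020, 2022]))

# For Excel inputs held in memory as bytes, the analogous change would be
# pd.ExcelFile(io.BytesIO(raw_bytes)) instead of pd.ExcelFile(raw_bytes).
print(stripped.to_dict("list"), list(years))
```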

Outstanding Issues/Questions:

  • Not sure if the new PudlPaths / PotentialDirectoryPath behavior is actually ideal. But it works and is simple to define.
  • In retrospect, not sure if the changes that got made in the settings classes were necessary, given that we didn't end up trying to turn them into Dagster's ConfigurableResource classes... but it's working as it is.
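
For reference, the "directory, or a path where a directory could be created immediately" rule can be sketched with an `Annotated` type and an `AfterValidator`. This is a hypothetical stand-in written for illustration; PUDL's actual `PotentialDirectoryPath` may be implemented differently, and `Paths` and `_is_potential_directory` are invented names.

```python
import tempfile
from pathlib import Path
from typing import Annotated

from pydantic import AfterValidator, BaseModel


def _is_potential_directory(path: Path) -> Path:
    """Accept an existing directory, or a missing path whose parent is a directory."""
    if path.is_dir():
        return path
    if not path.exists() and path.parent.is_dir():
        return path
    raise ValueError(f"{path} is not a directory and cannot be created directly")


# Hypothetical stand-in for the type described in the PR overview.
PotentialDirectory = Annotated[Path, AfterValidator(_is_potential_directory)]


class Paths(BaseModel):
    pudl_input: PotentialDirectory
    pudl_output: PotentialDirectory


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # An existing directory and a not-yet-created child are both accepted.
    ok = Paths(pudl_input=root, pudl_output=root / "output")
    try:
        # A path that would need its parent created first is rejected.
        Paths(pudl_input=root, pudl_output=root / "missing" / "output")
        nested_rejected = False
    except ValueError:
        nested_rejected = True
```

This also shows why the docs build needs the directories (or at least their parents) to exist before the settings are loaded.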

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans linked an issue Nov 16, 2023 that may be closed by this pull request
@zaneselvans zaneselvans added the "dependencies" label (Pull requests that update a dependency file) Nov 18, 2023
@zaneselvans zaneselvans self-assigned this Nov 18, 2023
@zaneselvans zaneselvans force-pushed the pydantic-v2 branch 6 times, most recently from ee201b3 to f8dfbdf November 22, 2023 22:32
@zaneselvans
Member Author

Woo! I got the unit and integration tests to pass locally.

@zaneselvans zaneselvans changed the title from "WIP: Initial refactor for pydantic v2" to "Refactor PUDL to use Pydantic v2" Nov 23, 2023
src/pudl/settings.py (outdated)
src/pudl/settings.py (outdated)
* import missing modules, remove unused demonstration asset module.
* Import working; syntax issues fixed; real unit tests broken.
* Fix unit test failures resulting from new Pydantic type conversion behavior.
* Fix doctest failure.
* Replace deprecated df.applymap() with df.map()
* Make PUDL_INPUT/PUDL_OUTPUT directories before docs build.
* Remove Zenodo sandbox DOIs that got wiped.
* Actually create pudl input/output dirs on RTD.
* Also Address (or suppress) a number of warnings in unit tests.
* Uncomment and revert non-working Pydantic validators.
* Suppress eia_bulk_elec warning about non-uniform date formats.
* Also migrate XBRL calculation error tolerance validations to Pydantic v2
.github/workflows/pytest.yml
devtools/materialize_asset.py
docs/conf.py
pyproject.toml
Comment on lines +255 to +260
"ignore:The `__fields__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:unittest.mock",
"ignore:The `__fields_set__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:unittest.mock",
"ignore:The `__fields__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:pydantic.main",
"ignore:The `__fields_set__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:pydantic.main",
"ignore:The `update_forward_refs` method is deprecated:pydantic.PydanticDeprecatedSince20:pydantic.main",
"ignore:Support for class-based `config` is deprecated:pydantic.PydanticDeprecatedSince20:pydantic._internal._config",
Member Author

These warnings are coming from other packages that we don't control, so ignoring.
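
The pyproject.toml entries above follow pytest's `action:message:category:module` layout. A rough standard-library equivalent is sketched below; it uses the built-in `DeprecationWarning` as a stand-in for `pydantic.PydanticDeprecatedSince20`, and `my_package` is a made-up module name used only to show that warnings from first-party code still surface.

```python
import warnings

# (message regex, module regex) pairs; the message regex is matched against
# the start of the warning text.
THIRD_PARTY_FILTERS = [
    (r"The `__fields__` attribute is deprecated", r"unittest\.mock"),
    (r"The `__fields_set__` attribute is deprecated", r"unittest\.mock"),
    (r"The `update_forward_refs` method is deprecated", r"pydantic\.main"),
]


def install_third_party_filters(category=DeprecationWarning):
    """Ignore known deprecation warnings raised from modules we don't control."""
    for message, module in THIRD_PARTY_FILTERS:
        warnings.filterwarnings(
            "ignore", message=message, category=category, module=module
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    install_third_party_filters()
    text = "The `__fields__` attribute is deprecated, use `model_fields` instead."
    # Emit the same warning as if from a third-party module and from our own.
    for module in ("unittest.mock", "my_package"):
        warnings.warn_explicit(
            text, DeprecationWarning, filename="example.py", lineno=1, module=module
        )

# Only the warning attributed to my_package is recorded.
print([str(w.message) for w in caught])
```

Scoping each filter by module keeps the same deprecation visible if it ever starts coming from code the project does control.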

src/pudl/metadata/classes.py
Comment on lines 242 to 277
class Date(BaseType):
    """Any :class:`datetime.date`."""

    @classmethod
    def validate(cls, value: Any) -> datetime.date:
        """Validate as date."""
        if not isinstance(value, datetime.date):
            raise TypeError("value is not a date")
        return value


class Datetime(BaseType):
    """Any :class:`datetime.datetime`."""

    @classmethod
    def validate(cls, value: Any) -> datetime.datetime:
        """Validate as datetime."""
        if not isinstance(value, datetime.datetime):
            raise TypeError("value is not a datetime")
        return value


class Pattern(BaseType):
    """Regular expression pattern."""

    @classmethod
    def validate(cls, value: Any) -> re.Pattern:
        """Validate as pattern."""
        if not isinstance(value, str | re.Pattern):
            raise TypeError("value is not a string or compiled regular expression")
        if isinstance(value, str):
            try:
                value = re.compile(value)
            except re.error:
                raise ValueError("string is not a valid regular expression")
        return value
Member Author

datetime.date, datetime.datetime, and re.Pattern are all supported natively by Pydantic v2 with behavior identical to these classes, so I'm just using them directly.
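
As a small illustration of using the standard-library types directly as annotations, assuming Pydantic v2: `Constraints` and its field names are hypothetical. Note that Pydantic's default lax mode will also parse ISO 8601 strings for the date fields, in addition to accepting `date` and `datetime` instances.

```python
import datetime
import re

from pydantic import BaseModel


class Constraints(BaseModel):
    """Hypothetical model; the field names are illustrative only."""

    min_date: datetime.date
    updated: datetime.datetime
    pattern: re.Pattern


constraints = Constraints(
    min_date=datetime.date(2023, 11, 28),
    updated=datetime.datetime(2023, 11, 28, 7, 46),
    pattern=r"^\d{4}$",  # a string is compiled into a re.Pattern
)
print(type(constraints.pattern), constraints.min_date)
```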

Comment on lines -226 to -231
Email = pydantic.EmailStr
"""String representing an email."""

HttpUrl = pydantic.AnyHttpUrl
"""Http(s) URL."""

Member Author

Using these pydantic types directly for clarity rather than renaming.

field.constraints.required = True
else:
missing.append(field.name)
if missing:
raise ValueError(f"names {missing} missing from fields")
return value

@pydantic.validator("foreign_keys", each_item=True)
Member Author

The each_item=True functionality from Pydantic v1 is no longer available, so instead I explicitly iterate over all of the FK constraints.
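
A minimal sketch of that pattern, assuming Pydantic v2: the validator receives the whole list and loops over it, rather than being called once per item. `ForeignKey`, `Schema`, and the emptiness check are simplified, hypothetical stand-ins for the real metadata classes and constraints.

```python
from pydantic import BaseModel, field_validator


class ForeignKey(BaseModel):
    """Hypothetical, simplified foreign key constraint."""

    columns: list[str]
    target: str


class Schema(BaseModel):
    """Hypothetical, simplified schema."""

    foreign_keys: list[ForeignKey]

    # v1: @validator("foreign_keys", each_item=True) validated one item per call.
    # v2: validate the whole list and iterate explicitly.
    @field_validator("foreign_keys")
    @classmethod
    def check_foreign_keys(cls, foreign_keys: list[ForeignKey]) -> list[ForeignKey]:
        for foreign_key in foreign_keys:
            if not foreign_key.columns:
                raise ValueError(f"foreign key to {foreign_key.target} has no columns")
        return foreign_keys


schema = Schema(foreign_keys=[{"columns": ["plant_id"], "target": "plants"}])
print(schema.foreign_keys[0].target)
```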

Comment on lines +1722 to +1736
+    @model_validator(mode="after")
+    def _populate_from_resources(self: Self):
+        """Populate Package attributes from similar deduplicated Resource attributes.
+
-    @pydantic.root_validator(skip_on_failure=True)
-    def _populate_from_resources(cls, values):  # noqa: N805
         Resources and Packages share some descriptive attributes. When building a
         Package out of a collection of Resources, we want the Package to reflect the
         union of all the analogous values found in the Resources, but we don't want
         any duplicates. We may also get values directly from the Package inputs.
         """
         for key in ("keywords", "contributors", "sources", "licenses"):
-            values[key] = _unique(
-                values[key], *[getattr(r, key) for r in values["resources"]]
-            )
-        return values
+            package_value = getattr(self, key)
+            resource_values = [getattr(resource, key) for resource in self.resources]
+            deduped_values = _unique(package_value, *resource_values)
+            setattr(self, key, deduped_values)
+        return self
Member Author

I found the terse way in which this was written before hard to read, and ended up expanding it while debugging, and left it in the more explicit form.

Member Author

However, this creates an infinite recursion when validate_assignment is enabled for the model because of changes in how model validation works.

For the moment I've disabled validate_assignment but that doesn't seem ideal.

Member Author

@bendnorman Any obvious ideas how we might validate on assignment here without infinite recursion?

Comment on lines 231 to 238
class PudlMeta(BaseModel):
    """A base model that configures some options for PUDL metadata classes."""

    model_config = ConfigDict(
        extra="forbid",
        validate_all=True,
        validate_assignment=True,
    )
Member Author

This class is a simplified generic version of the old Base class, which sets some model configurations shared across all of our metadata classes below.
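
A sketch of how a shared base class like this behaves, assuming Pydantic v2. `StrictBase` and `Contributor` are hypothetical names. The sketch spells the default-validation option `validate_default`, which is the Pydantic v2 name for v1's `validate_all`.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictBase(BaseModel):
    """Hypothetical base class that shares configuration with its subclasses."""

    model_config = ConfigDict(
        extra="forbid",  # reject fields that are not declared on the model
        validate_default=True,  # validate default values too (v1: validate_all)
        validate_assignment=True,  # re-validate when an attribute is assigned
    )


class Contributor(StrictBase):
    title: str
    role: str = "author"


person = Contributor(title="Catalyst Cooperative")

try:
    Contributor(title="x", nickname="y")  # unknown field
    extra_rejected = False
except ValidationError:
    extra_rejected = True

try:
    person.title = 123  # wrong type on assignment
    assignment_rejected = False
except ValidationError:
    assignment_rejected = True

print(extra_rejected, assignment_rejected)
```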

@zaneselvans zaneselvans marked this pull request as ready for review November 25, 2023 19:53

codecov bot commented Nov 25, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (5eb88de) 88.7% compared to head (628f41a) 88.7%.
Report is 16 commits behind head on dev.

Files                           Patch %   Lines
src/pudl/transform/ferc1.py     72.7%     6 Missing ⚠️
src/pudl/cli/etl.py             33.3%     2 Missing ⚠️
src/pudl/settings.py            98.3%     2 Missing ⚠️
src/pudl/transform/classes.py   93.7%     2 Missing ⚠️
src/pudl/metadata/classes.py    99.3%     1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #3051     +/-   ##
=======================================
- Coverage   88.7%   88.7%   -0.1%     
=======================================
  Files         90      89      -1     
  Lines      10995   10957     -38     
=======================================
- Hits        9759    9725     -34     
+ Misses      1236    1232      -4     


Member

@bendnorman bendnorman left a comment

LGTM! Lots of good upgrading and clean up. Thanks @zaneselvans! Some changes I'm super familiar with like in the FERC transform classes but it seems like most of the pydantic changes were swapping in v2 pydantic classes and functions.

src/pudl/metadata/classes.py
src/pudl/metadata/classes.py
src/pudl/settings.py
src/pudl/settings.py
src/pudl/settings.py
Comment on lines +681 to +687
 def _convert_settings_to_dagster_config(settings_dict: dict[str, Any]) -> None:
     """Recursively convert a dictionary of dataset settings to dagster config in place.
 
-    For each partition parameter in a GenericDatasetSettings subclass, create a Noneable
-    Dagster field with a default value of None. The GenericDatasetSettings
-    subclasses will default to include all working paritions if the partition value
-    is None. Get the value type so dagster can do some basic type checking in the UI.
+    For each partition parameter in a :class:`GenericDatasetSettings` subclass, create a
+    corresponding :class:`DagsterField`. By default the :class:`GenericDatasetSettings`
+    subclasses will default to include all working paritions if the partition value is
+    None. Get the value type so dagster can do some basic type checking in the UI.
Member

LGTM!

@zaneselvans zaneselvans marked this pull request as draft November 28, 2023 06:49
@zaneselvans zaneselvans marked this pull request as ready for review November 28, 2023 07:40
@zaneselvans zaneselvans merged commit fa74991 into dev Nov 28, 2023
12 of 13 checks passed
@zaneselvans zaneselvans deleted the pydantic-v2 branch November 28, 2023 07:46
Labels
dependencies Pull requests that update a dependency file
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Update PUDL to use Pydantic 2
2 participants