-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor PUDL to use Pydantic v2 #3051
Conversation
c918a95
to
cf82eb3
Compare
ee201b3
to
f8dfbdf
Compare
Woo! I got the unit and integration tests to pass locally. |
27245e3
to
65a15a0
Compare
ca173dc
to
c9dac6d
Compare
f488baa
to
d7ccdd2
Compare
* import missing modules, remove unused demonstration asset module. * Import working; syntax issues fixed; real unit tests broken. * Fix unit test failures resulting from new Pydantic type conversion behavior. * Fix doctest failure. * Replace deprecated df.applymap() with df.map() * Make PUDL_INPUT/PUDL_OUTPUT directories before docs build. * Remove Zenodo sandbox DOIs that got wiped. * Actually create pudl input/output dirs on RTD.
* Also Address (or suppress) a number of warnings in unit tests.
* Uncomment and revert non-working Pydantic validators. * Suppress eia_bulk_elec warning about non-uniform date formats.
* Also migrate XBRL calculation error tolerance validations to Pydantic v2
70f49b3
to
a8423da
Compare
"ignore:The `__fields__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:unittest.mock", | ||
"ignore:The `__fields_set__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:unittest.mock", | ||
"ignore:The `__fields__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:pydantic.main", | ||
"ignore:The `__fields_set__` attribute is deprecated:pydantic.PydanticDeprecatedSince20:pydantic.main", | ||
"ignore:The `update_forward_refs` method is deprecated:pydantic.PydanticDeprecatedSince20:pydantic.main", | ||
"ignore:Support for class-based `config` is deprecated:pydantic.PydanticDeprecatedSince20:pydantic._internal._config", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These warnings are coming from other packages that we don't control, so ignoring.
src/pudl/metadata/classes.py
Outdated
class Date(BaseType): | ||
"""Any :class:`datetime.date`.""" | ||
|
||
@classmethod | ||
def validate(cls, value: Any) -> datetime.date: | ||
"""Validate as date.""" | ||
if not isinstance(value, datetime.date): | ||
raise TypeError("value is not a date") | ||
return value | ||
|
||
|
||
class Datetime(BaseType): | ||
"""Any :class:`datetime.datetime`.""" | ||
|
||
@classmethod | ||
def validate(cls, value: Any) -> datetime.datetime: | ||
"""Validate as datetime.""" | ||
if not isinstance(value, datetime.datetime): | ||
raise TypeError("value is not a datetime") | ||
return value | ||
|
||
|
||
class Pattern(BaseType): | ||
"""Regular expression pattern.""" | ||
|
||
@classmethod | ||
def validate(cls, value: Any) -> re.Pattern: | ||
"""Validate as pattern.""" | ||
if not isinstance(value, str | re.Pattern): | ||
raise TypeError("value is not a string or compiled regular expression") | ||
if isinstance(value, str): | ||
try: | ||
value = re.compile(value) | ||
except re.error: | ||
raise ValueError("string is not a valid regular expression") | ||
return value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datetime.date
, datetime.datetime
, and re.Pattern
are all supported natively by Pydantic v2 with behavior identical to these classes, so I'm just using them directly.
Email = pydantic.EmailStr | ||
"""String representing an email.""" | ||
|
||
HttpUrl = pydantic.AnyHttpUrl | ||
"""Http(s) URL.""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using these pydantic types directly for clarity rather than renaming.
field.constraints.required = True | ||
else: | ||
missing.append(field.name) | ||
if missing: | ||
raise ValueError(f"names {missing} missing from fields") | ||
return value | ||
|
||
@pydantic.validator("foreign_keys", each_item=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The each_item=True
functionality from Pydantic v1 is no longer available, so instead I explicitly iterate over all of the FK constraints.
@model_validator(mode="after") | ||
def _populate_from_resources(self: Self): | ||
"""Populate Package attributes from similar deduplicated Resource attributes. | ||
|
||
@pydantic.root_validator(skip_on_failure=True) | ||
def _populate_from_resources(cls, values): # noqa: N805 | ||
Resources and Packages share some descriptive attributes. When building a | ||
Package out of a collection of Resources, we want the Package to reflect the | ||
union of all the analogous values found in the Resources, but we don't want | ||
any duplicates. We may also get values directly from the Package inputs. | ||
""" | ||
for key in ("keywords", "contributors", "sources", "licenses"): | ||
values[key] = _unique( | ||
values[key], *[getattr(r, key) for r in values["resources"]] | ||
) | ||
return values | ||
package_value = getattr(self, key) | ||
resource_values = [getattr(resource, key) for resource in self.resources] | ||
deduped_values = _unique(package_value, *resource_values) | ||
setattr(self, key, deduped_values) | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the terse way in which this was written before hard to read, and ended up expanding it while debugging, and left it in the more explicit form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, this creates an infinite recursion when validate_assignment
is enabled for the model because of changes in how model validation works.
For the moment I've disabled validate_assignment
but that doesn't seem ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bendnorman Any obvious ideas how we might validate on assignment here without infinite recursion?
src/pudl/metadata/classes.py
Outdated
class PudlMeta(BaseModel): | ||
"""A base model that configures some options for PUDL metadata classes.""" | ||
|
||
model_config = ConfigDict( | ||
extra="forbid", | ||
validate_all=True, | ||
validate_assignment=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is a simplified generic version of the old Base
class, which sets some model configurations shared across all of our metadata classes below.
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## dev #3051 +/- ##
=======================================
- Coverage 88.7% 88.7% -0.1%
=======================================
Files 90 89 -1
Lines 10995 10957 -38
=======================================
- Hits 9759 9725 -34
+ Misses 1236 1232 -4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Lots of good upgrading and clean up. Thanks @zaneselvans! Some changes I'm super familiar with like in the FERC transform classes but it seems like most of the pydantic changes were swapping in v2 pydnatic classes and functions.
def _convert_settings_to_dagster_config(settings_dict: dict[str, Any]) -> None: | ||
"""Recursively convert a dictionary of dataset settings to dagster config in place. | ||
|
||
For each partition parameter in a GenericDatasetSettings subclass, create a Noneable | ||
Dagster field with a default value of None. The GenericDatasetSettings | ||
subclasses will default to include all working paritions if the partition value | ||
is None. Get the value type so dagster can do some basic type checking in the UI. | ||
For each partition parameter in a :class:`GenericDatasetSettings` subclass, create a | ||
corresponding :class:`DagsterField`. By default the :class:`GenericDatasetSettings` | ||
subclasses will default to include all working paritions if the partition value is | ||
None. Get the value type so dagster can do some basic type checking in the UI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
PR Overview
Refactor PUDL to use Pydantic v2
PUDL_INPUT
andPUDL_OUTPUT
directories in our docs build environments. The newPotentialDirectoryPath
requires that the path either be a directory, or be a path where you could immediately create a new directory (without needing to create parent directories).model.dict()
withmodel.model.dump()
across the board.@classmethod
decorators for pydantic validators where appropriate.root_validator
formodel_validator
validator
forfield_validator
and in some cases turned them intomodel_validator(mode="after")
if they were accessing many attributes and not mutating the class.v
orvalue
orvalues
-- since most of our validators only check a single attribute).constr
orconint
forAnnotated[]
types.Other deprecations and warnings:
df.applymap()
withdf.map()
pd.unique()
on types it will stop working with soon.pd.ExcelFile()
withBytesIO
as indicated by deprecation warnings.Outstanding Issues/Questions:
PudlPaths
/PotentialDirectoryPath
behavior is actually ideal. But it works and is simple to define.ConfigurableResurce
classes... but it's working as it is.PR Checklist
dev
).