Fix issues arising from pandas v2.1 & ferc-xbrl-extractor v1.1.1 #2854
Conversation
@zschira It seems like something on this branch has made the XBRL extraction process catastrophically slow. It doesn't run out of memory, but it takes more than 6 hours, so the CI can't even complete within the allotted time. See this failed run. I ran

But after the pause things continued as normal, with the entire Tox run taking nearly 8 hours to complete.

Running `$ ferc_to_sqlite --clobber --gcs-cache-path gs://zenodo-cache.catalyst.coop src/pudl/package_data/settings/etl_full.yml` locally also results in multi-hour hangs in the XBRL extraction process. Doesn't seem very informative, but I did a Ctrl-C and got:
This behavior doesn't seem to show up in #2821, and the only thing that seems like it could be making a difference is the switch from pandas 2.0.x to 2.1.x, so I'm going to re-install pandas 2.0.3, re-run the script, and see if it behaves wildly differently...

Is there anything in the pandas 2.1 release notes that seems like it might have impacted what we're doing? Most of the things I saw were related to data types. Maybe we are somehow ending up with a bunch of

It looks like this option can make PyArrow strings the default, reflecting future behavior: `pd.options.future.infer_string = True`

With the only change being going from pandas v2.1.1 to v2.0.3
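For reference, a minimal sketch of what that option toggles (assuming pandas >= 2.1 with pyarrow installed; not code from this PR):

```python
import pandas as pd

# Default pandas 2.1 behavior: strings are inferred as NumPy-backed
# object-dtype columns.
print(pd.Series(["foo", "bar"]).dtype)  # object

# Opting into the future behavior makes string inference produce a
# PyArrow-backed string dtype instead.
pd.options.future.infer_string = True
print(pd.Series(["foo", "bar"]).dtype)  # a pyarrow-backed string dtype
```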
There are some problems due to foreign key constraint violations & unmapped utilities, but those seem similar to what's happening in #2821 already. There are also a few small changes in the number of records in the FERC 1 outputs, but I don't know if these are the same as the ones that have popped up in #2821, or if they're a result of the change I made in
This PR leaked over into `ferc-xbrl-extractor`, because beyond the pandas API changes in this repo, there was a big performance bottleneck over in that repo, which was addressed in catalyst-cooperative/ferc-xbrl-extractor#144.

The CI in this repo doesn't pass right now, but I think that's for reasons unrelated to the pandas 2.1 update and the changes to `ferc-xbrl-extractor` -- I think the change in row counts and the foreign key constraint / unmapped plants and utilities issues also exist in #2821, but we should compare notes with @jdangerx.

We should probably merge this into the upstream branch before that branch hits `dev`, because of the performance & compatibility issues with the `ferc-xbrl-extractor==1.1.1` update.
ferc1_solo -> Test whether FERC 1 can be loaded into the PUDL database alone.
integration -> Run all software integration tests and process a full year of data.
minmax_rows -> Check that all outputs have the expected number of rows.
validate -> Run all data validation tests. This requires a complete PUDL DB.
ferc1_schema -> Verify FERC Form 1 DB schema are compatible for all years.
jupyter -> Ensure that designated Jupyter notebooks can be executed.
full_integration -> Run ETL and integration tests for all years and data sources.
full -> Run all CI checks, but for all years of data.
build -> Prepare Python source and binary packages for release.
testrelease -> Do a dry run of Python package release using the PyPI test server.
release -> Release the PUDL package to the production PyPI server.
nuke -> Nuke & recreate SQLite & Parquet outputs, then run all tests and data validations against the new outputs.
get_unmapped_ids -> Make the raw FERC1 DB and generate a PUDL database with only EIA in order to generate any unmapped IDs.
The test environments were generally out of date and I just cut-and-pasted in the current ones. I noticed that the `ferc1_schema` test was gone entirely, and am wondering if we are no longer checking anywhere for inter-year FERC DBF database schema compatibility with the reference year? Or if that got moved to the more generic DBF extractor code? I poked around a bit but didn't see it immediately. @rousik
Assuming you do want to run the ETL and build new databases as part of the test
you're running, the contents of that database are determined by an ETL settings
file. By default, the settings file that's used is
``test/settings/integration-test.yml``. But it's also possible to use a
different input file, generating a different database, and then run some
tests against that database.

For example, we test that FERC 1 data can be loaded into a PUDL database all
by itself by running the ETL tests with a settings file that includes only a
couple of FERC 1 tables for a single year. This is the ``ferc1_solo`` Tox
test environment:

.. code-block:: console

   $ pytest --etl-settings=test/settings/ferc1-solo-test.yml test/integration/etl_test.py

Similarly, we use the ``test/settings/full-integration-test.yml`` settings file
to specify an exhaustive collection of input data, and then we run a test that
checks that the database schemas extracted from all historical FERC 1 databases
are compatible with each other. This is the ``ferc1_schema`` test:

.. code-block:: console

   $ pytest --etl-settings test/settings/full-integration-test.yml test/integration/etl_test.py::test_ferc1_schema

The raw input data that all the tests use ultimately comes from our
`archives on Zenodo <https://zenodo.org/communities/catalyst-cooperative>`__.
However, you can optionally tell the tests to look in different places for more
rapidly accessible caches of that data, or force the download of a fresh
copy (especially useful when you are testing the datastore functionality
specifically). By default, the tests will use the datastore that's part of your
local PUDL workspace.

For example, to run the ETL portion of the integration tests and download
fresh input data to a temporary datastore that's later deleted automatically:
Removed references to the `ferc1_schema` and `ferc1_solo` environments, which no longer exist.
docs/pudl/pudl-etl.dot
Removed this out-of-date flowchart describing the ETL.
I guess the Dagster UI already shows a lot of this information now!
"pandas[parquet,excel,fss,gcp,compression]>=2.0,<2.1", | ||
"pandas[compression,excel,fss,gcp,parquet,performance]>=2,<2.2", |
In addition to allowing pandas 2.1.x, I alphabetized these and included `performance`, since it did result in a noticeable speedup in my testing in the `ferc-xbrl-extractor` repo (even though that didn't end up being the fix I used).
@@ -552,7 +552,7 @@ def convert_units(col: pd.Series, params: UnitConversion) -> pd.Series:
     new_name = col.name
     if (col.name == new_name) & (params.from_unit != "") & (params.to_unit != ""):
         logger.debug(f"Old and new column names are identical: {col.name}.")
-    col = (params.multiplier * col) + params.adder
+    col = (params.multiplier * pd.to_numeric(col)) + params.adder
There were half a dozen tables where `col` ended up being a numeric string like `"0.0"` as a result of `ferc-xbrl-extractor==1.1.1` -- though I'm not sure what about the change in method over there affected the output type.
This was from 1.1.0 to 1.1.1? Surprising... this seems OK for now, but I'm definitely curious what changed. It's totally possible that we'll keep changing the extractor - I don't think this is blocking, though.
There are notes in the PRs / issues, but the thing that changed is what kind of NULL value ends up being generated. In v1.1.0, with `gb.ffill().iloc[-1]`, we ended up with mixed `str` + `np.nan` => `object` columns that resulted in numeric values. In v1.1.1, using `gb.last()`, we ended up with mixed `str` + `None` => `object` columns that are interpreted as strings. 🤷🏼
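A minimal sketch of that difference (illustrative values only, not code from the extractor):

```python
import numpy as np
import pandas as pd

# v1.1.0-style output: strings mixed with np.nan in an object column.
v110 = pd.Series(["0.0", np.nan], dtype=object)

# v1.1.1-style output: strings mixed with None in an object column.
v111 = pd.Series(["0.0", None], dtype=object)

# Both are object dtype, so arithmetic on the raw column treats values as
# strings: multiplying "0.0" by 2 repeats the string instead of scaling it.
print(2 * v111.dropna())  # "0.00.0"

# Coercing first, as in the convert_units() fix above, does real math:
print(2 * pd.to_numeric(v111))  # [0.0, NaN]
```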
Updates the requirements on [fsspec](https://github.com/fsspec/filesystem_spec) to permit the latest version ([commits](fsspec/filesystem_spec@2021.07.0...2023.9.2)). Signed-off-by: dependabot[bot] <[email protected]>

Updates the requirements on [dask](https://github.com/dask/dask) to permit the latest version ([changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md), [commits](dask/dask@2021.08.0...2023.9.2)). Signed-off-by: dependabot[bot] <[email protected]>

…/fsspec-gte-2021.7-and-lt-2023.9.3: Update fsspec requirement from >=2021.7,<2023.6.1 to >=2021.7,<2023.9.3

…/dask-gte-2021.8-and-lt-2023.9.3: Update dask requirement from >=2021.8,<2023.9.2 to >=2021.8,<2023.9.3
Yeah, this all looks pretty good! Thanks for doing all this maintenance :)
@@ -44,7 +44,7 @@ class CalculationToleranceFerc1(BaseModel):
         intertable_calculation_errors=0.20,
     ),
     "balance_sheet_assets_ferc1": CalculationToleranceFerc1(
-        intertable_calculation_errors=0.85,
+        intertable_calculation_errors=1.00,
😢
It's okay man we're back down to 60% on the other other FERC branch!
src/pudl/output/ferc714.py
 for ref, fix in zip(refs, ASSOCIATIONS):
     for year in range(fix["to"][0], fix["to"][1] + 1):
         key = (fix["id"], pd.Timestamp(year, 1, 1))
         if key not in dfi.index:
             rows.append({**ref, "report_date": key[1]})
 # Append to original table
-df = pd.concat([df, pd.DataFrame(rows)])
+new_rows = apply_pudl_dtypes(pd.DataFrame(rows), group="eia")
super-nit: maybe `rows_dtyped` is a little clearer about what's *new* about these rows? Or, later on line 279 you just skip the variable assignment altogether, which also seems legitimate - maybe we can do that here too?
Okay, went with the solution from below.
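For reference, the "skip the variable assignment" version would look something like this (a sketch reusing `df`, `rows`, and `apply_pudl_dtypes` from the diff above):

```python
# Apply PUDL dtypes to the new rows inline and concatenate, with no
# intermediate variable to name.
df = pd.concat([df, apply_pudl_dtypes(pd.DataFrame(rows), group="eia")])
```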
@jdangerx well for better or worse I apparently love doing this kind of maintenance stuff.
Merged commit 11c2311 into 2810-run-2021-ferc-1-data-through-new-more-complete-extractor
PR Overview

- Allow `pandas<2.2` (update to pandas 2.1.x).
- Deal with `CategoricalDtype` having explicit types now.
- Update to `ferc-xbrl-extractor` v1.1.1. See this PR & comment for more details.
- Remove the `ferc1_schema` test, since it's been removed from the integration tests.

PR Checklist

- Merge the most recent version of the branch you are merging into (probably `dev`).