Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run 2021 FERC 1 data through new, more complete extractor #2810

Closed
4 of 5 tasks
jdangerx opened this issue Aug 28, 2023 · 8 comments · Fixed by #2821
Closed
4 of 5 tasks

Run 2021 FERC 1 data through new, more complete extractor #2810

jdangerx opened this issue Aug 28, 2023 · 8 comments · Fixed by #2821
Assignees
Labels
ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition

Comments

@jdangerx
Copy link
Member

jdangerx commented Aug 28, 2023

We got the XBRL extractor to drop a lot less data! Let's make sure it works with just one year of data.

catalyst-cooperative/ferc-xbrl-extractor#105

We should point PUDL at the api_compat branch of ferc-xbrl-extractor, which has both the data completeness fixes and some PUDL API compatibility fixes.

We should also use a branch based off of explode_ferc1 to do this testing, because explode_ferc1 has a lot of changes to how we handle FERC1 transformations.

Scope

@jdangerx jdangerx converted this from a draft issue Aug 28, 2023
@jdangerx jdangerx changed the title Test new extractor against 2021 data & explode_ferc1 branch Run 2021 FERC 1 data through new, less gappy extractor Aug 28, 2023
@jdangerx jdangerx changed the title Run 2021 FERC 1 data through new, less gappy extractor Run 2021 FERC 1 data through new, more complete extractor Aug 28, 2023
@jdangerx jdangerx moved this from Backlog to In progress in Catalyst Megaproject Aug 28, 2023
@zaneselvans zaneselvans linked a pull request Sep 11, 2023 that will close this issue
5 tasks
@aesharpe aesharpe self-assigned this Sep 18, 2023
@jdangerx
Copy link
Member Author

I'm getting these integration test failures locally:

=================================== short test summary info ===================================
FAILED test/integration/etl_test.py::test_pudl_engine - pudl.io_managers.ForeignKeyErrors: Foreign key error for table: denorm_plants_all_ferc1 --...
FAILED test/integration/ferc1_eia_train_test.py::test_generate_all_override_spreadsheets - TypeError: boolean value of NA is ambiguous
FAILED test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_plants_in_plants_ferc1] - AssertionError: Found 87 ['utility_id_ferc1', 'plant_name_ferc1']: MultiIndex([(156, 'tota...
FAILED test/integration/glue_test.py::test_for_unmapped_ids_minus_one[check_for_unmmapped_plants_in_plants_ferc1] - AssertionError: Found 88 ['utility_id_ferc1', 'plant_name_ferc1'] but expected 1.
== 4 failed, 79 passed, 4 skipped, 3 xfailed, 5 xpassed, 2895 warnings in 3988.70s (1:06:28) ==

And these validation test failures:

FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fbp_ferc1-25421] - ValueError: fbp_ferc1: found 25423 rows, ex
pected 25421. Off by 0.008%, allowed margin of 0.000%                                                                           
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fuel_ferc1-48841] - ValueError: fuel_ferc1: found 48843 rows, 
expected 48841. Off by 0.004%, allowed margin of 0.000%                                                                         
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plant_in_service_ferc1-311986] - ValueError: plant_in_service_
ferc1: found 311988 rows, expected 311986. Off by 0.001%, allowed margin of 0.000%                                              
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_all_ferc1-54284] - ValueError: plants_all_ferc1: found 
54384 rows, expected 54284. Off by 0.184%, allowed margin of 0.000%                                                             
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_steam_ferc1-30709] - ValueError: plants_steam_ferc1: fo
und 30809 rows, expected 30709. Off by 0.326%, allowed margin of 0.000%                                                         
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-purchased_power_ferc1-197523] - ValueError: purchased_power_fe
rc1: found 197665 rows, expected 197523. Off by 0.072%, allowed margin of 0.000%    

@aesharpe aesharpe added ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition labels Sep 18, 2023
@jdangerx
Copy link
Member Author

The CI seems to be having some trouble - getting cancellation events out of the blue, usually a few minutes into taxonomy parsing.

@zschira and I noticed that the taxonomy parsing is somehow taking much longer than expected, so we might be running into some resource limits.

In other contexts, the Parsing taxonomy... process takes on the order of 10 seconds - but in the xbrl2sqlite context, we're getting much longer runtimes. For example, locally I have:

2023-09-18 15:24:45 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:257 Parsing taxonomy from <_io.BytesIO object at 0x297292bb0>
2023-09-18 15:26:59 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:150 Finished batch 1/6

@aesharpe
Copy link
Member

Using branch 2810-run-2021-ferc-1-data-through-new-more-complete-extractor to run tests

@aesharpe
Copy link
Member

aesharpe commented Sep 21, 2023

@zaneselvans @zschira

The integration errors mostly pertain to unmapped ferc1 plant ids from total records. There are already 227 total records in the denorm_plants_all_ferc1 table, do we want to:

A) remove all the totals (227+87);

B) map all the 87 totals so we have them all;

C) remove just these 87 totals because there is something ~ different ~ about them seeing as they were recovered from this new extractor.

@zschira
Copy link
Member

zschira commented Sep 21, 2023

@aesharpe tbh I don't know that I have enough context to give an informed opinion on the existing total records, but my inclination would be to just drop all of them. It seems to be a very small fraction of the records in the table, and I don't personally see any compelling reason why the the new ones should be treated differently from these.

@zaneselvans
Copy link
Member

Can you explain what an "unmapped ferc1 plant ID from total record" is? When you say "plant ID" do you mean the plant name, which is "total"? Or do you mean the integer plant ID that's algorithmically assigned based on record similarity across years? Are the "total" records getting fed into the plant ID assignment process?

Where are the pre-existing "total" plants coming from? Are those junk records from the small plants table? Or were there also some total records from the large steam plants? Have we historically mapped "total" plant IDs? Or are they associated with unmapped plants?

I'm a little wary of dropping the old totals since the way plants are reported in the DBF data is so messy -- maybe in some cases that's the only data that's available? But at the same time it's probably very low value data, and may also be resulting in double counting of stuff like capacity and generation, if the total is present and also non-totals are present.

If we're going to drop them, it seems like these totals shouldn't be getting dropped from denorm_plants_all_ferc1 but from the tables upstream where they are originating, shouldn't they?

@aesharpe
Copy link
Member

aesharpe commented Sep 21, 2023

Can you explain what an "unmapped ferc1 plant ID from total record" is?

When you run the script tox -e get_unmapped_ids on the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch there are unmapped plants. These plants all say total in the plant_name_ferc1 field.

Are the "total" records getting fed into the plant ID assignment process?

I think that's what I'm talking about here...that these total records are appearing in the plant id assignment process.

Where are the pre-existing "total" plants coming from?

Many come from the small plants table but some come from the steam table. All the new total values come from the steam table.

I'm a little wary of dropping the old totals since the way plants are reported in the DBF data is so messy

Generally I agree, but it feels weird to drop some but not all unless we have a very specific reason. Maybe that reason is just the messy-ness of the DBF files?

If we're going to drop them, it seems like these totals shouldn't be getting dropped from denorm_plants_all_ferc1 but from the tables upstream where they are originating, shouldn't they?

Yeah, I was just using this table to look at all the potential "total" records in one place.

@zaneselvans
Copy link
Member

I agree with @zschira's comment on Slack: if there are totals in DBF, and some totals that already exist and are getting mapped in the XBRL (because they were reported in a way that made them show up with the old extractor) then the new totals that are showing up because we've fixed the extractor to catch the ones that weren't coming through before, they should be retained and mapped for uniformity across these 3 categories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition
Projects
Archived in project
4 participants