Run 2021 FERC 1 data through new, more complete extractor #2810

jdangerx · 2023-08-28T19:34:45Z

We got the XBRL extractor to drop a lot less data! Let's make sure it works with just one year of data.

catalyst-cooperative/ferc-xbrl-extractor#105

We should point PUDL at the api_compat branch of ferc-xbrl-extractor, which has both the data completeness fixes and some PUDL API compatibility fixes.

We should also use a branch based off of explode_ferc1 to do this testing, because explode_ferc1 has a lot of changes to how we handle FERC1 transformations.

Scope

Give feedback

ferc_to_sqlite tasks complete successfully, targeting 2021 data;
norm_ferc1 asset group completes successfully with 2021 data in 10.5072/zenodo.1234455, while running on the feature branch associated with this issue
data validation tests either pass or are updated
ferc-xbrl-extractor tests are updated to catch future regressions
FERC 714 asset continue to complete successfully
Options

The text was updated successfully, but these errors were encountered:

jdangerx · 2023-09-18T19:22:38Z

I'm getting these integration test failures locally:

=================================== short test summary info ===================================
FAILED test/integration/etl_test.py::test_pudl_engine - pudl.io_managers.ForeignKeyErrors: Foreign key error for table: denorm_plants_all_ferc1 --...
FAILED test/integration/ferc1_eia_train_test.py::test_generate_all_override_spreadsheets - TypeError: boolean value of NA is ambiguous
FAILED test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_plants_in_plants_ferc1] - AssertionError: Found 87 ['utility_id_ferc1', 'plant_name_ferc1']: MultiIndex([(156, 'tota...
FAILED test/integration/glue_test.py::test_for_unmapped_ids_minus_one[check_for_unmmapped_plants_in_plants_ferc1] - AssertionError: Found 88 ['utility_id_ferc1', 'plant_name_ferc1'] but expected 1.
== 4 failed, 79 passed, 4 skipped, 3 xfailed, 5 xpassed, 2895 warnings in 3988.70s (1:06:28) ==

And these validation test failures:

FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fbp_ferc1-25421] - ValueError: fbp_ferc1: found 25423 rows, ex
pected 25421. Off by 0.008%, allowed margin of 0.000%                                                                           
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fuel_ferc1-48841] - ValueError: fuel_ferc1: found 48843 rows, 
expected 48841. Off by 0.004%, allowed margin of 0.000%                                                                         
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plant_in_service_ferc1-311986] - ValueError: plant_in_service_
ferc1: found 311988 rows, expected 311986. Off by 0.001%, allowed margin of 0.000%                                              
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_all_ferc1-54284] - ValueError: plants_all_ferc1: found 
54384 rows, expected 54284. Off by 0.184%, allowed margin of 0.000%                                                             
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_steam_ferc1-30709] - ValueError: plants_steam_ferc1: fo
und 30809 rows, expected 30709. Off by 0.326%, allowed margin of 0.000%                                                         
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-purchased_power_ferc1-197523] - ValueError: purchased_power_fe
rc1: found 197665 rows, expected 197523. Off by 0.072%, allowed margin of 0.000%

jdangerx · 2023-09-18T19:31:25Z

The CI seems to be having some trouble - getting cancellation events out of the blue, usually a few minutes into taxonomy parsing.

@zschira and I noticed that the taxonomy parsing is somehow taking much longer than expected, so we might be running into some resource limits.

In other contexts, the Parsing taxonomy... process takes on the order of 10 seconds - but in the xbrl2sqlite context, we're getting much longer runtimes. For example, locally I have:

2023-09-18 15:24:45 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:257 Parsing taxonomy from <_io.BytesIO object at 0x297292bb0>
2023-09-18 15:26:59 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:150 Finished batch 1/6

aesharpe · 2023-09-18T19:37:39Z

Using branch 2810-run-2021-ferc-1-data-through-new-more-complete-extractor to run tests

aesharpe · 2023-09-21T16:30:26Z

@zaneselvans @zschira

The integration errors mostly pertain to unmapped ferc1 plant ids from total records. There are already 227 total records in the denorm_plants_all_ferc1 table, do we want to:

A) remove all the totals (227+87);

B) map all the 87 totals so we have them all;

C) remove just these 87 totals because there is something ~ different ~ about them seeing as they were recovered from this new extractor.

zschira · 2023-09-21T17:53:50Z

@aesharpe tbh I don't know that I have enough context to give an informed opinion on the existing total records, but my inclination would be to just drop all of them. It seems to be a very small fraction of the records in the table, and I don't personally see any compelling reason why the the new ones should be treated differently from these.

zaneselvans · 2023-09-21T17:56:18Z

Can you explain what an "unmapped ferc1 plant ID from total record" is? When you say "plant ID" do you mean the plant name, which is "total"? Or do you mean the integer plant ID that's algorithmically assigned based on record similarity across years? Are the "total" records getting fed into the plant ID assignment process?

Where are the pre-existing "total" plants coming from? Are those junk records from the small plants table? Or were there also some total records from the large steam plants? Have we historically mapped "total" plant IDs? Or are they associated with unmapped plants?

I'm a little wary of dropping the old totals since the way plants are reported in the DBF data is so messy -- maybe in some cases that's the only data that's available? But at the same time it's probably very low value data, and may also be resulting in double counting of stuff like capacity and generation, if the total is present and also non-totals are present.

If we're going to drop them, it seems like these totals shouldn't be getting dropped from denorm_plants_all_ferc1 but from the tables upstream where they are originating, shouldn't they?

aesharpe · 2023-09-21T19:36:34Z

Can you explain what an "unmapped ferc1 plant ID from total record" is?

When you run the script tox -e get_unmapped_ids on the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch there are unmapped plants. These plants all say total in the plant_name_ferc1 field.

Are the "total" records getting fed into the plant ID assignment process?

I think that's what I'm talking about here...that these total records are appearing in the plant id assignment process.

Where are the pre-existing "total" plants coming from?

Many come from the small plants table but some come from the steam table. All the new total values come from the steam table.

I'm a little wary of dropping the old totals since the way plants are reported in the DBF data is so messy

Generally I agree, but it feels weird to drop some but not all unless we have a very specific reason. Maybe that reason is just the messy-ness of the DBF files?

If we're going to drop them, it seems like these totals shouldn't be getting dropped from denorm_plants_all_ferc1 but from the tables upstream where they are originating, shouldn't they?

Yeah, I was just using this table to look at all the potential "total" records in one place.

zaneselvans · 2023-09-21T20:49:10Z

I agree with @zschira's comment on Slack: if there are totals in DBF, and some totals that already exist and are getting mapped in the XBRL (because they were reported in a way that made them show up with the old extractor) then the new totals that are showing up because we've fixed the extractor to catch the ones that weren't coming through before, they should be retained and mapped for uniformity across these 3 categories.

jdangerx added this to Catalyst Megaproject Aug 24, 2023

jdangerx converted this from a draft issue Aug 28, 2023

jdangerx changed the title ~~Test new extractor against 2021 data & explode_ferc1 branch~~ Run 2021 FERC 1 data through new, less gappy extractor Aug 28, 2023

jdangerx mentioned this issue Aug 28, 2023

Integrate FERC 2021-2022 data with new extractor #2811

Closed

12 tasks

jdangerx changed the title ~~Run 2021 FERC 1 data through new, less gappy extractor~~ Run 2021 FERC 1 data through new, more complete extractor Aug 28, 2023

jdangerx moved this from Backlog to In progress in Catalyst Megaproject Aug 28, 2023

aesharpe mentioned this issue Aug 28, 2023

Integrate FERC Form 1 2022 data #2813

Closed

11 tasks

zaneselvans linked a pull request Sep 11, 2023 that will close this issue

Extract more data from FERC XBRLs and handle that new data in ETL #2821

Merged

5 tasks

jdangerx mentioned this issue Sep 11, 2023

Make sure new XBRL extractor works with FERC explosion #2852

Closed

aesharpe self-assigned this Sep 18, 2023

aesharpe added ferc1 Anything having to do with FERC Form 1 xbrl Related to the FERC XBRL transition labels Sep 18, 2023

This was referenced Sep 25, 2023

Fix inter-table 1:1 dimensions in calculations for XBRL explosion #2890

Merged

Clean up exploded assets and rate base tags #2892

Closed

jdangerx moved this from In progress to In review in Catalyst Megaproject Oct 5, 2023

jdangerx moved this from In review to Done in Catalyst Megaproject Oct 6, 2023

jdangerx closed this as completed Apr 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run 2021 FERC 1 data through new, more complete extractor #2810

Run 2021 FERC 1 data through new, more complete extractor #2810

jdangerx commented Aug 28, 2023 •

edited

Loading

Scope

jdangerx commented Sep 18, 2023

jdangerx commented Sep 18, 2023

aesharpe commented Sep 18, 2023

aesharpe commented Sep 21, 2023 •

edited

Loading

zschira commented Sep 21, 2023

zaneselvans commented Sep 21, 2023

aesharpe commented Sep 21, 2023 •

edited

Loading

zaneselvans commented Sep 21, 2023

Run 2021 FERC 1 data through new, more complete extractor #2810

Run 2021 FERC 1 data through new, more complete extractor #2810

Comments

jdangerx commented Aug 28, 2023 • edited Loading

Scope

jdangerx commented Sep 18, 2023

jdangerx commented Sep 18, 2023

aesharpe commented Sep 18, 2023

aesharpe commented Sep 21, 2023 • edited Loading

zschira commented Sep 21, 2023

zaneselvans commented Sep 21, 2023

aesharpe commented Sep 21, 2023 • edited Loading

zaneselvans commented Sep 21, 2023

jdangerx commented Aug 28, 2023 •

edited

Loading

aesharpe commented Sep 21, 2023 •

edited

Loading

aesharpe commented Sep 21, 2023 •

edited

Loading