Extract more data from FERC XBRLs and handle that new data in ETL #2821

Merged

Member

@jdangerx jdangerx commented Sep 1, 2023

Changes required to get the FERC 1 assets materializing properly, while pointed at the new XBRL extractor (for a more-complete xbrl2sqlite) and the new XBRL archives.

  • change how we find the filings to run xbrl2sqlite on
  • update how we call ferc-xbrl-extractor - to match what's on the api_compat branch
  • change num_transmission_circuits from int to number - this value contains "0.0" which int doesn't like, but float can handle. Potentially we can do a pre-conversion from string to float, and then convert that to int when we're applying dtypes. This was breaking transmission_statistics_ferc1.
  • drop end-of-previous-year values for the instant table that feeds utility_plant_summary_ferc1
  • add total and other as utility type categories across the board. These were being turned into NAs because they were "unrecognizable" categories, which then led to spurious dupes down the line, and broke electric_plant_depreciation_changes_ferc1.
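
The num_transmission_circuits change above can be illustrated in plain Python. This is a minimal sketch, not PUDL's actual dtype handling: the helper name `parse_circuit_count` is hypothetical, and it only shows why "0.0" fails a direct int conversion and how a pre-conversion through float could recover the integer.

```python
def parse_circuit_count(raw):
    """Hypothetical helper: turn a reported string like "0.0" into an int.

    Returns None for missing values. Raises ValueError if the value has a
    fractional part, since silently truncating it would lose information.
    """
    if raw is None or raw.strip() == "":
        return None
    value = float(raw)  # float() accepts "0.0"; int("0.0") raises ValueError
    if not value.is_integer():
        raise ValueError(f"expected a whole number, got {raw!r}")
    return int(value)
```

Rejecting fractional values instead of truncating them is a design choice made for this sketch; the PR itself simply switches the column to a numeric (float) type.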

One big change we made in the extractor itself was to take multiple filings from one entity and merge them, treating later filings as updates to earlier filings.
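
As a rough illustration of "later filings as updates to earlier filings", here is a minimal sketch in plain Python. The `Filing` class, its fields, and `merge_filings` are all hypothetical names invented for this example; the real extractor works on XBRL facts and likely keys on more than an entity ID.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Filing:
    """Hypothetical stand-in for one XBRL filing from one entity."""

    entity_id: str
    filed_on: date
    facts: dict


def merge_filings(filings):
    """Merge filings per entity, letting later filings overwrite earlier ones.

    Facts that only appear in an earlier filing are kept; facts that appear
    in several filings take the value from the most recent filing.
    """
    merged = {}
    for filing in sorted(filings, key=lambda f: f.filed_on):
        merged.setdefault(filing.entity_id, {}).update(filing.facts)
    return merged
```

For example, if an entity first reports `{"a": 1, "b": 2}` and later refiles `{"b": 3}`, the merged result for that entity under this sketch is `{"a": 1, "b": 3}`.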

TODO:

@jdangerx jdangerx changed the title from "2810 run 2021 ferc 1 data through new more complete extractor" to "Extract more data from FERC XBRLs and handle that new data in ETL" Sep 1, 2023
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch 2 times, most recently from 560b394 to 8762269 September 8, 2023 21:41
@zaneselvans zaneselvans added the ferc1 (Anything having to do with FERC Form 1) and xbrl (Related to the FERC XBRL transition) labels Sep 8, 2023
@zaneselvans zaneselvans linked an issue Sep 11, 2023 that may be closed by this pull request
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 529c10d to 1eaef5a September 13, 2023 16:58
@jdangerx jdangerx marked this pull request as ready for review September 13, 2023 19:35
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch 2 times, most recently from 3d6912b to 8676c93 September 18, 2023 19:03
pyproject.toml (outdated, resolved)
Member Author

@jdangerx jdangerx left a comment

Mostly a TODO list for further review / massage.

src/pudl/output/ferc1.py (outdated, resolved)
src/pudl/output/ferc714.py (outdated, resolved)
src/pudl/transform/classes.py (outdated, resolved)
migrations/versions/11a43f756905_idk.py (outdated, resolved)
migrations/versions/273a78878b74_purchased_storage_mwh.py (outdated, resolved)
src/pudl/analysis/ferc1_eia_train.py (outdated, resolved)
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from cf34e93 to 8a77426 October 3, 2023 20:24
Member Author

jdangerx commented Oct 4, 2023

OK, this is in the commit message, but I went ahead and committed some changes. @aesharpe let me know if these are reasonable:

Totally new:

  • 18012: pjm interconnection, llc / total
  • 18013: new york state electric & gas corporation / see footnote
  • 18014: southwest power pool, inc. / total
  • 18015: public service company of colorado / community solar gardens
  • 18016: the empire district electric company / n/a
  • 18017: wisconsin electric power company / see footnote
  • 18018: upper michigan energy resources company (pudl determined) / total
  • 18019: new york transco, llc / total
  • 18020: wilderness line holdings, llc / total
  • 18021: mt. carmel public utility co / total

Mapped to existing PUDL ID:

  • 8671: pacific gas & electric company, small hydroelectric generating plants
  • 15000: idaho power company / hydro
  • 15001: idaho power company / internal combustion
  • 15068: public service company of colorado / conventional hydro
  • 12926: midamerican energy company / ida grove ii wind farm (8 units at 2.3 mw
    each & 73 units at 2.52 mw each)
  • 1287: alaska electric light and power company / salmon creek hyrdo

Note the misspelling of the plant name in 1287.

Changed:

  • 15031: mt. carmel public utility co / not applicable -> ameren
    illinois company / not applicable

    This record's utility_id_ferc was 222, which corresponds to Ameren,
    not Mt. Carmel (397).

codecov bot commented Oct 4, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (67822df) 88.6% compared to head (2abf505) 88.5%.
Report is 1 commit behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2821     +/-   ##
=======================================
- Coverage   88.6%   88.5%   -0.1%     
=======================================
  Files         90      90             
  Lines      10809   10795     -14     
=======================================
- Hits        9577    9563     -14     
  Misses      1232    1232             
Files                                   Coverage Δ
src/pudl/analysis/ferc1_eia_train.py    53.8% <100.0%> (+0.8%) ⬆️
src/pudl/extract/xbrl.py                95.5% <100.0%> (-1.6%) ⬇️
src/pudl/metadata/classes.py            86.4% <ø> (ø)
src/pudl/output/ferc714.py              96.2% <100.0%> (ø)
src/pudl/transform/classes.py           94.6% <100.0%> (+<0.1%) ⬆️
src/pudl/transform/ferc1.py             96.6% <100.0%> (+<0.1%) ⬆️
src/pudl/transform/params/ferc1.py      100.0% <ø> (ø)
src/pudl/workspace/datastore.py         77.1% <100.0%> (ø)

zschira and others added 10 commits October 6, 2023 09:53
The new extractor added some data to the 2021 XBRL archives, which caused some integration and validation tests to fail. I added some plants to the pudl_id mapping spreadsheet, all of which are considered totals: they are not real plants, but we map them so that they have an ID (they are not connected to EIA records), because this is how we treat other total records reported to FERC1.

This also updates the way that values were assigned to a slice of the ferc1_eia_train output spreadsheets. NA values were causing an issue, so I had to change how the values were being converted.

This also updates the test_minmax_rows test to reflect the new rows in the 2021 data.
There are some missing data due to messy deduplication:
#2822

But we'll do the deduplication better in here:
#2899
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from f3674c0 to 6f37ca8 October 6, 2023 13:53
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 4d35e21 to 58cd41b October 6, 2023 14:17
@jdangerx jdangerx requested a review from aesharpe October 6, 2023 15:38
@jdangerx jdangerx force-pushed the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch from 58cd41b to 2abf505 October 6, 2023 16:53
Member

@zaneselvans zaneselvans left a comment

The integration tests are passing so I assume this isn't an issue, but I was confused by the removal of the ferc_xbrl fixture. Did it get replaced by something previously, but not ripped out?

Comment on lines -266 to -313
@pytest.fixture(scope="session")
def ferc_xbrl(
    live_dbs,
    ferc_to_sqlite_settings,
    pudl_datastore_fixture,
):
    """Extract XBRL filings and produce raw DB+metadata files.

    Extracts a subset of filings for each form for the year 2021.
    """
    if not live_dbs:
        year = 2021

        # Prep datastore
        datastore = FercXbrlDatastore(pudl_datastore_fixture)

        # Set step size for subsetting
        step_size = 5

        for form in XbrlFormNumber:
            raw_archive, taxonomy_entry_point = datastore.get_taxonomy(year, form)

            sqlite_engine = _get_sqlite_engine(form.value, True)

            form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)

            # Extract every fifth filing
            filings_subset = datastore.get_filings(year, form)[::step_size]
            xbrl.extract(
                filings_subset,
                sqlite_engine,
                raw_archive,
                form.value,
                requested_tables=form_settings.tables,
                batch_size=len(filings_subset) // step_size + 1,
                workers=step_size,
                # TODO(janrous): the following should ideally be provided by some
                # ferc dataset metadata object rather than encoding this in settings.
                datapackage_path=PudlPaths().output_file(
                    f"ferc{form.value}_xbrl_datapackage.json"
                ),
                metadata_path=PudlPaths().output_file(
                    f"ferc{form.value}_xbrl_taxonomy_metadata.json"
                ),
                archive_file_path=taxonomy_entry_point,
            )
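
The subsetting arithmetic in the removed fixture can be sketched on its own. This is only an illustration: `subset_and_batch` is a hypothetical name, and it assumes that `batch_size` means the number of filings handed to one worker at a time, which the fixture itself does not confirm.

```python
def subset_and_batch(filings, step_size=5):
    """Take every step_size-th filing and split the subset into batches.

    Mirrors the fixture's arithmetic: batch_size is len(subset) // step_size + 1,
    so the subset splits into at most step_size batches.
    """
    subset = filings[::step_size]
    batch_size = len(subset) // step_size + 1
    batches = [
        subset[start : start + batch_size]
        for start in range(0, len(subset), batch_size)
    ]
    return subset, batch_size, batches
```

With 23 filings and a step size of 5, this keeps 5 filings and uses a batch size of 2, giving 3 batches. Under the stated assumption, there are never more batches than the `step_size` workers.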


Member

Without this fixture, how are the FERC XBRL databases being generated for use in the ETL tests, and how are we doing integration testing to ensure that we're able to extract data from all the forms? Is this just cruft that's been replaced by other fixtures now?

Member Author

These guys are being generated by the ferc_to_sqlite_xbrl_only fixture, now.

@jdangerx jdangerx merged commit e36cec5 into dev Oct 6, 2023
13 checks passed
@zaneselvans zaneselvans deleted the 2810-run-2021-ferc-1-data-through-new-more-complete-extractor branch October 6, 2023 20:23
Labels
ferc1 (Anything having to do with FERC Form 1), xbrl (Related to the FERC XBRL transition)
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Run 2021 FERC 1 data through new, more complete extractor
4 participants