Extract more data from FERC XBRLs and handle that new data in ETL #2821
Mostly a TODO list for further review / massage.
OK, this is in the commit message, but I went ahead and committed some changes. @aesharpe let me know if these are reasonable. They fall into three groups: totally new PUDL IDs, plants mapped to an existing PUDL ID, and one changed mapping. Note the misspelling of the plant name in 1287.
Codecov Report

All modified lines are covered by tests ✅

Additional details and impacted files

| Coverage Diff | dev | #2821 | +/- |
|---|---|---|---|
| Coverage | 88.6% | 88.5% | -0.1% |
| Files | 90 | 90 | |
| Lines | 10809 | 10795 | -14 |
| Hits | 9577 | 9563 | -14 |
| Misses | 1232 | 1232 | |
The new extractor added some data to the 2021 XBRL archives, which caused some integration and validation test failures. I added some plants to the pudl_id mapping spreadsheet, all of which are considered totals, i.e., not real plants. We map them for the sake of giving them an ID (they are not connected to EIA records), because this is how we treat other total records reported to FERC1. This also updates the way values are assigned to a slice of the ferc1_eia_train output spreadsheets: NA values were causing an issue, so I had to change how the values were being converted. Finally, this updates the test_minmax_rows test to reflect the new rows in the 2021 data.
Totally new:

* 18012: pjm interconnection, llc / total
* 18013: new york state electric & gas corporation / see footnote
* 18014: southwest power pool, inc. / total
* 18015: public service company of colorado / community solar gardens
* 18016: the empire district electric company / n/a each & 73 units at 2.52 mw each)
* 18017: wisconsin electric power company / see footnote
* 18018: upper michigan energy resources company (pudl determined) / total
* 18019: new york transco, llc / total
* 18020: wilderness line holdings, llc / total
* 18021: mt. carmel public utility co / total

Mapped to existing PUDL ID:

* 8671: pacific gas & electric company, small hydroelectric generating plants
* 15000: idaho power company / hydro
* 15001: idaho power company / internal combustion
* 15068: public service company of colorado / conventional hydro
* 12926: midamerican energy company / ida grove ii wind farm (8 units at 2.3 mw
* 1287: alaska electric light and power company / salmon creek hyrdo

Note the misspelling of the plant name in 1287.

Changed:

* 15031: mt. carmel public utility co / not applicable -> ameren illinois company / not applicable

This one had a mismatch: it was recorded under utility_id_ferc 222, which corresponds to Ameren, not Mt. Carmel (397).
The integration tests are passing so I assume this isn't an issue, but I was confused by the removal of the `ferc_xbrl` fixture. Did it get replaced by something previously, but not ripped out?
@pytest.fixture(scope="session")
def ferc_xbrl(
    live_dbs,
    ferc_to_sqlite_settings,
    pudl_datastore_fixture,
):
    """Extract XBRL filings and produce raw DB+metadata files.

    Extracts a subset of filings for each form for the year 2021.
    """
    if not live_dbs:
        year = 2021

        # Prep datastore
        datastore = FercXbrlDatastore(pudl_datastore_fixture)

        # Set step size for subsetting
        step_size = 5

        for form in XbrlFormNumber:
            raw_archive, taxonomy_entry_point = datastore.get_taxonomy(year, form)

            sqlite_engine = _get_sqlite_engine(form.value, True)

            form_settings = ferc_to_sqlite_settings.get_xbrl_dataset_settings(form)

            # Extract every fifth filing
            filings_subset = datastore.get_filings(year, form)[::step_size]
            xbrl.extract(
                filings_subset,
                sqlite_engine,
                raw_archive,
                form.value,
                requested_tables=form_settings.tables,
                batch_size=len(filings_subset) // step_size + 1,
                workers=step_size,
                # TODO(janrous): the following should ideally be provided by some
                # ferc dataset metadata object rather than encoding this in settings.
                datapackage_path=PudlPaths().output_file(
                    f"ferc{form.value}_xbrl_datapackage.json"
                ),
                metadata_path=PudlPaths().output_file(
                    f"ferc{form.value}_xbrl_taxonomy_metadata.json"
                ),
                archive_file_path=taxonomy_entry_point,
            )
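For anyone reading the removed fixture, the subsetting arithmetic can be checked in isolation. This is a minimal sketch that assumes a made-up list of 23 filing names in place of the real `datastore.get_filings(year, form)` result; only the slice and the `batch_size` expression are taken from the fixture.

```python
# Hypothetical stand-in for datastore.get_filings(year, form).
filings = [f"filing_{i:02d}" for i in range(23)]

step_size = 5

# Every fifth filing, starting with the first one.
filings_subset = filings[::step_size]

# Same expression the fixture passes as batch_size.
batch_size = len(filings_subset) // step_size + 1

print(filings_subset)
print(batch_size)
```

With 23 filings the slice keeps 5 of them, and the batch size works out to 2.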
Without this fixture, how are the FERC XBRL databases being generated for use in the ETL tests, and how are we doing integration testing to ensure that we're able to extract data from all the forms? Is this just cruft that's been replaced by other fixtures now?
These guys are being generated by the `ferc_to_sqlite_xbrl_only` fixture now.
Changes required to get the FERC 1 assets materializing properly while pointed at the new XBRL extractor (for a more-complete xbrl2sqlite) and the new XBRL archives:

* … `xbrl2sqlite` on `ferc-xbrl-extractor`, to match what's on the `api_compat` branch
* … `num_transmission_circuits` from `int` to `number`: this value contains `"0.0"`, which `int` doesn't like but `float` can handle. Potentially we can do a pre-conversion from string to float, and then convert that to int when we're applying dtypes. This was breaking `transmission_statistics_ferc1`.
* … `utility_plant_summary_ferc1` … `total` and `other` as utility type categories across the board. These were being turned into `NA`s because they were "unrecognizable" categories, which then led to spurious dupes down the line and broke `electric_plant_depreciation_changes_ferc1`.
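The string-to-float-to-int pre-conversion suggested for `num_transmission_circuits` could look roughly like the sketch below. `to_int_via_float` is a hypothetical helper written for illustration; it is not part of PUDL or ferc-xbrl-extractor, and the real fix may well live in the dtype-application step instead.

```python
def to_int_via_float(raw: str) -> int:
    """Parse an integer-valued string such as "0.0" or "3" into an int.

    Hypothetical helper, not part of PUDL or ferc-xbrl-extractor.
    """
    value = float(raw)
    # Refuse to silently truncate values like "2.5".
    if not value.is_integer():
        raise ValueError(f"{raw!r} is not integer-valued")
    return int(value)


# Parsing the string directly as an int fails.
direct_error = None
try:
    int("0.0")
except ValueError as err:
    direct_error = err

# Going through float first succeeds.
circuits = to_int_via_float("0.0")

print(direct_error)
print(circuits)
```

The explicit `is_integer()` check is a design choice in this sketch: it keeps the conversion from hiding genuinely fractional values behind a truncating `int()` call.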
One big change we made in the extractor itself was to take multiple filings from one entity and merge them, treating later filings as updates to earlier filings.
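As a rough illustration of "later filings as updates to earlier filings", here is a minimal sketch. The data layout (a date plus a dict of facts per filing), the fact names, and the `merge_filings` function are all assumptions made for this example; the actual logic in ferc-xbrl-extractor, including how filings are ordered, may differ.

```python
from datetime import date

# Hypothetical filings from a single entity: (filing date, reported facts).
filings = [
    (date(2022, 6, 1), {"plant_a_capacity_mw": 95.0}),
    (date(2022, 4, 15), {"plant_a_capacity_mw": 100.0, "plant_b_capacity_mw": 40.0}),
]


def merge_filings(filings):
    """Merge facts across filings, letting later filings overwrite earlier ones."""
    merged = {}
    # Apply filings oldest first so that newer values win on conflict.
    for _, facts in sorted(filings, key=lambda filing: filing[0]):
        merged.update(facts)
    return merged


merged = merge_filings(filings)
print(merged)
```

Here the June filing overrides the April value for `plant_a_capacity_mw`, while `plant_b_capacity_mw` survives from the earlier filing because nothing later replaces it.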
TODO:

* … `ferc-xbrl-extractor` instead of relying on the `ReportDate` fact, but I think that can be a follow-up PR.
* … `extract.xbrl` logic, since we changed the logic quite a lot
* … `api-compat` as ferc-xbrl-extractor 1.0, and point PUDL at that instead of this git ref we have now