Refactor calculation of annualized_respondents_ferc714 #3024

rousik · 2023-11-07T18:53:50Z

This PR changes how annualized FERC-714 respondents are calculated. Instead of deriving distinct report_date values from the large (15M-row) demand_hourly_pa_ferc714 table, we synthesize them from ferc714_settings. The logic for fanning out respondent_id_ferc714 has been replaced with a simple cross product.

This should be output-neutral, but I will need to run the ETL to confirm.

Rather than loading the huge demand_hourly_pa_ferc714 dataset and calculating report_date values from it, we can infer those values from ferc714_settings. Additionally, we can use a cross-product merge to fan out the respondents, rather than the complex procedure used up to this point.
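For readers unfamiliar with the approach, here is a minimal sketch of the cross-product idea, not the actual PUDL implementation. The respondent IDs, the `settings_years` list, and the choice of January 1st as the annual report_date are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-ins: in the real pipeline these would come from
# ferc714_settings and the respondent ID table.
settings_years = [2019, 2020]
respondents = pd.DataFrame({"respondent_id_ferc714": [101, 102, 103]})

# One report_date per requested year (assumed here to be January 1st).
report_dates = pd.DataFrame(
    {"report_date": pd.to_datetime([f"{year}-01-01" for year in sorted(settings_years)])}
)

# Cross product: every respondent is paired with every report_date.
annualized = respondents.merge(report_dates, how="cross")
```

With three respondents and two years, the result has six rows, and no pass over the hourly demand table is needed.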

rousik · 2023-11-07T19:16:08Z

On my local setup with the etl_full.yml assets, I re-ran annualized_respondents_ferc714 and summarized_respondents_ferc714, plus the output assets fipsified_respondents_ferc714 and summarized_demand_ferc714. I compared the outputs of the fipsified/summarized tables and found no differences, so this should be safe to merge.
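A comparison like this can be sketched with pandas' testing helpers. This is only an illustration of the kind of check described, not the tool that was actually used; the `frames_match` helper and the toy frames are hypothetical.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(old: pd.DataFrame, new: pd.DataFrame, keys: list[str]) -> bool:
    """Return True when two tables hold the same rows, ignoring row and column order."""
    old_sorted = old.sort_values(keys).reset_index(drop=True)
    new_sorted = new.sort_values(keys).reset_index(drop=True)
    try:
        assert_frame_equal(old_sorted, new_sorted, check_like=True)
    except AssertionError:
        return False
    return True


# Toy stand-ins for the same table as built on dev and on this branch.
dev_output = pd.DataFrame(
    {
        "respondent_id_ferc714": [101, 102],
        "report_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    }
)
branch_output = dev_output.iloc[::-1]  # same rows, different order

same = frames_match(dev_output, branch_output, ["respondent_id_ferc714", "report_date"])
```

Sorting on the key columns first means a pure reordering of rows does not register as a difference.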

cmgosnell

This looks like a great simplification to me! Although tbh I am a little short on context for any of the 714 processing. I would defer to @zaneselvans on whether there was some special reason why demand_hourly_pa_ferc714 should be the source of truth there instead of the settings (and if so, we should add an inline comment in there).

But hopefully no, and if the validation tests for the 714 outputs (test/validate/service_territory_test) all pass with this change then I think this is great.

codecov · 2023-11-07T20:37:00Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (bfe6203) 92.6% compared to head (942c8c8) 92.6%.
Report is 2 commits behind head on dev.

Additional details and impacted files

@@           Coverage Diff           @@
##             dev   #3024     +/-   ##
=======================================
- Coverage   92.6%   92.6%   -0.0%     
=======================================
  Files        134     134             
  Lines      12570   12560     -10     
=======================================
- Hits       11645   11635     -10     
  Misses       925     925


zaneselvans

One thing that is a bit funny with the FERC-714 data is that all years (2006-2020) are distributed together -- so no matter what years are in the settings, you will have all of the years in the data, at least initially. So the linkage between what years are specified in the settings and what years show up in the data is much more tenuous than for other datasets.

Do we still get identical outcomes from both methods here when the settings only request a single year of data? Or several years, but not all of them?

Looking at the FERC-714 extract and transform modules, I don't see anything that links the years listed in the settings to the years that are actually extracted and transformed. I think it's all the years, all the time, in which case I think the only time you would get identical outputs from these two methods is when you're doing the full ETL, with all available years in the settings as well as the data.
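To make the concern concrete, here is a small sketch under the assumption that the data really is unfiltered. The toy `demand` frame and the single-year `settings_years` are hypothetical.

```python
import pandas as pd

# Hypothetical unfiltered hourly demand: every year is present,
# regardless of what the settings request.
demand = pd.DataFrame(
    {
        "respondent_id_ferc714": [101, 101, 102],
        "report_date": pd.to_datetime(["2006-01-01", "2019-01-01", "2020-01-01"]),
    }
)
settings_years = [2020]  # hypothetical single-year settings

years_from_data = set(demand["report_date"].dt.year.tolist())
years_from_settings = set(settings_years)

# Years the data-derived method would produce but the settings-derived one would not.
extra_years = years_from_data - years_from_settings
```

If nothing filters the data by the settings, `extra_years` is non-empty and the two methods diverge.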

rousik · 2023-12-03T17:19:14Z

I forgot to follow up on this. I assume what we're trying to check is whether both the fast and full runs produce identical results with and without this change.

rousik · 2023-12-05T05:36:15Z

> One thing that is a bit funny with the FERC-714 data is that all years (2006-2020) are distributed together -- so no matter what years are in the settings, you will have all of the years in the data, at least initially. So the linkage between what years are specified in the settings and what years show up in the data is much more tenuous than for other datasets.
>
> Do we still get identical outcomes from both methods here when the settings only request a single year of data? Or several years, but not all of them?
>
> Looking at the FERC-714 extract and transform modules, I don't see anything that links the years listed in the settings to the years that are actually extracted and transformed. I think it's all the years, all the time, in which case I think the only time you would get identical outputs from these two methods is when you're doing the full ETL, with all available years in the settings as well as the data.

I've now verified that the etl_fast outputs for this are identical on this branch and on dev. To further verify, I traced the lineage of the affected dataframes. We're modifying annualized_respondents_ferc714 and removing its former input, demand_hourly_pa_ferc714, which was used to generate the valid dates.

Now, demand_hourly_pa_ferc714 is calculated from raw_ferc714__demand_hourly_pa, which is provided by the extract_ferc714 multi-asset. That multi-asset keeps only rows whose report_yr is among the years in the associated ferc714_settings, see:

pudl/src/pudl/extract/ferc714.py

Line 96 in b188ea2

"report_yr in @ferc714_settings.years"

Hence, using years/dates from settings should result in the same outcome and output-differ confirms this theory.
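The argument can be illustrated with a small sketch that reuses the query string quoted above. The `Ferc714Settings` dataclass, the `filter_to_settings_years` helper, and the toy `raw` frame are hypothetical stand-ins, not the real PUDL objects.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Ferc714Settings:  # hypothetical stand-in for the real settings class
    years: list[int] = field(default_factory=lambda: [2019, 2020])


def filter_to_settings_years(
    raw: pd.DataFrame, ferc714_settings: Ferc714Settings
) -> pd.DataFrame:
    """Keep only rows whose report_yr is listed in the settings."""
    # Same filter expression as quoted from extract/ferc714.py; the @ prefix
    # makes pandas look up the local variable ferc714_settings.
    return raw.query("report_yr in @ferc714_settings.years")


# Toy raw table containing a year (2006) that the settings do not request.
raw = pd.DataFrame(
    {"report_yr": [2006, 2019, 2019, 2020], "respondent_id": [1, 1, 2, 2]}
)
settings = Ferc714Settings()

filtered = filter_to_settings_years(raw, settings)
years_in_data = sorted(filtered["report_yr"].unique().tolist())
```

After the filter, the years present in the data are a subset of the settings years, and they match exactly whenever every requested year actually appears in the raw data, as it does in this toy example.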

zaneselvans · 2023-12-05T14:53:12Z

Ah okay great -- so the intuitive dependency between the years in the settings and the years that actually get processed (after the initial monolithic CSV extraction) already exists. It looks like @e-belfer added it in the dagsterization of the FERC-714 assets.


rousik requested review from zaneselvans and cmgosnell November 7, 2023 18:53

rousik added 2 commits November 7, 2023 11:55

Put back check that report_date is not in respondent_id_ferc714

33490cf

import dataset_settings resource into annualized_respondents_ferc714

a22e0f2

cmgosnell approved these changes Nov 7, 2023


jdangerx added the community label Nov 11, 2023

zaneselvans reviewed Nov 14, 2023


rousik and others added 2 commits December 3, 2023 10:20

Merge remote-tracking branch 'origin/dev' into ferc714-optimizations

cea73e7

Update conda-lock.yml and rendered conda environment files.

2691e8f

Merge remote-tracking branch 'origin/dev' into ferc714-optimizations

097e97f

rousik and others added 2 commits December 5, 2023 11:24

Merge branch 'dev' into ferc714-optimizations

a555217

Update conda-lock.yml and rendered conda environment files.

fc3dc3b

zaneselvans approved these changes Dec 6, 2023


zaneselvans and others added 7 commits December 6, 2023 11:27

Merge branch 'dev' into ferc714-optimizations

5abd34b

Merge branch 'dev' into ferc714-optimizations

bbdbf4b

Update conda-lock.yml and rendered conda environment files.

30ffabd

Merge branch 'ferc714-optimizations' of github.com:catalyst-cooperati…

e7e9a5f

…ve/pudl into ferc714-optimizations

Merge branch 'dev' into ferc714-optimizations

609a227

Bring in conda-lock files from dev

9e455a8

Update conda-lock.yml and rendered conda environment files.

942c8c8

zaneselvans enabled auto-merge (squash) December 12, 2023 20:53

zaneselvans disabled auto-merge December 12, 2023 22:27

zaneselvans enabled auto-merge (squash) December 12, 2023 22:27

zaneselvans disabled auto-merge December 12, 2023 22:41

zaneselvans merged commit 694032d into dev Dec 12, 2023
15 of 16 checks passed

zaneselvans deleted the ferc714-optimizations branch December 12, 2023 22:41

zaneselvans mentioned this pull request Dec 13, 2023

Merge dev into main for 2023-12-13 #3153

Merged
