Refactor calculation of annualized_respondents_ferc714 #3024
Rather than loading the huge demand_hourly_pa_ferc714 dataset and calculating the report_date column from it, we can infer these values from the ferc714_settings. Additionally, we can use a cross-product merge to fan out the respondents, rather than the complex procedure used up to this point.
on local setup with |
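The cross-product merge described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the `settings_years` list, the toy respondent IDs, and the choice of January 1st as the report date are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the years listed in ferc714_settings.
settings_years = [2019, 2020]

# Hypothetical respondent table; only the ID column matters here.
respondents = pd.DataFrame({"respondent_id_ferc714": [101, 102, 103]})

# One report_date per configured year (January 1st is an assumption).
report_dates = pd.DataFrame(
    {"report_date": pd.to_datetime([f"{year}-01-01" for year in settings_years])}
)

# Cross product: every respondent is paired with every report_date.
annualized = respondents.merge(report_dates, how="cross")
```

With 3 respondents and 2 years, `annualized` has 6 rows, and none of the hourly demand data has to be loaded to build it.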
This looks like a great simplification to me! Although tbh I am a little short on context for any of the 714 processing. I would defer to @zaneselvans on whether there was some special reason why demand_hourly_pa_ferc714 should be the source of truth there instead of the settings (and if so, we should add an inline comment in there).
But hopefully not, and if the validation tests for the 714 outputs (test/validate/service_territory_test) all pass with this change, then I think this is great.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #3024   +/-   ##
=====================================
- Coverage   92.6%   92.6%   -0.0%
=====================================
  Files        134     134
  Lines      12570   12560    -10
=====================================
- Hits       11645   11635    -10
  Misses       925     925
View full report in Codecov by Sentry.
One thing that is a bit funny with the FERC-714 data is that all years (2006-2020) are distributed together -- so no matter what years are in the settings, you will have all of the years in the data, at least initially. So the linkage between what years are specified in the settings and what years show up in the data is much more tenuous than for other datasets.
Do we still get identical outcomes from both methods here when the settings only request a single year of data? Or several years, but not all of them?
Looking at the FERC-714 extract and transform modules, I don't see anything that links the years listed in the settings to the years that are actually extracted and transformed. I think it's all the years, all the time, in which case I think the only time you would get identical outputs from these two methods is when you're doing the full ETL, with all available years in the settings as well as the data.
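To make the concern concrete, here is a small sketch of how the two sources of report dates could disagree if the data always contained every year. The helper names and the toy dates are hypothetical and are not taken from the PUDL codebase.

```python
from datetime import date


def report_dates_from_settings(years):
    """Synthesize one report date per year listed in the settings."""
    return sorted({date(year, 1, 1) for year in years})


def report_dates_from_data(record_dates):
    """Derive report dates from the records that are actually present."""
    return sorted({date(d.year, 1, 1) for d in record_dates})


# Toy data covering every year 2006-2020, regardless of the settings.
data_dates = [date(year, 6, 15) for year in range(2006, 2021)]

# Settings listing all years agree with the data...
full_match = report_dates_from_data(data_dates) == report_dates_from_settings(
    range(2006, 2021)
)
# ...but settings listing only some years do not.
partial_match = report_dates_from_data(data_dates) == report_dates_from_settings(
    [2019, 2020]
)
```

Under these assumptions `full_match` is true and `partial_match` is false, which is the case the question above is asking about.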
I forgot to follow up on this. I assume that what we're trying to see is whether both fast and full runs, with this change and without it, produce identical results.
I've now verified that Now, pudl/src/pudl/extract/ferc714.py Line 96 in b188ea2
Hence, using years/dates from settings should result in the same outcome, and output-differ confirms this theory.
Ah okay great -- so the intuitive dependency between the years in the settings and the years that actually get processed (after the initial monolithic CSV extraction) already exists. It looks like @e-belfer added it in the dagsterization of the FERC-714 assets.
…ve/pudl into ferc714-optimizations
This PR changes how annualized ferc714 respondents are calculated. Instead of trying to calculate distinct report_date values from the large (15M rows) demand_hourly_pa_ferc714, we synthesize these from ferc714_settings. The logic for fanning out respondent_id_ferc714 has been replaced with a trivial cross product.
This should be output-neutral, but I will need to run the ETL to confirm.
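A toy version of the output-neutrality check might look like the sketch below. It is not a substitute for running the ETL; the `hourly` frame, its values, and `settings_years` are made up for illustration, and it assumes report dates fall on January 1st of each year.

```python
import pandas as pd

# Toy stand-in for demand_hourly_pa_ferc714 (the real table is far larger).
hourly = pd.DataFrame(
    {
        "respondent_id_ferc714": [101, 101, 102, 102],
        "report_date": pd.to_datetime(
            ["2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01"]
        ),
    }
)

# Hypothetical years from ferc714_settings.
settings_years = [2019, 2020]

# Previous approach: distinct report_date values taken from the hourly data.
from_data = (
    hourly[["report_date"]]
    .drop_duplicates()
    .sort_values("report_date")
    .reset_index(drop=True)
)

# New approach: report_date values synthesized from the settings.
from_settings = pd.DataFrame(
    {"report_date": pd.to_datetime([f"{y}-01-01" for y in sorted(settings_years)])}
)

# Raises AssertionError if the two approaches disagree.
pd.testing.assert_frame_equal(from_data, from_settings)
```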