Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor calculation of annualized_respondents_ferc714 #3024

Merged
merged 15 commits into from
Dec 12, 2023

Conversation

rousik
Copy link
Collaborator

@rousik rousik commented Nov 7, 2023

This PR changes how annualized ferc714 respondents are calculated. Instead of trying to calculate distinct report_date values from large (15M rows) demand_hourly_pa_ferc714 we synthesize these from ferc714_settings. The logic for fanning out respondent_id_ferc714 has been replaced with a trivial cross product.

This should be output-neutral, but I will need to run the ETL to confirm.

Rather than loading huge demand_hourly_pa_ferc714 dataset and calculating
report_date columns from these, we can infer these values from the ferc714_settings.
Additionally, we can use cross product merge to blow out the respondents, rather
than doing the complex procedure that we did up to this point.
@rousik
Copy link
Collaborator Author

rousik commented Nov 7, 2023

on local setup with etl_full.yml assets, I re-ran the annualized_respondents_ferc714, summarized_respondents_ferc714 and output assets fipsified_respondents_ferc714 and summarized_demand_ferc714 and compared the outputs for the fipsified/summarized tables and found no differences so this should be safe to go.

Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a great simplification to me! although tbh i am a little short on context for any of the 714 processing. i would defer to @zaneselvans if there was some special reason for why the demand_hourly_pa_ferc714 should be the source of truth there instead of the settings (and if so we should be an inline comment in there).

But hopefully no, and if the validation tests for the 714 outputs (test/validate/service_territory_test) all pass with this change then I think this is a great.

Copy link

codecov bot commented Nov 7, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (bfe6203) 92.6% compared to head (942c8c8) 92.6%.
Report is 2 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #3024     +/-   ##
=======================================
- Coverage   92.6%   92.6%   -0.0%     
=======================================
  Files        134     134             
  Lines      12570   12560     -10     
=======================================
- Hits       11645   11635     -10     
  Misses       925     925             

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that is a bit funny with the FERC-714 data is that all years (2006-2020) are distributed together -- so no matter what years are in the settings, you will have all of the years in the data, at least initially. So the linkage between what years are specified in the settings and what years show up in the data is much more tenuous than for other datasets.

Do we still get identical outcomes from both methods here when the settings only request a single year of data? Or several years, but not all of them?

Looking at the FERC-714 extract and transform modules, I don't see anything that links the years listed in the settings to the years that are actually extracted and transformed. I think it's all the years, all the time, in which I case I think the only time you would get identical outputs from these two methods is when you're doing the full ETL, with all available years in the settings as well as the data.

@rousik
Copy link
Collaborator Author

rousik commented Dec 3, 2023

I forgot to follow up on this. I assume that what we're trying to see is whether both fast/full runs with this change and without it produces identical results.

@rousik
Copy link
Collaborator Author

rousik commented Dec 5, 2023

One thing that is a bit funny with the FERC-714 data is that all years (2006-2020) are distributed together -- so no matter what years are in the settings, you will have all of the years in the data, at least initially. So the linkage between what years are specified in the settings and what years show up in the data is much more tenuous than for other datasets.

Do we still get identical outcomes from both methods here when the settings only request a single year of data? Or several years, but not all of them?

Looking at the FERC-714 extract and transform modules, I don't see anything that links the years listed in the settings to the years that are actually extracted and transformed. I think it's all the years, all the time, in which I case I think the only time you would get identical outputs from these two methods is when you're doing the full ETL, with all available years in the settings as well as the data.

I've now verified that etl_fast outputs for this are equal on this branch compared to dev. To further verify, I have traced the lineage of the affected dataframes. We're modifying annualized_respondents_ferc714 and removing the former input which was demand_hourly_pa_ferc714 and was used to generate valid dates.

Now, demand_hourly_pa_ferc714 is calculated from raw_ferc714__demand_hourly_pa, which is provided by the extract_ferc714 multi-asset and this one filters by the report_yr matching years from the associated ferc714_settings, see:

"report_yr in @ferc714_settings.years"

Hence, using years/dates from settings should result in the same outcome and output-differ confirms this theory.

@zaneselvans
Copy link
Member

Ah okay great -- so the intuitive dependency between the years in the settings and the years that actually get processed (after the initial monolithic CSV extraction) already exists. It looks like @e-belfer added it in the dagsterization of the FERC-714 assets.

@zaneselvans zaneselvans enabled auto-merge (squash) December 12, 2023 20:53
@zaneselvans zaneselvans enabled auto-merge (squash) December 12, 2023 22:27
@zaneselvans zaneselvans merged commit 694032d into dev Dec 12, 2023
15 of 16 checks passed
@zaneselvans zaneselvans deleted the ferc714-optimizations branch December 12, 2023 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants