Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to run transformation of new year of FERC 1 data and document results #3698

Closed
3 tasks done
Tracked by #3544
cmgosnell opened this issue Jun 25, 2024 · 1 comment
Closed
3 tasks done
Tracked by #3544
Assignees
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates ferc1 Anything having to do with FERC Form 1 rmi

Comments

@cmgosnell
Copy link
Member

cmgosnell commented Jun 25, 2024

Tasks

Note: I used the new metadata only. presumably we will still want to do this #3650

Failures

🔴 core_ferc1__yearly_hydroelectric_plants_sched406 -> Easy 🟢

Stage: enforce_schema (last step!)

Error:

2024-06-25 15:58:00 [    INFO] catalystcoop.pudl.transform.classes:1361 core_ferc1__yearly_hydroelectric_plants_sched406: Enforcing database schema on dataframe.
2024-06-25 15:58:00 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46621 - core_ferc1__yearly_hydroelectric_plants_sched406 - STEP_FAILURE - Execution of step "core_ferc1__yearly_hydroelectric_plants_sched406" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_hydroelectric_plants_sched406"::

ValueError: invalid literal for int() with base 10: '.2175': Error while type casting for column 'project_num'

Why: There is a decimal in the project_num column and we have this column dtyped as an int

Solution: Easy

  1. If we think decimal is fine, change dtype
  2. If we think decimal is type, bespoke replace with non-decimal
    CG tried option 2 locally and... there are more than one of these.. I used removeprefix and it worked locally:
    def transform_main(self, df):
        """Add bespoke removal of duplicate record after standard transform_main."""
        df = super().transform_main(df).pipe(self.targeted_drop_duplicates)
        df.project_num = df.project_num.str.removeprefix(".")
        return df

🔴 core_ferc1__yearly_operating_revenues_sched300 -> Easy 🟢

Stage: apply_xbrl_calculation_fixes (early)

2024-06-25 15:58:37 [    INFO] catalystcoop.pudl.transform.ferc1:2027 core_ferc1__yearly_operating_revenues_sched300: Processing XBRL metadata.
2024-06-25 15:58:37 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46684 - core_ferc1__yearly_operating_revenues_sched300 - STEP_FAILURE - Execution of step "core_ferc1__yearly_operating_revenues_sched300" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_operating_revenues_sched300"::

AssertionError: We've applied 14 calculation fixes while we started with 13. Length of applied and original fixes should be the same.
replace_me=                                                                                                                                               weight utility_type plant_function plant_status
table_name_parent                              xbrl_factoid_parent         table_name                                     xbrl_factoid                                                        
core_ferc1__yearly_operating_revenues_sched300 sales_to_ultimate_consumers core_ferc1__yearly_operating_revenues_sched300 large_or_industrial     1.0          NaN            NaN          NaN
                                                                                                                          small_or_commercial     1.0          NaN            NaN          NaN
add_me=                                                                                                                                                                                  weight utility_type plant_function plant_status
table_name_parent                              xbrl_factoid_parent         table_name                                         xbrl_factoid                                                                                       
core_ferc1__yearly_operating_revenues_sched300 other_operating_revenues    core_ferc1__yearly_operating_revenues_sched300     forfeited_discounts                                    1.0          NaN            NaN          NaN
                                                                                                                              interdepartmental_rents                                1.0          NaN            NaN          NaN
                                                                                                                              miscellaneous_revenue                                  1.0          NaN            NaN          NaN
                                                                                                                              miscellaneous_service_revenues                         1.0          NaN            NaN          NaN
                                                                                                                              other_electric_revenue                                 1.0          NaN            NaN          NaN
                                                                                                                              other_miscellaneous_operating_revenues                 1.0          NaN            NaN          NaN
                                                                                                                              regional_transmission_service_revenues                 1.0          NaN            NaN          NaN
                                                                                                                              rent_from_electric_property                            1.0          NaN            NaN          NaN
                                                                                                                              revenues_from_transmission_of_electricity_of_ot...     1.0          NaN            NaN          NaN
                                                                                                                              sales_of_water_and_water_power                         1.0          NaN            NaN          NaN
                                               sales_to_ultimate_consumers core_ferc1__yearly_sales_by_rate_schedules_sche... commercial_and_industrial                              NaN          NaN            NaN          NaN
delete_me=                                                                                                                                                         weight utility_type plant_function plant_status
table_name_parent                              xbrl_factoid_parent         table_name                                         xbrl_factoid                                                              
core_ferc1__yearly_operating_revenues_sched300 sales_to_ultimate_consumers core_ferc1__yearly_sales_by_rate_schedules_sche... commercial_and_industrial     NaN          NaN            NaN          NaN

Why: It looks like commercial_and_industrial is trying to be added and deleted. Which means many the calculation components for this table changed with the new metadata.

Solution: Easy/Mild
Go track down what changed in the metadata and update xbrl_calculation_component_fixes.csv accordingly. So i double checked and it does look like this commercial_and_industrial calculation component that used to be in the metadata's calculation which we had been removing is now just removed! which is great. it means they fixed something 😎 . So the solution here is just to delete the line in the xbrl_calculation_component_fixes. CG did this locally and the rest of the transform transformed 🟢

🔴 core_ferc1__yearly_plant_in_service_sched204

Stage: pudl.io_managers.FercXBRLSQLiteIOManager.filter_for_freshest_data (vv early)

Error:

2024-06-25 15:58:37 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46714 - core_ferc1__yearly_plant_in_service_sched204 - STEP_FAILURE - Execution of step "core_ferc1__yearly_plant_in_service_sched204" failed.

dagster._core.errors.DagsterExecutionLoadInputError: Error occurred while loading input "raw_xbrl_duration__electric_plant_in_service_204" of step "core_ferc1__yearly_plant_in_service_sched204"::

ValueError: Found 10505 non-null values with apply-diffsmethodology, and 10472 with best-snapshot. apply-diffs shouldn't be more than 0.3% greater than best-snapshot.

Why: Our apply-diffs methodology takes the last non-null value for a fact. Which is probably mostly fine, except when a respondent comes back in to null a previously reported value. This check is to ensure that we don't get too many more non-null records than is suggested with our best-snapshot methodology. In this run we got 0.31% greater non-null records using the apply-diffs methodology as compared to the best-snapshot and we expect that threshold to be only 0.3% or less.

Solutions: Medium
0. (temp) Up the threshold w/o looking just to see if the rest transform passes -> 🟢 it does! wahoo

  1. Go check in on these extra non-null records to actually make a recommendation. Probably one of: up the threshold, change the filter_for_freshest_data method?
  2. Also, update the error message to be more clear.

🔴 core_ferc1__yearly_steam_plants_fuel_sched402 -> Easy 🟢

Stage: assign_record_id (mid)

Error:

2024-06-25 15:58:52 [    INFO] catalystcoop.pudl.transform.classes:1276 core_ferc1__yearly_steam_plants_fuel_sched402: Categorizing string columns using a controlled vocabulary.
2024-06-25 15:58:52 [ WARNING] catalystcoop.pudl.transform.classes:474 fuel_type_code_pudl: Found 1 uncategorized values: {'natural gas/fuel oil'}
2024-06-25 15:58:52 [ WARNING] catalystcoop.pudl.transform.classes:474 fuel_units: Found 3 uncategorized values: {'coal - ton', 'oil- bbl', 'gas - mcf'}
2024-06-25 15:58:52 [    INFO] catalystcoop.pudl.transform.ferc1:3304 core_ferc1__yearly_steam_plants_fuel_sched402: Aggregating 88 rows with duplicate primary keys out of 4136 total rows.
2024-06-25 15:58:52 [    INFO] catalystcoop.pudl.transform.ferc1:3308 core_ferc1__yearly_steam_plants_fuel_sched402: Dropping 352 records with inconsistent fuel units preventing aggregation out of 4136 total rows.
2024-06-25 15:58:52 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46957 - core_ferc1__yearly_steam_plants_fuel_sched402 - STEP_FAILURE - Execution of step "core_ferc1__yearly_steam_plants_fuel_sched402" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_steam_plants_fuel_sched402"::

ValueError: core_ferc1__yearly_steam_plants_fuel_sched402 (xbrl): Found null primary key values.
report_year              False
utility_id_ferc1_xbrl    False
fuel_type_code_pudl       True
plant_name_ferc1         False
dtype: bool

Why: The categorization of the fuel_type_code_pudl during categorize_strings nulled one string which caused this error in defining the record_id bc the fuel type is a part of the pk of the original table/

Solution: Easy
Add the string into the FUEL_CATEGORIES
(CG tried this locally and it fixed the problem)

what hasn't been tested bc of these failures?

  • core_ferc1__yearly_plant_in_service_sched204 is one of the inputs to the exploded/detail table. So all of that stuff feeding into the rate base table isn't yet being tested. All of the intra-table calculations are being tested! Just none of the post-transform metadata/calculation tables or the detialed -> rate base assets
    • okay now that we have at least a local solution to all of the failures above running everything is now possible. Ahhh they are all green circles 🟢 🎉
@cmgosnell cmgosnell changed the title Try to run transformation of new year of FERC 1 data (contingent on extraction) and document results Try to run transformation of new year of FERC 1 data and document results Jun 25, 2024
@cmgosnell cmgosnell self-assigned this Jun 25, 2024
@cmgosnell cmgosnell added ferc1 Anything having to do with FERC Form 1 rmi data-update When fresh data is integrated into PUDL from quarterly or annual updates labels Jun 25, 2024
@cmgosnell cmgosnell moved this from New to Backlog in Catalyst Megaproject Jun 25, 2024
@cmgosnell cmgosnell moved this from Backlog to In progress in Catalyst Megaproject Jun 25, 2024
@cmgosnell
Copy link
Member Author

Okay the tl;dr is that there are four failures. three are vv simple and already fixed locally. the fourth failure needs a little investigation but doesn't scream red flags.

@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Jun 25, 2024
This was referenced Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates ferc1 Anything having to do with FERC Form 1 rmi
Projects
Archived in project
Development

No branches or pull requests

1 participant