Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scope 2023 FERC 1 data integration #3544

Closed
5 tasks done
aesharpe opened this issue Apr 5, 2024 · 7 comments
Closed
5 tasks done

Scope 2023 FERC 1 data integration #3544

aesharpe opened this issue Apr 5, 2024 · 7 comments
Assignees
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates ferc1 Anything having to do with FERC Form 1 rmi

Comments

@aesharpe
Copy link
Member

aesharpe commented Apr 5, 2024

Not all of the FERC data is in, but there's enough to test the extractor and see if there are any major changes to the data format that would beget a refactor

Keep this exploration cursory. Don't try and fix any major errors, just SCOPE them. Check-in after attempt to extract new year of data and then proceed to transformation if fixes are strait forward / RMI gives explicit permission to address them.

Success criteria

  • FERC 1 2023 data fully transformed
  • Graphs of FERC 1 data corresponding to failing validation tests

Tasks

  1. ferc1 xbrl zenodo
    zschira
  2. 7 of 8
    ferc1 new-data rmi xbrl
    e-belfer jdangerx
  3. 0 of 3
    ferc1 xbrl
    zschira
  4. 0 of 2
    ferc1 xbrl
    zschira
  5. 3 of 3
    data-update ferc1 rmi
    cmgosnell
@aesharpe aesharpe converted this from a draft issue Apr 5, 2024
@aesharpe aesharpe added ferc1 Anything having to do with FERC Form 1 new-data Requests for integration of new data. rmi labels Apr 5, 2024
@e-belfer e-belfer self-assigned this Apr 9, 2024
@jdangerx
Copy link
Member

jdangerx commented May 7, 2024

FYI @aesharpe

@zschira and I just spent some time looking into these issues - broadly we see three areas of changes: archival, extraction, and downstream transforms.

archival

These changes enable the extractor changes below, which then will fix the issues found in #3575.

We expect this to take roughly 10h.

  • add filename to metadata
  • add taxonomy to metadata
  • name taxonomy zips with full date

extraction

These changes are a little more entangled with the rest of the system - we expect them to take about 15h. Combined with the archiver changes above, this should address all the issues found in #3575 .

  • read taxonomy and filename from the metadata
  • chunk based on which taxonomy is being used

transformation

This is the first time we are encountering a new schema for the FERC1 data - which means the transformations will likely need various fixes to get them running again, in addition to the normal utility mapping, validation tests/data investigation, etc.

Last year, according to Harvest, it took us 240 hours to handle all of the 2022 data - probably about half of that dedicated to FERC (looks like I put 90h or so into it).

It was easier, since we didn't have a new schema to deal with, but also harder, in that it was the first time dealing with multiple years of XBRL data. I would be comfortable budgeting for 120 hours of time to integrate the new FERC data, though I know that sounds like a huge number and I really hope it takes less time.

Overall that means 145 hours of work.

I think we should aim to complete this in June or July - May is already quite booked. As of right now, we see 221 filings for 2023 in the latest archive, which is 85% of the number of filings for 2022. So I do also expect some more filings to trickle in over the next month or so.

@cmgosnell
Copy link
Member

hey @jdangerx when i was trying to compile the general pudl maintenance budget (an internal doc for now!) i looked into the 2022 update and the ferc hours a bit. I pulled out all of the time entries that had either "ferc" or "xbrl" in there and it was just under 60 hours. It included the saving of a bunch of recrds we had previously been dropping during the archiving... which was hopefully a one-off but actually seems pretty representative for FERC weirdness. So I think this estimate is high!

@jdangerx
Copy link
Member

jdangerx commented May 9, 2024

I went through and looked at each of the time entries under software development or data wrangling for the FERC 2022 project - there was a bunch of stuff that didn't explicitly have the words FERC or XBRL that was related to one of those things. Fortunately most of the work was done by me so I understand the cryptic notes I left 😅

It all sums to:

Dealing with utility mapping, FERC transforms, investigating why validation tests were failing and whether that's a real problem: 64h

Dealing with the duplicate data issue from having multiple filings from different years representing the same data: 19h

Additionally Austen looks like she spent about 3h on FERC stuff. So that all comes out to 86h.

Intuitively, a new data schema seems like a big source of uncertainty, I think we should have some buffer above "what we did last year." 120 does still sound high but not outrageous.

To get a sense of how much the schemas really changed, I could probably pull like an hour to:

  • make Datapackages from each taxonomy
  • see what the differences between them actually are: new tables? same tables with new fields? etc.

That would probably be a good investment of our scoping time.

@cmgosnell cmgosnell moved this from Backlog to In progress in Catalyst Megaproject May 15, 2024
@jdangerx
Copy link
Member

There are a total of 10 tables which appear to have some sort of schema change between the 2022-01-01, 2023-04-01, and 2023-11-01 FERC 1 taxonomies.

None of these tables appear to be actively used in our current FERC1 ETL - I pulled the list of used tables from pudl.extract.ferc1.TABLE_NAME_MAP_FERC1. That means I don't expect to have to make big changes to the transform code to handle the new year's data.

We will, however, still have to investigate validation errors for the new data, do the utility/plant mapping, etc. Which will probably take 30-40h depending on how thoroughly we want to investigate the validation errors.

Here's the rundown of the schema changes:

Changes from adding energy storage operations data for small plants

  • list_of_schedules_002_duration
    • added in 2023-04:
      • deprecated_items_abstract: str
      • schedule_energy_storage_operations_small_plants_abstract: str
  • energy_storage_operations_large_plants_414_duration
    • rename in 2023-11: power_purchased_for_storage_operations -> power_purchased_for_storage_operations_large_plants
  • energy_storage_operations_large_plants_totals_414_duration
    • rename in 2023-11: power_purchased_for_storage_operations -> power_purchased_for_storage_operations_large_plants
  • energy_storage_operations_small_plants_419_instant
    • completely new in 2023-04!
    • added fields in 2023-04:
      • publication_time: str
      • date: str
      • filing_name: str
      • entity_id: str
  • energy_storage_operations_small_plants_419_duration
    • completely new in 2023-04!
    • added fields in 2023-04:
      • start_date: str
      • other_expenses: float
      • fuel_used_in_storage_operations: float
      • energy_storage_project_name: str
      • entity_id: str
      • operation_expense_excluding_fuel: float
      • publication_time: str
      • power_purchased_for_storage_operations: float
      • filing_name: str
      • end_date: str
      • project_location: str
      • energy_storage_operations_axis: str
      • functional_classification_energy_storage_operations: str
      • order_number: float
      • maintenance_expense: float
      • project_plant_cost: float
    • field renames in 2023-11:
      • power_purchased_for_storage_operations -> power_purchased_for_storage_operations_small_plants
      • maintenance_expense -> maintenance_expense_energy_storage_operations_small_plants
  • energy_storage_operations_small_plants_totals_419_instant
    • completely new in 2023-04!
    • added fields in 2023-04:
      • publication_time: str
      • energy_storage_operations_axis: str
      • filing_name: str
      • date: str
      • entity_id: str
  • energy_storage_operations_small_plants_totals_419_duration
    • completely new in 2023-04!
    • added fields in 2023-04:
      • start_date: str
      • other_expenses: float
      • fuel_used_in_storage_operations: float
      • end_date: str
      • entity_id: str
      • power_purchased_for_storage_operations: float
      • operation_expense_excluding_fuel: float
      • publication_time: str
      • maintenance_expense: float
      • filing_name: str
      • project_plant_cost: float
    • field renames in 2023-11:
      • power_purchased_for_storage_operations -> power_purchased_for_storage_operations_small_plants
      • maintenance_expense -> maintenance_expense_energy_storage_operations_small_plants

Removing electricity sales by rate schedules in C&I in 2023-04

  • sales_of_electricity_by_rate_schedules_account_442_total_commercial_and_industrial_304_instant
    • completely removed in 2023-04
  • sales_of_electricity_by_rate_schedules_account_442_total_commercial_and_industrial_304_duration
    • completely removed in 2023-04
  • sales_of_electricity_by_rate_schedules_account_totals_304_duration
    • removed fields in 2023-04:
      • commercial_and_industrial_sales: float
      • megawatt_hours_sold_commercial_and_industrial_sales: float
      • kilowatt_hours_of_sales_per_customer_commercial_and_industrial_sales: float
      • megawatt_hours_sold_commercial_and_industrial_sales_unbilled: float
      • average_number_of_customers_per_month_commercial_and_industrial_sales: int
      • revenue_per_kilowatt_hour_sold_commercial_and_industrial_sales_billed: float
      • kilowatt_hours_of_sales_per_customer_commercial_and_industrial_sales_billed: float
      • revenue_per_kilowatt_hour_sold_commercial_and_industrial_sales_unbilled: float
      • megawatt_hours_sold_commercial_and_industrial_sales_billed: float
      • commercial_and_industrial_sales_unbilled: float
      • average_number_of_customers_per_month_commercial_and_industrial_sales_billed: int
      • revenue_per_kilowatt_hour_sold_commercial_and_industrial_sales: float
      • commercial_and_industrial_sales_billed: float

@aesharpe aesharpe added data-update When fresh data is integrated into PUDL from quarterly or annual updates and removed new-data Requests for integration of new data. labels Jun 17, 2024
@e-belfer e-belfer removed their assignment Jun 17, 2024
@zaneselvans
Copy link
Member

Looks like this issue can be closed?

Maybe we want to ask @arengel or @jrea-rmi whether the brand new energy storage tables are of interest.

@arengel
Copy link
Collaborator

arengel commented Aug 7, 2024

@zaneselvans, this data looks to be useful but it's not critical for any near-term work I am aware of. So I think we are likely to want to get the storage tables incorporated in the future, though not now. But I do have 2 questions related to this data specifically and decisions like this generally:

  1. On the energy storage tables, do they have capacity in MW and duration / capacity in MWh? If not does another table have that data in a way that doesn't require a a matching exercise to pull in?

  2. Do you see incorporating this table as fitting under PUDL maintenance / updates or would this be additional, at least as our arrangements stands now? I assume that a future PUDL patron setup would define how decisions about and payment for this sort of work would... work.

@zaneselvans
Copy link
Member

@arengel taking a quick look at the tables in Datasette:

I note that:

  • As usual, the small plants look messy. Lots of freeform strings and multi-row records organized with an ordering column.
  • As with most of the FERC data vs. EIA, it's more focused on costs, plus some overall aggregate values for energy stored / discharged, power purchases, total amount of energy lost, etc.
  • Per-plant operational characteristics like capacity (MW) and MWh of storage don't seem to be in here.

If you want those attributes, I suspect you should look at the EIA generator + energy storage data instead:

Currently the generic generator fields like capacity_mw and (somewhat oddly) energy_storage_capacity_mwh lives in the generic generator tables like core_eia860__scd_generators, and only the storage, wind, and solar specific generator fields fall into the specific storage, wind, and solar generator tables so you would need to join them together to combine energy_storage_capacity_mwh and max_charge_rate_mw and max_discharge_rate_mw to be able to calculate the duration of storage available (presumably energy_storage_capacity_mwh / max_discharge_rate_mw?)

I could definitely see creating a dedicated energy storage output table so that this kind of information is available off the shelf without introducing a huge number of additional columns into the big generator output table that only pertain to storage, wind, or solar.

I think that adding these tables would probably fall into the "stretch goals" part of what we're envisioning. It's an incremental improvement beyond the minimum viable PUDL maintenance, but should be low-risk and of general utility. So, if there are enough supporters beyond the baseline requirements to keep the PUDL data flowing, adding these tables would be on the menu of improvements those beyond-baseline funds could be allocated to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-update When fresh data is integrated into PUDL from quarterly or annual updates ferc1 Anything having to do with FERC Form 1 rmi
Projects
Archived in project
Development

No branches or pull requests

6 participants