Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing retirement dates from 860m data #2834

Closed
arengel opened this issue Sep 5, 2023 · 6 comments · Fixed by #2835
Closed

Missing retirement dates from 860m data #2834

arengel opened this issue Sep 5, 2023 · 6 comments · Fixed by #2835
Assignees
Labels
bug Things that are just plain broken. data-loss data that we expect should exist seem to be missing or dropped in pudl tables eia860 Anything having to do with EIA Form 860

Comments

@arengel
Copy link
Collaborator

arengel commented Sep 5, 2023

Describe the bug

The planned_generator_retirement_date and generator_retirement_date columns are empty for monthly_update generator data.

Bug Severity

How badly is this bug affecting you?

  • Medium: This bug is preventing me from adopting PUDL for 860m.

To Reproduce

Read the data from the database downloaded Sept 5, 2023 at noon.

import pandas as pd
import sqlalchemy as sa

gens = pd.read_sql_table(
    table_name="generators_eia860",
    con=sa.create_engine("sqlite:////.../pudl.sqlite").connect(),
).query("data_maturity == 'monthly_update'")
gens_denorm = pd.read_sql_table(
    table_name="denorm_generators_eia",
    con=sa.create_engine("sqlite:////.../pudl.sqlite").connect(),
).query("data_maturity == 'monthly_update'")

Then looking at the specific columns, all the values are pd.NaT.

gens.planned_generator_retirement_date.unique()
Out[1]: array(['NaT'], dtype='datetime64[ns]')

gens_denorm.planned_generator_retirement_date.unique()
Out[2]: array(['NaT'], dtype='datetime64[ns]')

gens.generator_retirement_date.unique()
Out[3]: array(['NaT'], dtype='datetime64[ns]')

gens_denorm.generator_retirement_date.unique()
Out[4]: array(['NaT'], dtype='datetime64[ns]')

Expected behavior

These columns would be populated with retirement and planned retirement data from EIA 860 monthly.

Software Environment?

  • Operating System: MacOS 13.5.1 on ARM
  • Python version: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:33:12) [Clang 15.0.7 ]
  • How did you install PUDL?
    • I didn't install PUDL just read the data in from the sqlite DB.
@arengel arengel added the bug Things that are just plain broken. label Sep 5, 2023
@zaneselvans zaneselvans added eia860 Anything having to do with EIA Form 860 data-loss data that we expect should exist seem to be missing or dropped in pudl tables labels Sep 6, 2023
@zaneselvans
Copy link
Member

Whatever the issue ends up being here, we should make sure we add a new data validation test that checks for entirely NULL values in these columns when selecting just monthly update records.

@aesharpe
Copy link
Member

aesharpe commented Sep 6, 2023

The monthly EIA data that gets subsumed into tables like the generators_eia860 is supposed to have the months delineated in the report_date column, correct? Like 2023-06-01? Unless the data_maturity is not monthly_updates in which case it would all say 2023-01-01. I assumed this was already happening, but it looks like the raw EIA860m data extraction does not preserve the month, just the year.

I'm looking at this specific part of process_raw for the EIA860m extractor:

    def process_raw(self, df, page, **partition):
        """Adds source column and report_year column if missing."""
        df = df.rename(columns=self._metadata.get_column_map(page, **partition))
        if "report_year" not in df.columns:
            df["report_year"] = datetime.strptime(
                list(partition.values())[0], "%Y-%m"
            ).year

In the db table all the monthly data has the same date: 2023-01-01 which is wrong. I feel like this issue can't have always been there...am I crazy?

@zaneselvans
Copy link
Member

No, there shouldn't be any monthly resolution data in the generators_eia860 table (even though the report_date column contains a full date)

The original report dates are monthly, since it's a monthly update, but the data that they're reporting is being modeled as having annual frequency within our database, so I think that just retaining the year is appropriate in the extraction here. We're just using the most recent monthly update to indicate an annual value for the new year that hasn't yet been completely reported. And we only ever use a single monthly snapshot from the EIA-860m in the ETL -- we aren't ever extracting more than one year-month of EIA-860m data.

So I think all of the most recently updated EIA-860M data should have 2023-01-01 as its report date.

Is there any other way that the monthly frequency information might be sneaking in? I could see that messing up a merge later on down the line that causes all of these values to get lost.

@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Sep 6, 2023
@aesharpe
Copy link
Member

aesharpe commented Sep 6, 2023

Looks like the issue had to do with the 860m data column maps. The monthly data retirement dates were getting mapped to retirement_year and retirement_month instead of generator_retirement_year and generator_retirement_month in the mapping spreadsheets. Only the generator_retirement column is included in the metadata for the cleaned version of the generators_eia860 so all the values for retirement date from the monthly data were getting dropped.

I updated the spreadsheet to map to generator_retirement_year and generator_retirement_month which should fix the problem.

@zaneselvans
Copy link
Member

I went ahead and merged #2835 so this can be fixed in the builds tonight. @arengel hopefully tomorrow morning you can grab a new DB with the updated retirement dates.

@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Sep 6, 2023
@zaneselvans
Copy link
Member

Unfortunately, we now have a problem with building the Docker container that runs the nightly builds. Looking into it now. We can run the ETL locally and hand off a fresh PUDL DB if need be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken. data-loss data that we expect should exist seem to be missing or dropped in pudl tables eia860 Anything having to do with EIA Form 860
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants