Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core_eia860m__changelog_generators contains NULL valid_until_date values #3614

Closed
TrentonBush opened this issue May 6, 2024 · 4 comments · Fixed by #3694
Closed

core_eia860m__changelog_generators contains NULL valid_until_date values #3614

TrentonBush opened this issue May 6, 2024 · 4 comments · Fixed by #3694
Labels
bug Things that are just plain broken.

Comments

@TrentonBush
Copy link
Contributor

Describe the bug

core_eia860m__changelog_generators contains 950+ NULL values in the valid_until_date column. I don't think they should exist because this column is derived from report_date and none of those are NULL. Most of the NULLs (700+) are from the last report_date in the series (currently 2023-12-01), but ~200 others exist throughout the time span of the data. Maybe it has to do with the last date in each generator's series?

Bug Severity

This table is encoded as a changelog between two dates -- valid_until_date specifies one of the boundaries. It is hard to interpret the data without that boundary.

To Reproduce

SELECT
    valid_until_date,
    count(*) as count
FROM core_eia860m__changelog_generators
GROUP BY 1
ORDER BY 1 desc

Software Environment?

in pudl.sqlite version v2024.2.6

@TrentonBush TrentonBush added the bug Things that are just plain broken. label May 6, 2024
@cmgosnell
Copy link
Member

Just checked and all of the max dates do indeed have null valid_until_date:

image

because of this line

i tried deleting it and now we have full coverage:
image

@TrentonBush
Copy link
Contributor Author

eia860m_changelog["valid_until_date"] = (
    eia860m_changelog.sort_values(gens_idx, ascending=False)
    .groupby(gen_idx_no_date)["report_date"]
    .transform("shift")
    .fillna(eia860m_changelog.report_date_max)
    .where(eia860m_changelog["report_date"] != eia860m_changelog["report_date_max"])
)

I think that shift/fillna(max_date) combo assumes that all the timeseries are contiguous and end at the most recent reporting date. I think some of them end earlier or have reporting gaps, so the fillna() value might be a little more complicated

@cmgosnell
Copy link
Member

we use pudl.helpers.expand_timeseries before this to make sure there is a continuous time series! and iirc the reason we added this last where clause is because we wanted to define the valid_until_date as the following:

"The record in the changelog is valid until this date. The record is valid from the report_date up until but not including the valid_until_date."

If we want to keep this definition then the where clause should probably stay as is and we should fill the records in the where with report_date + one month.

@TrentonBush
Copy link
Contributor Author

If we want to keep this definition then the where clause should probably stay as is and we should fill the records in the where with report_date + one month.

That makes sense to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants