Identify common mutations in FERC steam plants data #3176

Closed
7 tasks
zschira opened this issue Dec 18, 2023 · 1 comment

zschira commented Dec 18, 2023

Purpose

We want to settle on a final validation metric to feel "good enough" about merging the FERC-FERC plant matching refactor (#3137) into PUDL. Because there is no ground truth to test on, we are exploring multiple possible testing strategies to give confidence in the model. The two strategies we are currently considering are generative testing (create fake data where we know what records should be matched and see how the model does), and metamorphic testing (apply matching to test dataset, mutate that dataset in ways the model should be able to handle, test again verifying results of the matching have not changed beyond a certain threshold). Both of these approaches require us to be able to accurately characterize the types of mutations present in the actual data, and to accurately simulate those mutations.
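As a rough illustration of the metamorphic check described above, one option is to compare the cluster assignments before and after mutation pair by pair. This is only a sketch, not the PUDL implementation: `pairwise_agreement`, `passes_metamorphic_check`, and the 0.9 threshold are hypothetical, and plain lists of labels stand in for the plant_id_ferc1 values assigned to the same records in the two runs.

```python
from itertools import combinations


def pairwise_agreement(labels_before, labels_after):
    """Fraction of record pairs whose together/apart status is unchanged.

    This is the (unadjusted) Rand index, so it does not care whether the
    cluster IDs themselves were renumbered between runs.
    """
    if len(labels_before) != len(labels_after):
        raise ValueError("both label lists must describe the same records")
    pairs = list(combinations(range(len(labels_before)), 2))
    if not pairs:
        return 1.0
    unchanged = sum(
        (labels_before[i] == labels_before[j]) == (labels_after[i] == labels_after[j])
        for i, j in pairs
    )
    return unchanged / len(pairs)


def passes_metamorphic_check(labels_before, labels_after, threshold=0.9):
    """True if the matching changed less than the (hypothetical) tolerance."""
    return pairwise_agreement(labels_before, labels_after) >= threshold
```

Comparing pairs rather than raw IDs matters here because the model is free to assign different ID numbers to the same grouping after the data is mutated.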

@zaneselvans found a number of cases where the model seems to be mismatching records:

bad_plants = {
    # APS Yucca 3 misses the plant in 2021-2022 solely due to name change and a blip in fuel fraction splits. Everything else is consistent. Seems too sensitive.
    "yucca3": lambda steam: (steam.plant_id_pudl == 1013),
    # The columbia power station in Wisconsin seems genuinely complicated. It's a mishmash of different units with different owners.
    # Sometimes discontinuous years. Getting multiple records from the same year assigned the same plant_id_ferc1.
    # Sometimes big changes in the capacity within the same plant_id_ferc1.
    "columbia": lambda steam: (steam.plant_id_pudl == 124),
    # APS West Phoenix 5, data for 2021-2022 are not associated with other records due to different name.
    "west_phoenix_5": lambda steam: (steam.plant_name_ferc1.str.contains(r"west phoenix.*5", regex=True, case=False)),
    # sterling avenue plant is very consistent except for construction_type, which is usually null but has values in a couple of years,
    # so in those years it gets split off from all the other years, e.g. in 2003
    "sterling": lambda steam: (steam.plant_name_ferc1.str.contains(r"sterling", regex=True, case=False)),
    # Jeffrey Energy Center is getting split up badly due to minor changes in the name
    "jeffrey": lambda steam: (steam.plant_name_ferc1.str.contains(r"jeffrey", regex=True, case=False)) & (steam.utility_id_ferc1==255),
    "jeffrey_8pct": lambda steam: (steam.plant_name_ferc1.str.contains(r"jeffrey.*8\%", regex=True, case=False)) & (steam.utility_id_ferc1==255),
    # Jeffrey Energy Center NOT getting split up when it should due to large variations in capacity
    "jeffrey_capacity": lambda steam: (steam.plant_name_ferc1.str.contains(r"jeffrey energy cntr", regex=True, case=False)) & (steam.utility_id_ferc1==255),
    # Belews Creek record for 2022 gets lost. The only change is gas/coal fuel ratio
    "belews": lambda steam: (steam.plant_name_ferc1.str.contains(r"belews", regex=True, case=False)),
    # Minor name changes split valmy 1&2 records.
    # NA fuel fractions in 2013 split valmy 1&2 records.
    "valmy12": lambda steam: (steam.plant_name_ferc1.str.contains(r"valmy.*1.*2", regex=True, case=False)),
    # Many cases of more than one record from the same year getting assigned the same plant_id_ferc1
    "valmy_duplicate_years": lambda steam: (steam.plant_name_ferc1.eq(r"valmy")),
    # Minor name change + null fuel fractions split this plant:
    "niles": lambda steam: (steam.plant_name_ferc1.str.contains(r"niles")) & (steam.plant_type.eq("combustion_turbine")),
    # Minor name change and flaky fuel categorization splits the plant
    "manatee": lambda steam: (steam.plant_name_ferc1.str.contains(r"manatee")) & (steam.plant_type.eq("steam")),
    # HB Robinson steam plant records get split due to whitespace change in name, flaky capacity & fuel fraction reporting.
    "hb_robinson_steam": lambda steam: (steam.plant_name_ferc1.str.contains(r"h.*b.*robinson", case=False, regex=True)) & (steam.plant_type.eq("steam")),
}

Most of these cases involve fairly minor changes in the spelling or abbreviation of plant names. This is one type of change that we know to be quite common, and we should certainly be testing for it.

Simulating feature columns

For each feature column used in the model, we should roughly characterize how that column varies, and develop a strategy for simulating that variation.

  • plant_name_ferc1

Currently, the generative approach simulates plant names by applying random edits to the name. For each record it generates, it takes the "nominal" plant name, randomly selects a number of edits between 0 and k (k is configurable), then randomly selects a type of edit (add, delete, or replace a character) and applies it. The number of edits is weighted towards 0, so it will sometimes apply k edits but is more likely to apply fewer. When it adds or replaces a character, the new character can be a special character or white space, but is more likely to be a letter. These edits seem roughly in line with what we actually see, but the number of edits and the weighting could be fine-tuned.
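The name-editing procedure described above could look roughly like this. It is a sketch under stated assumptions rather than the code in the refactor: `mutate_name`, `random_char`, the geometric `decay` weighting, the 80% letter probability, the `OTHER_CHARS` set, and the choice to draw a fresh edit type for each edit are all made up for illustration.

```python
import random
import string

LETTERS = string.ascii_lowercase
OTHER_CHARS = " -&.()#"  # hypothetical set of special characters and white space


def random_char(rng, letter_prob=0.8):
    """Pick a new character, favoring letters over special characters."""
    pool = LETTERS if rng.random() < letter_prob else OTHER_CHARS
    return rng.choice(pool)


def mutate_name(name, k, rng, decay=0.5):
    """Apply between 0 and k random single-character edits to a plant name."""
    # Weights shrink geometrically, so 0 edits is the most likely outcome.
    weights = [decay**n for n in range(k + 1)]
    n_edits = rng.choices(range(k + 1), weights=weights)[0]
    chars = list(name)
    for _ in range(n_edits):
        edit = rng.choice(["add", "delete", "replace"])
        if edit == "add" or not chars:
            # An empty name can only grow, so fall back to an insertion.
            chars.insert(rng.randint(0, len(chars)), random_char(rng))
        elif edit == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = random_char(rng)
    return "".join(chars)
```

For example, `mutate_name("jeffrey energy center", k=3, rng=random.Random(42))` returns a string within three single-character edits of the nominal name. Tuning `decay` and `letter_prob` against the real name variations is where the fine-tuning mentioned above would happen.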

  • plant_type

The current strategy for simulating plant type in the generative approach is to randomly select a different plant type from the set of categories for about 1% of records. This might not be the best approach, because there are cases where there actually are multiple plant types for plants with the same name, and these should end up in separate clusters. Maybe it would be better to just randomly nullify a small percentage of records.
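The two options for categorical columns can be put side by side. Both helpers below are hypothetical sketches operating on plain lists, with `rate` standing in for the roughly 1% mutation fraction mentioned above.

```python
import random


def swap_category(values, categories, rate, rng):
    """Current idea: replace about `rate` of entries with a different category."""
    out = []
    for value in values:
        if rng.random() < rate:
            alternatives = [c for c in categories if c != value]
            out.append(rng.choice(alternatives) if alternatives else value)
        else:
            out.append(value)
    return out


def nullify_values(values, rate, rng):
    """Alternative idea: set about `rate` of entries to None instead."""
    return [None if rng.random() < rate else value for value in values]
```

Nullifying never invents a conflicting plant type, so it cannot turn one plant into something that legitimately looks like a different cluster, which is the concern with swapping.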

  • construction_type

Same as plant type.

  • capacity_mw

Capacity is simulated by adding random noise to a subset of records. There do seem to be cases where the capacity varies slightly around a nominal value like this. However, in many cases it is nearly constant, with occasional significant changes when the physical plant actually changes, so the current approach might be insufficient.
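One way to cover both behaviors is to combine small reporting noise with rare, persistent step changes. The sketch below is an assumption about how that could be done, not the existing generator: `simulate_capacity` and all of its default probabilities and magnitudes are hypothetical.

```python
import random


def simulate_capacity(nominal_mw, n_years, rng, jitter_frac=0.01,
                      jitter_prob=0.2, step_prob=0.05, step_frac=0.3):
    """Generate a yearly capacity series with noise and occasional step changes."""
    series = []
    current = nominal_mw
    for _ in range(n_years):
        if rng.random() < step_prob:
            # A physical change (unit added or retired) persists in later years.
            current *= 1 + rng.choice([-1, 1]) * step_frac
        reported = current
        if rng.random() < jitter_prob:
            # Reporting noise affects only this year's value.
            reported *= 1 + rng.uniform(-jitter_frac, jitter_frac)
        series.append(round(reported, 2))
    return series
```

Whether records on either side of a step should still be expected to match is a separate question that the test would need to encode in its expected clusters.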

  • construction_year

Same as plant type.

  • utility_id_ferc1

Same as plant type.

  • fuel_fractions

To simulate fuel fractions, the generative test creates a random L1 unit vector (all of the fractions add up to 1), then applies random noise to the vector. This is certainly not representative of the actual data, since plants don't have a random distribution of fuel types. A better solution might be to randomly select a primary fuel source, then apply some rules for when to select a secondary source.
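A primary-plus-secondary scheme along those lines might look like the following. Everything here is illustrative: the `FUELS` list, the `SECONDARY` pairing rules, and the probabilities are placeholders, not values derived from the FERC data.

```python
import random

FUELS = ["coal", "gas", "oil", "nuclear", "waste"]  # hypothetical fuel categories

# Hypothetical rules for which secondary fuels can accompany each primary fuel.
SECONDARY = {
    "coal": ["gas", "oil"],
    "gas": ["oil"],
    "oil": ["gas"],
    "nuclear": [],
    "waste": ["gas"],
}


def simulate_fuel_fractions(rng, secondary_prob=0.3, max_secondary=0.2):
    """Return fuel fractions summing to 1, dominated by one primary fuel."""
    primary = rng.choice(FUELS)
    fractions = dict.fromkeys(FUELS, 0.0)
    share = 0.0
    options = SECONDARY[primary]
    if options and rng.random() < secondary_prob:
        # Give a small share to one allowed secondary fuel.
        share = rng.uniform(0.0, max_secondary)
        fractions[rng.choice(options)] = share
    fractions[primary] = 1.0 - share
    return fractions
```

The rule table is the part that would need to come from the real data, for example by tabulating which fuel pairs actually co-occur in the steam plants table.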

@zschira zschira moved this from New to In progress in Catalyst Megaproject Dec 18, 2023
@zaneselvans

capacity_mw

For capacity_mw I think there are two kinds of changes that happen, and as a human identifying plant time series, they're different in important ways. Sometimes the capacity of a plant will vary slightly as minor changes or refinements are made, but it's still the same set of generation units. Other times, whole new generation units will be added or old units will be retired, typically resulting in a capacity change that's a significant portion of the overall plant's capacity. When that kind of larger change happens, we should probably be disassociating the two time periods into two different plant_id_ferc1 values, because the two portions of the timeseries no longer represent directly comparable "plants".
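To make the distinction concrete, here is a minimal sketch that splits a plant's yearly capacity series wherever the year-over-year change is large. The function name and the 20% threshold are hypothetical, and the input is assumed to be a year-ordered list of non-null capacities.

```python
def split_on_capacity_change(capacities, threshold=0.2):
    """Label each year with a segment number.

    A new segment starts when capacity changes by more than `threshold`
    (as a fraction of the previous year's value); smaller changes are
    treated as minor refinements of the same set of units.
    """
    segments = []
    segment = 0
    previous = None
    for capacity in capacities:
        if previous is not None and previous > 0:
            if abs(capacity - previous) / previous > threshold:
                segment += 1
        segments.append(segment)
        previous = capacity
    return segments
```

With `[500, 502, 499, 750, 751]` the first three years land in segment 0 and the last two in segment 1, which is the kind of break that would map to a new plant_id_ferc1.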

construction_year & installation_year

The construction_year and installation_year values can indicate similar plant composition changes, since (IIRC) construction_year is the year when the oldest still active unit was put in service, while installation_year is the year that the newest active unit was put in service. So when these numbers change it should represent a retirement or an addition. Seeing construction_year increase while capacity_mw decreases would likely indicate a retirement. And seeing installation_year increase while capacity_mw increases would likely indicate a new addition. Of course both could happen in the same year and make it difficult to understand the net effect, but in any case, that would probably not be a plant we'd consider directly comparable to the previous report_year.
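Those rules of thumb can be written down as a small classifier over two consecutive report years. This is a sketch of the heuristic as described, with a hypothetical function name and labels, and it assumes each record is a dict with non-null construction_year, installation_year, and capacity_mw.

```python
def classify_change(prev, curr):
    """Guess what happened to a plant between two consecutive report years."""
    oldest_moved = curr["construction_year"] > prev["construction_year"]
    newest_moved = curr["installation_year"] > prev["installation_year"]
    delta_mw = curr["capacity_mw"] - prev["capacity_mw"]
    if oldest_moved and newest_moved:
        # Both ends changed, so the net capacity effect is ambiguous.
        return "retirement_and_addition"
    if oldest_moved and delta_mw < 0:
        return "likely_retirement"
    if newest_moved and delta_mw > 0:
        return "likely_addition"
    return "no_clear_change"
```

Any result other than "no_clear_change" would suggest the two years are not directly comparable as the same plant.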

plant_type & construction_type

Unfortunately, plant_type and construction_type are kind of garbage piles as they are initially reported -- they're freeform strings that we do our best to categorize as human beings, and they are not very specific, since a large fraction of the records share just a few values like steam.

utility_id_ferc1

What's the case for allowing records with different utility_id_ferc1 values to be categorized within the same plant_id_ferc1?

@zschira zschira closed this as completed Mar 5, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Mar 5, 2024