Identify common mutations in FERC steam plants data #3176
Purpose
We want to settle on a final validation metric so that we can feel "good enough" about merging the FERC-FERC plant matching refactor (#3137) into PUDL. Because there is no ground truth to test against, we are exploring multiple testing strategies to build confidence in the model. The two strategies we are currently considering are generative testing (create fake data where we know which records should be matched, and see how the model does) and metamorphic testing (apply matching to a test dataset, mutate that dataset in ways the model should be able to handle, then run matching again and verify that the results have not changed beyond a certain threshold). Both of these approaches require us to accurately characterize the types of mutations present in the actual data, and to accurately simulate those mutations.
@zaneselvans found a number of cases where the model seems to not be matching records that should match:
Most of these cases involve somewhat minor changes in spelling, or abbreviations in the plant names. This is one type of change that we know to be quite common, and it should certainly be tested for.
Simulating feature columns
For each feature column used in the model, we should roughly characterize how that column varies, and develop a strategy for simulating that variation.
plant_name_ferc1
Currently, the generative approach simulates plant names by applying random edits to the name. Specifically, for each record it generates, it takes the "nominal" plant name, randomly selects a number of edits between 0 and k (k is configurable), then for each edit randomly selects a type (add/delete/replace a character) and applies it. The number of edits is weighted toward 0, so it will sometimes apply k edits, but it is more likely to apply fewer. When it adds or replaces a character, it can select a special character or whitespace for the new character, but it is more likely to select a letter. These edits seem roughly in line with what we actually see, but the number of edits and the weighting could be fine-tuned.
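The random-edit procedure described above could be sketched roughly as follows. The function name, the linear weighting toward 0 edits, and the character pool are illustrative guesses, not the actual implementation:

```python
import random
import string

def mutate_name(name, max_edits=3, rng=None):
    """Apply 0..max_edits random character edits to a plant name.

    The edit count is weighted toward 0, and inserted/replacement
    characters are mostly letters, with occasional whitespace or
    punctuation. Weights and character pool are illustrative.
    """
    rng = rng or random.Random()
    # Weight edit counts so 0 is most likely and max_edits least likely.
    counts = list(range(max_edits + 1))
    weights = [max_edits + 1 - i for i in counts]
    n_edits = rng.choices(counts, weights=weights, k=1)[0]
    # Mostly letters; occasionally a space or special character.
    alphabet = string.ascii_lowercase * 5 + " .-&"
    chars = list(name)
    for _ in range(n_edits):
        op = rng.choice(["add", "delete", "replace"])
        if op == "delete" and chars:
            del chars[rng.randrange(len(chars))]
        elif op == "replace" and chars:
            chars[rng.randrange(len(chars))] = rng.choice(alphabet)
        else:
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
    return "".join(chars)
```

With `max_edits=0` the name passes through unchanged, which makes it easy to sanity-check the generator against unmutated records.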
plant_type
The current strategy for simulating plant type in the generative approach is to randomly select a different plant type from the set of categories for about 1% of records. This might not be the best approach, because there are cases where there actually are multiple plant types for plants with the same name, and these should end up in separate clusters. Maybe it would be better to just randomly nullify a small percentage of records.
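The "randomly nullify a small percentage" alternative could look something like this. The function name and the 1% default are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def nullify_fraction(df, col, frac=0.01, seed=0):
    """Set a random ~frac of values in `col` to NA, leaving the rest intact.

    Sketch of the nullification idea; names and defaults are illustrative.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Draw one uniform value per row; rows below `frac` get nulled.
    mask = rng.random(len(out)) < frac
    out.loc[mask, col] = pd.NA
    return out
```

The same helper would work for construction_type, construction_year, and utility_id_ferc1, since those columns currently use the same mutation strategy.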
construction_type
Same as plant type.
capacity_mw
Capacity is simulated by adding random noise to a subset of records. It does seem like there are cases where the capacity varies slightly around a nominal value like this; however, in many cases it is fairly constant, with occasional significant changes when the physical plant actually changes, so the current approach might be insufficient.
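A closer simulation might combine small noise around a nominal value with rare, large step changes. This is a sketch of that idea; the step probability, jump range, and noise scale below are guesses, not fitted to the data:

```python
import numpy as np

def simulate_capacity(nominal, n_years, step_prob=0.05, noise_sd=0.01, seed=0):
    """Simulate reported capacity_mw over n_years.

    The value is mostly constant with small multiplicative noise, but
    occasionally jumps (e.g. a unit addition or retirement). All
    parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    values = []
    current = nominal
    for _ in range(n_years):
        if rng.random() < step_prob:
            # Rare significant change: rescale by 50%-150%.
            current *= rng.uniform(0.5, 1.5)
        # Small year-to-year reporting noise around the current value.
        values.append(current * (1 + rng.normal(0, noise_sd)))
    return values
```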
construction_year
Same as plant type.
utility_id_ferc1
Same as plant type.
fuel_fractions
To simulate fuel fractions, the generative test creates a random L1 unit vector (all of the fractions sum to 1), then applies random noise to the vector. This is certainly not representative of the actual data, given that plants don't just have a random distribution of fuel types. A better solution might be to randomly select a primary fuel source, then have some rules for when to select a secondary source.
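The primary-fuel alternative could be sketched as below. The fuel categories, the 70%+ primary share, and the 50% chance of a secondary fuel are illustrative assumptions, not rules derived from the data:

```python
import numpy as np

FUELS = ["coal", "gas", "oil", "nuclear", "waste"]  # illustrative categories

def simulate_fuel_fractions(seed=0, noise_sd=0.02):
    """Pick a dominant primary fuel, maybe a secondary fuel, add small
    noise, and renormalize so the fractions sum to 1 (an L1 unit vector).
    A sketch of the primary-fuel idea; all parameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    fracs = dict.fromkeys(FUELS, 0.0)
    primary = rng.choice(FUELS)
    fracs[primary] = rng.uniform(0.7, 1.0)
    if rng.random() < 0.5:  # assume about half of plants burn a second fuel
        secondary = rng.choice([f for f in FUELS if f != primary])
        fracs[secondary] = 1.0 - fracs[primary]
    # Add small non-negative noise, then renormalize back to sum 1.
    noisy = {f: max(v + rng.normal(0, noise_sd), 0.0) for f, v in fracs.items()}
    total = sum(noisy.values())
    return {f: v / total for f, v in noisy.items()}
```

Unlike a uniformly random L1 vector, this produces the skewed, mostly-one-fuel distributions actually seen in plant data, while the renormalization keeps the fractions a valid unit vector.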