Update ferc-ferc plant matching with ccai implementation. #3007

zschira · 2023-11-02T19:46:32Z

Closes catalyst-cooperative/ccai-entity-matching#109 .

This PR pulls @katie-lamb's CCAI implementation of the FERC-FERC inter-year plant matching process. This new implementation works well, and seems to be running much faster than the old implementation (~2 seconds vs ~8 seconds on the etl_fast dataset).

Remaining work:

Verify new model runs in full etl
Clean up code after dagster integration
Get review of current implementation after model optimizations and dagster integration
Run validation tests locally

katie-lamb · 2023-11-05T02:31:01Z

This looks good so far! Is it the PCA or the clustering that blows up memory?

katie-lamb · 2023-11-08T16:01:33Z

@zschira I made some somewhat confusing but hopefully helpful plots to better understand the distance between clusters. With the fittedsklearn Agglomerative Clustering model one can make a scipy dendrogram chart (see this example). To understand what a dendrogram is, I thought this example was helpful. That being said, I'm still not entirely sure how to interpret these plots, and would love some other ideas if I've missed the mark here.

These plots are run on the small subset of data I've been using (2000 records), and shows the clusters that are p=40 merges from the final merge. The y-axis shows the distance at which two nodes are merged into one, indicated by a bracket connecting them. I recommend ignoring the labels on the x-axis, but it basically represents the size of each node. The red horizontal line indicates the threshold I've been using for two clusters to be merged (currently .5). The merges made above this line didn't happen and aren't represented in the labels. The merges below this bar are represented by the labels and were clusters that were merged to form a new cluster. I progressively zoomed in on the y-axis so we can see what's going on.

There's a big jump from threshold of 10 to threshold of 2. I think this means that 10 is way too large of a cluster distance threshold.

Zooming in on a y axis of 0 to 1, it still seems like a lot of merging happens at a much smaller distance. Maybe it's an indication that the threshold should actually be lower? Maybe it doesn't matter too much if the threshold is .5 or .1, still thinking about that and not entirely sure what to make of it.

More merging happening in the <.05 range. But that's also to be expected. There's maybe more clustering of larger nodes (bigger on the x-axis), which is good.

It's a little harder to visualize results with the model run on the full dataset, but for the most part it seems like results align with the smaller sample dataset. I ran this with p=20, so 20 merges from the final merge, because it was impossible to tell what was going on with p=40.

zaneselvans · 2023-11-08T16:51:02Z

I'm definitely confused here. With several thousand expected clusters (one for each FERC plant) it seems hard to visualize all of them at once.

Have you made histograms of the cluster sizes in the old vs. new systems?

If you randomly select a plant_id_ferc1 out of the dataframe post ID assignment, do the records look like they belong to the same plant?

zschira · 2023-11-08T19:29:57Z

@zaneselvans I've spot checked a number of plant id's and so far they've all looked like the same plant to me.

I think doing some focused spot checking using the dendrogram as a guide would be interesting. For example, look at clusters that merge just above our threshold at the 0.5-0.6 range and see if they look like the same plant or not, or do the same just below the threshold, and then maybe zoom way in to clusters with very small distances, and see if we can find any that don't seem to be general matches. I guess generally, it would be interesting to find some cases where we're clearly failing (matching plants that shouldn't be matched or not matching plants that should), and see where/why those might have gone wrong.

katie-lamb · 2023-11-09T01:20:52Z

@zaneselvans

I'm definitely confused here. With several thousand expected clusters (one for each FERC plant) it seems hard to visualize all of them at once.

You can think of the dendrograms as a sample of the several thousand clusters that are created by the model. In our case, the p=40 parameter is not that meaningful, since I'm instead using a distance threshold to decide when merging should stop. I think the dendrogram is mostly helpful to understand if the distance threshold for merging is appropriate. As @zschira pointed out, the second step for validating is probably to spot check nodes that are merged right below or right above that .5 mark.

This also verifies why the average distance between records in a cluster is always very small (.05 or less) even when I experiment with a distance threshold in the .2-1 range. The vast majority of merges happen between nodes with a distance <.05, so looking at the average distance between records in a cluster isn't a very helpful metric for verifying whether the threshold is good in the .2-1 range.

Have you made histograms of the cluster sizes in the old vs. new systems?

Yes, currently in the notebook in the CCAI repo. Here's a comparison from the smaller sample dataset (2000 records, I didn't have a screenshot of the full model histogram), where the average new cluster size is ~5.8 and the average old plant_id_ferc1 size is ~4.3. This disparity is more pronounced on the whole dataset, where the new average cluster size is ~7.3.

I'll do some spot checking around the current distance threshold as Zach suggests.

zaneselvans · 2023-11-09T03:40:16Z

I think the y-axis label might belong on the x-axis?

It's interesting that there's a bump at the very high end of the length spectrum (like, one record for every possible year) in the old version, but not in the new version. I wonder why that would have happened, and whether we've actually lost some good long time series, or if they were bad for some reason and the new algorithm does a better job of distinguishing them?

katie-lamb · 2023-11-10T00:45:31Z

I think the y-axis label might belong on the x-axis?

Oh yep, I did that too fast, but switch the label from y-axis to x-axis.

It's interesting that there's a bump at the very high end of the length spectrum (like, one record for every possible year) in the old version, but not in the new version. I wonder why that would have happened, and whether we've actually lost some good long time series, or if they were bad for some reason and the new algorithm does a better job of distinguishing them?

That's a good point, and that disparity at the high end of the length spectrum happens in the full dataset as well. I'll spot check some of those.

…l into entity_matching

katie-lamb

I think this looks good! I left a few small comment in the cross_year module. As sort of a side note, it might be useful for us to keep track of a couple validation checks besides just if there are duplicate report years. That would probably belong in experiment tracking infrastructure and metrics I'm assuming, and would go in with a later PR.

src/pudl/analysis/record_linkage/cross_year.py

katie-lamb

Sorry, didn't finish making all comments at once. Here's a few more things.

src/pudl/analysis/record_linkage/cross_year.py

src/pudl/analysis/record_linkage/classify_plants_ferc1.py

zaneselvans · 2023-11-21T03:45:37Z

I think you might need to add an __init__.py to the new record_linkage subpackge to make it importable, and bring in all of the new modules, and add an import record_linkage into src/pudl/analysis/__init__.py to reflect the new subpackage.

…tive/pudl into entity_matching

test/integration/record_linkage.py

zaneselvans · 2023-12-22T21:54:38Z

It looks like when the modules under pudl.analysis were rearranged, the new import paths were not added to pudl.analysis.__init__.py so they were not importable by other modules, which was causing the use of pudl.analysis.fuel_by_plant functions to fail in the steam table processing.

I'm running the full ETL locally using the etl_full job and also seeing that EPA CEMS is using all 10 of my CPUs rather than limiting itself to a concurrency of 2. Did something get changed that makes that safe? Is the EPA CEMS concurrency setting not working in the Dagster UI?

zaneselvans · 2023-12-22T22:09:57Z

After including the new import paths, I get a failure on the plant parts EIA:

AssertionError: Consistency of all matches across years dipped below 75.0% to 53.5%
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/eia_ferc1_record_linkage.py", line 94, in out_pudl__yearly_assn_eia_ferc1_plant_parts
    connects_ferc1_eia = prettyify_best_matches(
                         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/eia_ferc1_record_linkage.py", line 888, in prettyify_best_matches
    check_match_consistency(
  File "/Users/zane/code/catalyst/pudl/src/pudl/analysis/eia_ferc1_record_linkage.py", line 998, in check_match_consistency
    raise AssertionError(

zaneselvans · 2023-12-22T22:12:24Z

Hmm, attempting to re-run just the EPA CEMS it seems to do them two-by-two. So maybe it was just that the whole EPA CEMS graph job had failed and left the unstarted ghost assets in the UI.

zaneselvans · 2023-12-26T16:16:44Z

@zschira The builds passed last night so this could go into dev cleanly now.

Do we know why the test coverage drops by half a percent between this PR and dev? That seems like kind of a lot.

Edit: pytest wasn't running the tests in test/integration/record_linkage.py because the name of the module didn't end with _test which is required for test discovery.

zaneselvans · 2023-12-07T22:03:02Z

src/pudl/analysis/record_linkage/classify_plants_ferc1.py

+                                column_transform_from_key("name_cleaner_transform"),
+                                column_transform_from_key("string_transform"),
+                            ],
+                            "weight": 2.0,


Why are there weights for these first two features, but not the rest? Do they default to 1.0?

zaneselvans · 2023-12-22T21:51:15Z

src/pudl/analysis/record_linkage/__init__.py

@@ -0,0 +1 @@
+"""This module impolements models for various forms of record linkage."""


There were some failures in the steam table processing due to pudl.analysis.fuel_by_plant not being imported in pudl/analysis/__init__.py and we have a lot of places where we just import a whole module, rather than the individual functions or constants within it, so I feel like adding the imports here for now would help avoid some confusion with that pattern breaking on some modules.

zschira added 3 commits November 2, 2023 15:42

Update ferc-ferc plant matching with ccai implementation.

6c8a86d

Update docstring

a57810a

Make distance estimator param name more descriptive

688b577

zschira mentioned this pull request Nov 6, 2023

How to handle storage/access to pre-trained model weights #3020

Open

zschira added 2 commits November 9, 2023 16:07

Take only report_years to calculate ferc-ferc distance penalty

1878348

Adjust PCA to use less memory

e7686c3

zschira and others added 7 commits November 17, 2023 13:42

Increase ferc-ferc dist threshold

e526516

Generalize CrossYearLinker interface.

4b5bea4

Merge branch 'dev' into entity_matching

4ccdcc6

Update conda-lock.yml and rendered conda environment files.

586b973

Allow configurable column cleaning in generic inter-year linker

f5e4f14

Remove old classify_plants_ferc1 module

3026681

Merge branch 'entity_matching' of github.com:catalyst-cooperative/pud…

be7f28d

…l into entity_matching

katie-lamb approved these changes Nov 20, 2023

View reviewed changes

src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved

src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved

src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved

katie-lamb requested changes Nov 20, 2023

View reviewed changes

src/pudl/analysis/record_linkage/cross_year.py Outdated Show resolved Hide resolved

src/pudl/analysis/record_linkage/classify_plants_ferc1.py Outdated Show resolved Hide resolved

zschira and others added 6 commits November 20, 2023 18:53

Improve docstring in cross-year-linker

211ff32

Add company name cleaner to ferc-ferc matching

a77c7d7

Add revert_filled_in_nulls back to ferc-ferc match

778d059

fix yaml/yml typo in .gitattributes

44f2fd0

Merge branch 'dev' into entity_matching

b76f2d2

Update conda-lock.yml and rendered conda environment files.

54cb3ca

Add __init__ to record_linkage module.

ead2c8e

katie-lamb and others added 11 commits December 19, 2023 17:40

create new migration

7edc408

Improve fuel fraction test generation

dc9f0bd

Merge branch 'entity_matching' of https://github.com/catalyst-coopera…

0faf725

…tive/pudl into entity_matching

Fix typo

ec34f1e

Refine ferc-ferc model parameters

95bb1ca

Add improved docstring

b844f80

Remove inaccurate docstring

7c1d6c3

Add dedicated module for fuel by plant

c6a9496

Add options to all dataframe embedding steps

9e64c30

Change dict access to get()

12472bc

Simplify ferc plant id verification

90bdc59

katie-lamb reviewed Dec 21, 2023

View reviewed changes

test/integration/record_linkage.py Outdated Show resolved Hide resolved

zschira marked this pull request as ready for review December 22, 2023 17:15

Merge branch 'dev' into entity_matching

171504f

Add missing module imports.

665a805

zaneselvans added 3 commits December 24, 2023 08:42

Merge branch 'dev' into entity_matching

39d05ec

Merge branch 'dev' into entity_matching

1cae6bb

Merge branch 'dev' into entity_matching

09b7c7b

Rename record linkage test module so pytest actually runs it.

962fc3d

zaneselvans self-requested a review December 26, 2023 20:05

zaneselvans approved these changes Dec 26, 2023

View reviewed changes

zaneselvans requested a review from katie-lamb December 26, 2023 20:05

zaneselvans merged commit 5ead5b3 into dev Dec 26, 2023
13 of 15 checks passed

zaneselvans deleted the entity_matching branch December 26, 2023 20:22

zaneselvans added ferc1 Anything having to do with FERC Form 1 record-linkage Issues related to connecting related records / entities that don't have explicit IDs or keys. labels Feb 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update ferc-ferc plant matching with ccai implementation. #3007

Update ferc-ferc plant matching with ccai implementation. #3007

zschira commented Nov 2, 2023 •

edited by jdangerx

Loading

katie-lamb commented Nov 5, 2023

katie-lamb commented Nov 8, 2023 •

edited

Loading

zaneselvans commented Nov 8, 2023

zschira commented Nov 8, 2023

katie-lamb commented Nov 9, 2023

zaneselvans commented Nov 9, 2023

katie-lamb commented Nov 10, 2023

katie-lamb left a comment •

edited

Loading

katie-lamb left a comment

zaneselvans commented Nov 21, 2023

zaneselvans commented Dec 22, 2023 •

edited

Loading

zaneselvans commented Dec 22, 2023

zaneselvans commented Dec 22, 2023

zaneselvans commented Dec 26, 2023 •

edited

Loading

zaneselvans Dec 7, 2023

zaneselvans Dec 22, 2023

		@@ -0,0 +1 @@
		"""This module impolements models for various forms of record linkage."""

Update ferc-ferc plant matching with ccai implementation. #3007

Update ferc-ferc plant matching with ccai implementation. #3007

Conversation

zschira commented Nov 2, 2023 • edited by jdangerx Loading

Remaining work:

katie-lamb commented Nov 5, 2023

katie-lamb commented Nov 8, 2023 • edited Loading

zaneselvans commented Nov 8, 2023

zschira commented Nov 8, 2023

katie-lamb commented Nov 9, 2023

zaneselvans commented Nov 9, 2023

katie-lamb commented Nov 10, 2023

katie-lamb left a comment • edited Loading

Choose a reason for hiding this comment

katie-lamb left a comment

Choose a reason for hiding this comment

zaneselvans commented Nov 21, 2023

zaneselvans commented Dec 22, 2023 • edited Loading

zaneselvans commented Dec 22, 2023

zaneselvans commented Dec 22, 2023

zaneselvans commented Dec 26, 2023 • edited Loading

zaneselvans Dec 7, 2023

Choose a reason for hiding this comment

zaneselvans Dec 22, 2023

Choose a reason for hiding this comment

zschira commented Nov 2, 2023 •

edited by jdangerx

Loading

katie-lamb commented Nov 8, 2023 •

edited

Loading

katie-lamb left a comment •

edited

Loading

zaneselvans commented Dec 22, 2023 •

edited

Loading

zaneselvans commented Dec 26, 2023 •

edited

Loading