Cluster comparison to replace old clusters with new #1218

keesbosch1996 · 2023-05-03T08:34:18Z

Hi, I think this is a problem many of you run into, but I have not found a clear answer yet.
When Splink record linkage runs as a monthly process, the clusters change between versions (records are added to clusters, or clusters might split up). However, because your platform is running on data keyed by the old cluster_ids, you cannot simply replace the old clusters with the new ones. Is there a way to check for similarity between the clusters in the two versions of the dataset, preserve the old IDs, and attach them to the new clusters?
Some questions that come with this topic are:

How do you handle deprecated clusters?
What happens if clusters are split up?
What happens if two clusters are merged?

RobinL (Maintainer) · 2023-05-15T06:12:33Z

There is currently no functionality built into Splink to analyse changes in clusters - but I can see why it would be valuable.

There are a few features of the clustering algorithm which may be of interest, though, to help you analyse changes:

The cluster ID assigned to a cluster is the minimum of the unique IDs within the cluster. So if your cluster has nodes B, C and D, it will be assigned cluster ID B.

This means that if a new record with ID A is added to your dataset, which Splink estimates to be the same person as the existing BCD cluster, then the cluster's unique ID will change to A.
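To make the minimum-ID behaviour concrete, here is a minimal pure-Python sketch of connected components with that labelling rule. It is not Splink's implementation; the function name `connected_components` and the toy edges are made up for illustration.

```python
def connected_components(nodes, edges):
    """Label each node with the smallest node ID in its connected component.

    Illustrative union-find sketch, not Splink's own code.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Always keep the smaller ID as the root, so the root of a
            # component is its minimum ID.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {n: find(n) for n in nodes}


# Existing cluster B, C, D: labelled with the minimum ID, B.
before = connected_components(["B", "C", "D"], [("B", "C"), ("C", "D")])

# A new record A links to B: the whole cluster is now labelled A.
after = connected_components(
    ["A", "B", "C", "D"], [("B", "C"), ("C", "D"), ("A", "B")]
)

print(before)  # {'B': 'B', 'C': 'B', 'D': 'B'}
print(after)   # {'A': 'A', 'B': 'A', 'C': 'A', 'D': 'A'}
```

The membership of the old cluster is untouched, yet its label changes, which is why the raw cluster ID is not a stable key across runs.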

The clustering algorithm used is known as connected components. Assuming
(1) a fixed Splink model (i.e. no retraining of parameters),
(2) records are only added to the datasets and never removed, and
(3) existing records do not change,

clusters can only ever become bigger; they can never be split apart. A cluster can grow because a single node joins it (e.g. A, in the example above), or because two or more existing clusters are joined (e.g. a cluster BCD is joined to another cluster XYZ by a new node that links to both B and X).
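Under those three assumptions, every old cluster ends up wholly inside exactly one new cluster, so old IDs can be carried forward with a simple lookup. The sketch below shows one possible way to do that outside Splink. The function `carry_forward_ids`, the "smallest old ID survives a merge" rule, and the input shape (a dict of record ID to cluster ID) are all assumptions made for illustration.

```python
from collections import defaultdict


def carry_forward_ids(old_clusters, new_clusters):
    """Map each new cluster ID to a persistent ID taken from the old run.

    old_clusters / new_clusters: dict of record_id -> cluster_id.
    Returns (stable, retired):
      stable  maps new cluster_id -> persistent ID to expose downstream
      retired maps an old ID absorbed by a merge -> the ID that survived
    Assumes clusters only grow or merge (no splits).
    """
    absorbed = defaultdict(set)
    for record_id, new_id in new_clusters.items():
        if record_id in old_clusters:
            absorbed[new_id].add(old_clusters[record_id])

    stable, retired = {}, {}
    for new_id in set(new_clusters.values()):
        old_ids = sorted(absorbed.get(new_id, ()))
        if not old_ids:
            # Brand-new cluster: no old ID to preserve.
            stable[new_id] = new_id
        else:
            # Hypothetical policy: on a merge, the smallest old ID survives.
            survivor = old_ids[0]
            stable[new_id] = survivor
            for other in old_ids[1:]:
                retired[other] = survivor
    return stable, retired


old = {"B": "B", "C": "B", "D": "B", "X": "X", "Y": "X", "Z": "X"}
# New record A links to both B and X; new record Q stands alone.
new = {r: "A" for r in ["A", "B", "C", "D", "X", "Y", "Z"]}
new["Q"] = "Q"

stable, retired = carry_forward_ids(old, new)
print(stable)   # new cluster A keeps old ID B; Q keeps its own ID
print(retired)  # {'X': 'B'}
```

Keeping the `retired` mapping as an alias table lets downstream systems that still hold the old ID X resolve it to B. If the model is retrained, clusters can split and this lookup is no longer sufficient.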

Hopefully this helps you analyse clusters. One way you can analyse the clusters yourself is to use query_sql, see here.

We will certainly consider adding functionality to compare clusters since I can see that it would be generally useful. If you have a moment, I'd be grateful for any thoughts about specifically what you'd be looking for from such functionality.

mastratton3 May 18, 2023

I'm also very interested in better understanding the options here. I built a record pipeline/process before I found Splink and have a way of enforcing cluster IDs, but it's starting to become an ops burden and I'd like to improve the system.

A couple of specific cases:

In theory the model should change. I have some manual labelers and I intend to use that to improve the model.
Records won't be removed, however if the model improves and finds some false positives, I'd like to split those clusters downstream.

Maybe the answer is more "documentation" than technical as each has tradeoffs, but I've gotten pretty deep in the rabbit hole of trying to find ways to handle this.
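When the model does change, splits become possible, so a first step is simply to label what happened to each old cluster between two runs. Below is a rough sketch of that comparison in plain Python. The function `classify_changes` and its label names are made up for illustration, it is not a Splink feature, and what to do with each label (keep the ID, retire it, mint a new one) is a separate policy decision.

```python
from collections import defaultdict


def classify_changes(old, new):
    """Label each old cluster by what happened to it in the new run.

    old / new: dict of record_id -> cluster_id for the two runs.
    Labels (hypothetical names): unchanged, grown, shrunk, split,
    merged, removed.
    """
    old_members, new_members = defaultdict(set), defaultdict(set)
    for record_id, cluster_id in old.items():
        old_members[cluster_id].add(record_id)
    for record_id, cluster_id in new.items():
        new_members[cluster_id].add(record_id)

    labels = {}
    for old_id, members in old_members.items():
        # New clusters that this old cluster's records now belong to.
        targets = {new[r] for r in members if r in new}
        if not targets:
            labels[old_id] = "removed"
        elif len(targets) > 1:
            labels[old_id] = "split"
        else:
            (target,) = targets
            # Old clusters that feed into the same new cluster.
            sources = {old[r] for r in new_members[target] if r in old}
            if len(sources) > 1:
                labels[old_id] = "merged"
            elif new_members[target] == members:
                labels[old_id] = "unchanged"
            elif new_members[target] > members:
                labels[old_id] = "grown"
            else:
                labels[old_id] = "shrunk"
    return labels


old = {1: 1, 2: 1, 3: 3, 4: 3, 5: 5, 6: 6, 8: 8, 9: 8}
new = {1: 1, 2: 2, 3: 3, 4: 3, 7: 3, 5: 5, 6: 5, 8: 8, 9: 8}

labels = classify_changes(old, new)
print(labels)
# {1: 'split', 3: 'grown', 5: 'merged', 6: 'merged', 8: 'unchanged'}
```

A cluster that both loses records to a split and gains records from another cluster is reported as `split` here, since that check runs first; a real process would probably want to record both facts.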

wpfl-dbt · 2023-07-03T08:31:51Z

@keesbosch1996 did you ever end up implementing Robin's suggestion? I'm restricted to DuckDB and while I can predict just fine, connected components runs out of memory. I want to try batching the clusters then matching up the batches.
