Cluster comparison to replace old clusters with new #1218
Replies: 2 comments 1 reply
There is currently no functionality built into Splink to analyse changes in clusters, but I can see why it would be valuable. A few features of the clustering algorithm may be of interest, though, to help you analyse changes.

The cluster ID assigned to a cluster is the minimum of the unique IDs within the cluster. So if your cluster has nodes B, C and D, it will be assigned cluster ID B. This means that if a new record with ID A is added to your dataset, and Splink estimates it to be the same person as the existing BCD cluster, then the cluster's ID will change to A.

The clustering algorithm used is known as connected components. Assuming [...] Then clusters can only ever become bigger. They can never be split apart. 'Joining' can mean that a single node joins (e.g. A, in the example above), or that two or more existing clusters are joined (e.g. a cluster BCD is joined to another cluster XYZ by a new node that links to both B and X).

Hopefully this helps you analyse clusters. One way you can analyse the clusters yourself is to use [...]

We will certainly consider adding functionality to compare clusters, since I can see that it would be generally useful. If you have a moment, I'd be grateful for any thoughts about what specifically you'd be looking for from such functionality.
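To make the ID behaviour described above concrete, here is a minimal pure-Python sketch of connected components in which each cluster is labelled with its smallest member ID. It is not Splink's implementation; the function name `cluster_ids` and the toy edges are assumptions made up for illustration.

```python
def cluster_ids(nodes, edges):
    """Connected components; each cluster is labelled with its smallest member ID."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as the root, so the root is always
            # the minimum ID of its cluster.
            lo, hi = (ra, rb) if ra < rb else (rb, ra)
            parent[hi] = lo

    return {n: find(n) for n in nodes}


# Existing cluster B-C-D: labelled with its minimum ID, B.
before = cluster_ids(["B", "C", "D"], [("B", "C"), ("C", "D")])

# A new record A links to D: the same cluster is now labelled A.
after = cluster_ids(["A", "B", "C", "D"], [("B", "C"), ("C", "D"), ("A", "D")])
```

In this toy run the membership of the cluster only grows, but its label changes from B to A, which is why the cluster ID alone cannot be used as a stable key across runs.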
@keesbosch1996 did you ever end up implementing Robin's suggestion? I'm restricted to DuckDB and while I can predict just fine, connected components runs out of memory. I want to try batching the clusters then matching up the batches.
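One possible shape for the batching idea, as a rough pure-Python sketch rather than anything provided by Splink or DuckDB: cluster each batch of edges separately, link the per-batch labels through records that appear in more than one batch, then cluster those labels. The names `components` and `cluster_in_batches` are hypothetical, and the sketch still keeps one label per record in memory, so whether it actually avoids the memory problem would need testing.

```python
def components(edges):
    """Union-find over an iterable of (a, b) edges; returns node -> root label."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return {n: find(n) for n in parent}


def cluster_in_batches(edge_batches):
    # Step 1: cluster each batch on its own; labels are only unique per batch.
    per_batch = [components(batch) for batch in edge_batches]

    # Step 2: a record seen in several batches links those batches' labels.
    first_seen, label_links = {}, []
    for i, labels in enumerate(per_batch):
        for node, root in labels.items():
            label = (i, root)
            label_links.append((label, label))  # self-link keeps isolated labels
            if node in first_seen:
                label_links.append((first_seen[node], label))
            else:
                first_seen[node] = label
    merged = components(label_links)

    # Step 3: final cluster ID = smallest record ID in each merged cluster.
    smallest = {}
    for node, label in first_seen.items():
        root = merged[label]
        smallest[root] = min(node, smallest.get(root, node))
    return {node: smallest[merged[label]] for node, label in first_seen.items()}


batches = [[("B", "C")], [("C", "D"), ("X", "Y")], [("A", "D")]]
result = cluster_in_batches(batches)
```

Here records A to D end up in one cluster even though no single batch contains all of their links, because C and D each appear in two batches.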
Hi, I think this must be a problem that many of you encounter, but I have not found a clear answer yet.
When Splink record linkage runs as a monthly process, the clusters change between versions (records are added to clusters, or clusters might split up). However, because your platform is running on data with the old clusters and their cluster_ids, you cannot simply replace the old clusters with the new ones. Is there a way to check for similarity between the clusters in the two versions of the dataset, preserve the old IDs and attach them to the new clusters?
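As a rough illustration of what such a comparison could look like, the sketch below matches new clusters to old ones by the number of shared records, lets the best match keep the old ID, and reports old IDs that were merged away. It is plain Python, not a Splink feature; `carry_over_ids`, the `new-` prefix for brand-new clusters and the toy data are all assumptions.

```python
from collections import Counter, defaultdict


def carry_over_ids(old, new):
    """old, new: dicts mapping record ID -> cluster ID for each run.

    Returns (stable, retired):
      stable  maps each new cluster ID -> the ID to publish for it
      retired maps old IDs that no cluster kept -> the ID that absorbed them
    """
    overlap = defaultdict(Counter)  # new cluster -> counts of old cluster IDs
    for rec, new_c in new.items():
        if rec in old:
            overlap[new_c][old[rec]] += 1

    # Largest overlaps first, so the best-matching new cluster claims an old ID.
    pairs = sorted(
        ((n, new_c, old_c)
         for new_c, counts in overlap.items()
         for old_c, n in counts.items()),
        key=lambda t: (-t[0], str(t[1]), str(t[2])),
    )
    stable, claimed = {}, set()
    for _, new_c, old_c in pairs:
        if new_c not in stable and old_c not in claimed:
            stable[new_c] = old_c
            claimed.add(old_c)

    retired = {}
    for new_c in set(new.values()):
        if new_c not in stable:
            # Hypothetical naming scheme for clusters with no reusable old ID.
            stable[new_c] = f"new-{new_c}"
        for old_c in overlap[new_c]:
            if old_c not in claimed:
                retired[old_c] = stable[new_c]
    return stable, retired


# Last month: clusters B = {B, C, D} and X = {X, Y}.
old = {"B": "B", "C": "B", "D": "B", "X": "X", "Y": "X"}
# This month: new record A links both clusters, so everything is cluster A.
new = {r: "A" for r in ["A", "B", "C", "D", "X", "Y"]}

stable, retired = carry_over_ids(old, new)
```

In this example the merged cluster keeps the old ID B (the larger overlap), and X is reported as retired in favour of B, so the platform can redirect anything that still refers to X.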
Some questions that come with this topic are: