Cluster IDs in splink used to be integers and are now (v3) the minimum node ID in the cluster (e.g. `dataset_a-__-0001` for the cluster containing records `0001` and `0004` from `dataset_a`, and several records from `dataset_b` and `dataset_c`).
This change is an improvement in itself, because the same cluster will often be assigned the same ID when re-linked in future. However, using a node label as a cluster ID can be misleading, especially as that node is not necessarily representative of the cluster as a whole.
Describe the solution you'd like
Once more graph properties have been included in the cluster generation outputs, node centrality or node degree would be relatively simple metrics for defining the "central" node after which the cluster should be named.
| | Low threshold | Medium threshold | High threshold |
|---|---|---|---|
| Cluster ID | 1014014 | 1014014 | 1015600 |
| Node ID (max degree) | 1015600 | 1015600 | 1015600 |
Currently, the cluster ID can change whenever nodes are added or removed, because it depends only on the alphabetical ordering of the node IDs. The most central node, however, should be the least susceptible to change and would therefore be a more consistent basis for the cluster ID.
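To make the comparison concrete, here is a minimal sketch in plain Python of the two naming rules, assuming the edges retained at a given match threshold are available as pairs of node IDs. It is not splink's implementation: `connected_components`, `cluster_labels` and the edge lists are hypothetical, and only the IDs `1014014` and `1015600` come from the table above (the other node IDs are made up for illustration).

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into clusters with an iterative depth-first search."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        clusters.append(members)
    return clusters, adjacency


def cluster_labels(edges):
    """Return both candidate cluster IDs for every cluster in the graph."""
    clusters, adjacency = connected_components(edges)
    labels = []
    for members in clusters:
        # Current behaviour described above: alphabetically smallest node ID.
        min_id = min(members)
        # Proposed behaviour: node with the highest degree.
        # Ties are broken by the smallest ID so the result is deterministic.
        central = min(members, key=lambda n: (-len(adjacency[n]), n))
        labels.append({"min_id": min_id, "max_degree_id": central})
    return labels


# Hypothetical star-shaped cluster centred on node 1015600.
edges = [
    ("1015600", "1014014"),
    ("1015600", "1016001"),
    ("1015600", "1017222"),
]
before = cluster_labels(edges)

# Linking one new record with an alphabetically smaller ID to a peripheral node.
after = cluster_labels(edges + [("1013000", "1014014")])

print(before)
print(after)
```

In this toy example the minimum-ID label moves from `1014014` to `1013000` once the new record is linked, while the max-degree label stays at `1015600`. Degree ties still need a tie-break rule (smallest ID here), so the max-degree label is more stable in this sketch but not guaranteed to be fixed.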
Is your proposal related to a problem?
Related to discussion in https://github.com/moj-analytical-services/data_linking/issues/329 and #1677
Describe alternatives you've considered
See discussion here - https://github.com/moj-analytical-services/data_linking/issues/329