[FEAT] Cluster IDs based on node centrality #1720

Open
samnlindsay opened this issue Nov 9, 2023 · 0 comments
Labels: clustering, enhancement (New feature or request)

Is your proposal related to a problem?

Related to discussion in https://github.com/moj-analytical-services/data_linking/issues/329 and #1677

Cluster IDs in Splink used to be integers; as of v3 they are the minimum node ID in the cluster (e.g. dataset_a-__-0001 for a cluster containing records 0001 and 0004 from dataset_a, along with several records from dataset_b and dataset_c).

This change is an improvement in that the same cluster is often assigned the same ID when re-linked in future. However, using a node label as a cluster ID can be misleading, especially as that node is not necessarily representative of the cluster as a whole.

Describe the solution you'd like

Once more graph properties have been included in the cluster generation outputs, node centrality or node degree would be relatively simple metrics for defining the "central" node after which the cluster should be named.

| | Low threshold | Medium threshold | High threshold |
|---|---|---|---|
| Cluster ID | 1014014 | 1014014 | 1015600 |
| Node ID (max degree) | 1015600 | 1015600 | 1015600 |

Currently, the cluster ID can change as nodes are added or removed, because it depends only on the alphabetical ordering of the IDs. The most central node, however, should be the least susceptible to change and would therefore be a more consistent basis for the cluster ID.
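
To make the comparison concrete, here is a minimal sketch in plain Python (not Splink's API) that names each cluster two ways: by its minimum node ID and by its highest-degree node. The function names, the tie-break rule (smallest ID among equal-degree nodes), and the edge lists are all assumptions made for illustration. The node IDs are borrowed from the table above, but the edges between them are invented and do not reflect the real clusters.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into clusters with an iterative depth-first search."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)

    seen = set()
    clusters = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        clusters.append(members)
    return clusters, adjacency


def cluster_ids(edges):
    """Return both candidate IDs for every cluster found in `edges`."""
    clusters, adjacency = connected_components(edges)
    result = []
    for members in clusters:
        # Current behaviour described in this issue: smallest node ID.
        min_id = min(members)
        # Proposed behaviour: highest-degree node. Ties are broken by the
        # smallest ID (an assumption) so the result is deterministic.
        central = min(members, key=lambda n: (-len(adjacency[n]), n))
        result.append(
            {"min_id": min_id, "max_degree_id": central, "size": len(members)}
        )
    return result


# Hypothetical edges: a hub node with one weakly attached low-ID node.
edges_low = [
    ("1015600", "1014014"),  # weak link, assumed to drop out at a high threshold
    ("1015600", "1015601"),
    ("1015600", "1015602"),
    ("1015600", "1015603"),
    ("1015601", "1015602"),
]
edges_high = edges_low[1:]  # same cluster without the weak link

low = cluster_ids(edges_low)
high = cluster_ids(edges_high)
```

In this toy case the minimum-ID name changes from 1014014 to 1015600 once the weak link is removed, while the max-degree name is 1015600 both times. Degree is only the simplest choice of metric; another centrality measure could replace the `key` function without changing the rest of the sketch.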

Describe alternatives you've considered

See discussion here - https://github.com/moj-analytical-services/data_linking/issues/329

samnlindsay added the enhancement and clustering labels on Nov 9, 2023