One to one clustering #2578

Open
wants to merge 10 commits into
base: master

aymonwuolanne
Contributor

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Related to #2562 and #251.

Give a brief description for the solution you have provided

There is some discussion of the method in #2562. This PR implements an alternative to cluster_pairwise_predictions_at_threshold, called cluster_using_single_best_links (more exciting name suggestions welcome).

The goal of this clustering method is to produce clusters in which each dataset in the source_datasets list contributes at most one record. To do this, each iteration accepts a link only if it is the single best link for both the left id and the right id, and only if accepting it would not put two records from the same dataset into one cluster.
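To make the acceptance rule concrete, here is a minimal pure-Python sketch of the idea. It is not the SQL implementation in this PR: the function name `cluster_single_best_links`, the `source_of` mapping, and the choice to pick the best link per current cluster (rather than per record id) are assumptions made for illustration.

```python
def cluster_single_best_links(edges, source_of, threshold):
    """Illustrative sketch only (hypothetical helper, not Splink's API).

    edges:     iterable of (left_id, right_id, match_probability)
    source_of: dict mapping every record id to its source dataset name
    Returns a dict mapping each record id to a cluster id.
    """
    cluster_of = {rec: rec for rec in source_of}   # each record starts alone
    members = {rec: {rec} for rec in source_of}    # cluster id -> record ids
    # Strongest links first. sorted() is stable, so tied links keep their
    # input order: an arbitrary but deterministic tie-break.
    candidates = sorted(
        (e for e in edges if e[2] >= threshold), key=lambda e: -e[2]
    )

    while True:
        best = {}  # cluster id -> strongest admissible link touching it
        for edge in candidates:
            cl, cr = cluster_of[edge[0]], cluster_of[edge[1]]
            if cl == cr:
                continue  # already in the same cluster
            sources_l = {source_of[m] for m in members[cl]}
            sources_r = {source_of[m] for m in members[cr]}
            if sources_l & sources_r:
                continue  # merging would duplicate a source dataset
            best.setdefault(cl, edge)
            best.setdefault(cr, edge)

        # Accept a link only if it is the best link for both of its ends.
        accepted = [
            e for e in candidates
            if best.get(cluster_of[e[0]]) is e and best.get(cluster_of[e[1]]) is e
        ]
        if not accepted:
            return cluster_of

        for left, right, _ in accepted:
            keep, drop = cluster_of[left], cluster_of[right]
            if keep == drop:
                continue
            for m in members.pop(drop):
                cluster_of[m] = keep
                members[keep].add(m)


source_of = {"A1": "a", "A2": "a", "B1": "b", "B2": "b"}
edges = [("A1", "B1", 0.9), ("A1", "B2", 0.8), ("A2", "B1", 0.7)]
clusters = cluster_single_best_links(edges, source_of, threshold=0.5)
```

In this toy example only the A1-B1 link is accepted. Clustering all links above the threshold would instead chain A2, B1, A1 and B2 into a single cluster containing two records from each dataset.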

To deal with ties (e.g. where A1 links to B1 and to B2 with the same match probability), this implementation uses row_number rather than rank, so one of the edges tied for first is picked arbitrarily (note: not randomly). A good extension would be to implement other options for dealing with ties, but I think most Splink applications have very few ties, especially when term frequency adjustments are being used.
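The difference between the two window functions can be illustrated with a small pure-Python emulation. The helpers `row_number` and `rank` below are hypothetical stand-ins that mimic the SQL semantics for a single partition (here, the links of A1); they are not code from this PR.

```python
# Links from A1, with B1 and B2 tied for first.
edges = [("A1", "B1", 0.9), ("A1", "B2", 0.9), ("A1", "B3", 0.4)]


def row_number(rows):
    """Mimic ROW_NUMBER() OVER (ORDER BY match_probability DESC)."""
    ordered = sorted(rows, key=lambda r: -r[2])
    # Every row gets a distinct number. Among ties, Python's stable sort
    # keeps input order; a SQL engine may pick any order.
    return [(r, i) for i, r in enumerate(ordered, start=1)]


def rank(rows):
    """Mimic RANK() OVER (ORDER BY match_probability DESC)."""
    ordered = sorted(rows, key=lambda r: -r[2])
    out = []
    for i, r in enumerate(ordered, start=1):
        if out and out[-1][0][2] == r[2]:
            out.append((r, out[-1][1]))  # tied rows share a rank
        else:
            out.append((r, i))
    return out


best_by_row_number = [r for r, n in row_number(edges) if n == 1]
best_by_rank = [r for r, n in rank(edges) if n == 1]
```

Filtering on the first position keeps exactly one edge with `row_number` but both tied edges with `rank`, which is why `row_number` guarantees a single best link per id.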

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)
