Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
Related to #2562 and #251.
Give a brief description for the solution you have provided
There is some discussion of the method in #2562. This PR implements an alternative to `cluster_pairwise_predictions_at_threshold`, which is called `cluster_using_single_best_links` (more exciting name suggestions welcome).

The goal of this clustering method is to produce clusters where, for each cluster and for each dataset in the `source_datasets` list, at most one record from that dataset can be in the cluster. To do this, at each iteration it only accepts a link if it is the single best link for both the left id and the right id, and if accepting that link will not create any duplicates.

To deal with ties (e.g. where A1 links to B1 and to B2 with the same match probability), this implementation uses `row_number` rather than `rank`. `row_number` arbitrarily (note: not randomly) picks one of the edges tied for first. A good extension of this would be to implement some other options for dealing with ties, but I think in most Splink applications there are very few ties, especially when term frequency adjustments are being used.

PR Checklist
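For reviewers who want a concrete picture of the method described above, here is a minimal pure-Python sketch. It is not the SQL implementation in this PR. The function name `single_best_links_sketch`, its arguments, and the choice to drop duplicate-creating links before ranking are all assumptions made for illustration. Ties are broken by sort order on the ids, which mimics the arbitrary but deterministic behaviour of `row_number`.

```python
def single_best_links_sketch(edges, node_dataset, source_datasets, threshold):
    """Illustrative only. edges: iterable of (left_id, right_id, match_probability)."""
    restricted = set(source_datasets)
    cluster_of = {n: n for n in node_dataset}   # node id -> cluster representative
    members = {n: {n} for n in node_dataset}    # representative -> member node ids

    def datasets_in(cluster_id):
        # Datasets (among the restricted ones) already present in a cluster
        return {node_dataset[n] for n in members[cluster_id]} & restricted

    def would_duplicate(c1, c2):
        # Merging is forbidden if both clusters already hold a record
        # from the same restricted dataset
        return bool(datasets_in(c1) & datasets_in(c2))

    # Highest probability first; ties broken by ids (arbitrary, not random)
    ordered = sorted(
        (e for e in edges if e[2] >= threshold),
        key=lambda e: (-e[2], e[0], e[1]),
    )

    while True:
        # Assumption: links inside a cluster, or links that would create
        # duplicates, are dropped before ranking
        candidates = [
            e for e in ordered
            if cluster_of[e[0]] != cluster_of[e[1]]
            and not would_duplicate(cluster_of[e[0]], cluster_of[e[1]])
        ]
        # First candidate seen per node is that node's single best link
        best = {}
        for e in candidates:
            best.setdefault(e[0], e)
            best.setdefault(e[1], e)

        merged_any = False
        for e in candidates:
            left, right, _ = e
            if best[left] != e or best[right] != e:
                continue  # not the best link for both endpoints
            cl, cr = cluster_of[left], cluster_of[right]
            # Re-check against merges already made in this iteration
            if cl == cr or would_duplicate(cl, cr):
                continue
            for n in members[cr]:
                cluster_of[n] = cl
            members[cl] |= members.pop(cr)
            merged_any = True

        if not merged_any:
            return cluster_of


# A1 ties between B1 and B2; B2 also links to C1 with a lower probability
edges = [("A1", "B1", 0.9), ("A1", "B2", 0.9), ("B2", "C1", 0.8)]
node_dataset = {"A1": "A", "B1": "B", "B2": "B", "C1": "C"}
clusters = single_best_links_sketch(edges, node_dataset, ["A", "B", "C"], 0.5)
# A1 and B1 end up together; B2 and C1 form a second cluster
```

In this toy example the tie is resolved in favour of A1-B1 because of the id ordering. A1-B2 is then rejected since it would put two records from dataset B in one cluster, and B2 is free to link to C1 in the next iteration.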