Add IDRT connector for duplicate contact detection #934
This PR adds a connector to Parsons for the IDRT (Identity Resolution Transformer) library. IDRT is an open-source library I wrote (https://github.com/Jason94/identity-resolution) that uses neural networks to match duplicate contacts in a database.
If any reviewers want to test the code, they can use these model files to run the example scripts in the documentation: models.zip
The connector does provide simple function calls for the two steps of the main algorithm that the library exposes. This algorithm is designed to run directly against a large dataset of contacts stored in a database (Redshift, BigQuery, etc.). It uses the database during several intermediate steps to reduce execution time.
The connector does not provide an easy way to quickly match against a Parsons Table containing contact data. The library is built to do matching at scale, so that is where the connector currently focuses. At some point I'd like to add a simpler function that takes a Parsons Table of contact data plus some basic configuration and searches for matches among the rows of the table. If reviewers think that is likely to be a common use case, I can add it to the PR before merging.
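To make the simpler table-level function described above more concrete, here is a minimal sketch of what a match search among the rows of a small table could look like. Everything in it is an assumption for illustration: `find_duplicates` and `string_similarity` are hypothetical names, the rows are a plain list of dicts standing in for table rows, and the scorer is a character-level string similarity from the standard library swapped in for an IDRT model. It is not the connector's API.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Contact = Dict[str, str]


def string_similarity(a: Contact, b: Contact, fields: List[str]) -> float:
    """Stand-in scorer (NOT the IDRT model): mean string similarity over fields."""
    ratios = [
        SequenceMatcher(None, a.get(f, "").lower(), b.get(f, "").lower()).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


def find_duplicates(
    rows: List[Contact],
    key: str,
    fields: List[str],
    threshold: float = 0.85,
    score: Callable[[Contact, Contact, List[str]], float] = string_similarity,
) -> List[Tuple[str, str, float]]:
    """Compare every pair of rows and return (key_a, key_b, score) for likely matches."""
    if not fields:
        raise ValueError("fields must contain at least one column name")
    matches = []
    for a, b in combinations(rows, 2):
        s = score(a, b, fields)
        if s >= threshold:
            matches.append((a[key], b[key], s))
    return matches


# Hypothetical contact rows; a real caller would pass rows from its own table.
rows = [
    {"id": "1", "first_name": "Jason", "last_name": "Walker", "email": "jason@example.com"},
    {"id": "2", "first_name": "Jason", "last_name": "Walker", "email": "jason@example.com"},
    {"id": "3", "first_name": "Maria", "last_name": "Lopez", "email": "mlopez@example.org"},
]

duplicates = find_duplicates(rows, key="id", fields=["first_name", "last_name", "email"])
print(duplicates)
```

The pairwise loop compares every row with every other row, so the work grows quadratically with the number of contacts. That is acceptable for a small table but not for a large contact database, which is the case the database-backed algorithm targets.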
The connector also does not provide any way to train the neural networks. Training is a much more advanced task than using an existing model, and I didn't see how adding anything to Parsons would make it easier.
Notes for reviewers:
- The PR modifies the `setup.py` file to allow truly optional dependencies that must be explicitly installed, in this case by running `pip install parsons[idr]`.
- The connector takes a `DatabaseConnector` to use with the algorithm. It checks that the `upsert` function is defined on the database object, which currently isn't standard in the `DatabaseConnector` interface. It should currently work for Redshift and BigQuery.
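
The two reviewer notes can be illustrated with a short sketch. It assumes a common pattern for each: importing an optional dependency lazily and pointing the user at the extra when it is missing, and checking for a callable `upsert` attribute before doing any work. The helper names `require_extra` and `ensure_upsert` and the two dummy classes are hypothetical and are not taken from the PR.

```python
import importlib


def require_extra(module_name: str, extra: str):
    """Import an optional dependency, or fail with a hint about the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is not installed. "
            f"Install it with: pip install parsons[{extra}]"
        ) from exc


def ensure_upsert(db) -> None:
    """Raise early if the database object does not expose a callable upsert()."""
    if not callable(getattr(db, "upsert", None)):
        raise TypeError(
            f"{type(db).__name__} does not define upsert(), which is required here"
        )


class WithUpsert:
    """Dummy stand-in for a database connector that supports upsert."""

    def upsert(self, table_obj, target_table, primary_key):
        pass


class WithoutUpsert:
    """Dummy stand-in for a database connector without upsert."""


ensure_upsert(WithUpsert())  # passes silently

try:
    ensure_upsert(WithoutUpsert())
except TypeError as err:
    print(err)
```

Failing at construction or call time with a clear message keeps users from discovering a missing `upsert` or a missing optional package partway through a long-running match job.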