Add IDRT connector for duplicate contact detection #934

Open
wants to merge 1 commit into
base: main

Jason94
Collaborator

@Jason94 Jason94 commented Nov 29, 2023

This PR adds a Parsons connector for the IDRT (Identity Resolution Transformer) library. IDRT is an open-source library that I wrote (https://github.com/Jason94/identity-resolution) that uses neural networks to match duplicate contacts in a database.

If any reviewers want to test the code, they can use these model files to run the example scripts in the documentation: models.zip

The connector provides simple function calls for the two steps of the main algorithm that the library exposes. This algorithm is designed to run directly against a large dataset of contacts stored in a database (Redshift, BigQuery, etc.). It uses the database during several intermediate steps to reduce execution time.

The connector does not provide an easy way to quickly match against a Parsons Table containing contact data. The library is built for matching at scale, so that is where the connector currently focuses. At some point I'd like to add a simpler function that takes a Parsons Table of contact data and some basic configuration, and searches for matches among the rows of that table. If reviewers think that is likely to be a common use case, I can add it to the PR before merging.

The connector also does not provide any way to train the neural networks. Training is a much more advanced task than using an existing model, and I didn't see how adding anything to Parsons would make it any easier.

Notes for reviewers:

  1. The IDRT library pulls in some pretty hefty deep learning libraries as dependencies. I really did not want to add those as default dependencies to Parsons. I noticed that anything listed as a Parsons "extra" still gets installed by default unless the limited dependencies option is turned on. I modified the Parsons dependency mechanism in the setup.py file to allow truly optional dependencies that must be explicitly installed, in this case by running pip install parsons[idr].
  2. The code is pretty lightweight, all things considered. It's mostly wrapping the calls to the library function in our standard environment variable conventions and providing documentation. The library uses PETL, so it's easy to convert to and from a Parsons Table.
  3. The connector includes an adapter to use any Parsons DatabaseConnector with the algorithm. The adapter checks that the database object defines an upsert function, since upsert is not currently part of the standard DatabaseConnector interface. It should currently work with Redshift and BigQuery.
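
To illustrate the idea in note 1, here is a minimal sketch of how a setup.py could separate regular extras from truly optional ones. It is not the code in this PR: the function name `build_requirements`, the group names, and all package names are hypothetical placeholders, and the real Parsons setup.py may be organized differently.

```python
# Hypothetical dependency groups, for illustration only.
CORE = ["core-lib"]
EXTRAS = {
    "redshift": ["db-driver-a"],
    "bigquery": ["db-driver-b"],
}
# Heavy groups that should never be installed unless explicitly requested.
OPTIONAL_ONLY = {"idr": ["heavy-ml-lib"]}


def build_requirements(limited):
    """Return (install_requires, extras_require) for a setup() call."""
    extras_require = {name: list(pkgs) for name, pkgs in EXTRAS.items()}
    extras_require["all"] = sorted({p for pkgs in EXTRAS.values() for p in pkgs})

    if limited:
        # Limited mode: only the core packages are installed by default.
        install_requires = list(CORE)
    else:
        # Default mode: regular extras are installed along with the core.
        install_requires = sorted(set(CORE) | set(extras_require["all"]))

    # Optional-only groups are added after "all" and install_requires are
    # computed, so neither of them pulls in the heavy packages.
    extras_require.update({name: list(pkgs) for name, pkgs in OPTIONAL_ONLY.items()})
    return install_requires, extras_require
```

The two return values would be passed to `setup(install_requires=..., extras_require=...)`. Because the optional-only group is merged in last, it is reachable only through an explicit extra such as pip install parsons[idr].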

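The upsert check described in note 3 could look roughly like the sketch below. The class name `UpsertCheckingAdapter` and the two stand-in database classes are hypothetical and exist only to make the example runnable; the adapter in this PR may be structured differently.

```python
class UpsertCheckingAdapter:
    """Hypothetical wrapper that refuses databases without an upsert method."""

    def __init__(self, db):
        # upsert is not guaranteed by the connector interface, so verify it
        # up front instead of failing partway through a long-running job.
        if not callable(getattr(db, "upsert", None)):
            raise TypeError(
                f"{type(db).__name__} does not define upsert(), "
                "which the matching algorithm requires"
            )
        self._db = db

    def upsert(self, *args, **kwargs):
        # Delegate to the wrapped database object.
        return self._db.upsert(*args, **kwargs)


class FakeDbWithUpsert:
    """Stand-in for a connector that supports upsert."""

    def upsert(self, table_name, rows):
        return (table_name, len(rows))


class FakeDbWithoutUpsert:
    """Stand-in for a connector that lacks upsert."""
```

Wrapping `FakeDbWithUpsert()` succeeds and forwards calls, while wrapping `FakeDbWithoutUpsert()` raises a `TypeError` at construction time, before any work has started.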