Matcher improvements / regression-v3 #163

Open
pudo opened this issue Jul 25, 2024 · 0 comments

pudo commented Jul 25, 2024

Overview

  • We probably want to retire regression-v2 relatively quickly: it has never been used much and doesn't produce great results.
  • Addresses are an area where we are really not doing well. I don't even know what shape a plan would take there: serious address parsing/normalisation would require something like libpostal, but that's a prohibitively large dependency to adopt. Perhaps we can get some marginal gains from partial matching based on token alignment (see the name alignment notes below).

Back-porting things from logic-v1

We've done a lot of work on logic-v1 that isn't yet being used by the regression matchers, in particular around identifiers and name matching. But there's a caveat: some of the logic-v1 matchers are now "asymmetric": they treat the query and the match candidate differently, so using them for dedupe doesn't make sense.

  • We've got a lot of "name alignment" in the logic-v1 name matchers. This means that when comparing "smith, john" and "john smith", the tokens are first re-ordered using pairwise string distance between them, and an overall string distance is then computed on the aligned names (cf. nomenklatura.matching.compare.names:_align_name_parts).
  • We have a bunch of matchers for specific, strongly-typed identifiers (INNs, OGRNs, SWIFT BICs, ISINs, etc.) in nomenklatura.matching.compare.identifiers. It would be fun to see whether these work as regression features.
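
To make the name alignment idea concrete, here is a minimal sketch of it. It is not the code in _align_name_parts: the helper names (tokenize, align_tokens, aligned_name_score) are made up for illustration, the pairing is a simple greedy one, and difflib's SequenceMatcher stands in for whatever string distance the real matcher uses.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(left: str, right: str) -> float:
    """Similarity ratio in [0, 1]; a stand-in for a real string distance."""
    return SequenceMatcher(None, left, right).ratio()


def tokenize(name: str) -> List[str]:
    """Lowercase the name and split on anything that is not alphanumeric."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return cleaned.split()


def align_tokens(left: List[str], right: List[str]) -> Tuple[List[str], List[str]]:
    """Greedily pair the most similar tokens, then append unmatched leftovers."""
    # Score every (left token, right token) pair, best pairs first.
    pairs = sorted(
        ((similarity(l, r), i, j) for i, l in enumerate(left) for j, r in enumerate(right)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    used_left, used_right, matched = set(), set(), []
    for _score, i, j in pairs:
        if i in used_left or j in used_right:
            continue  # each token can be paired at most once
        used_left.add(i)
        used_right.add(j)
        matched.append((i, j))
    matched.sort()  # keep the left name's original token order
    left_out = [left[i] for i, _ in matched]
    right_out = [right[j] for _, j in matched]
    # Tokens without a partner go to the end of their own side.
    left_out += [tok for i, tok in enumerate(left) if i not in used_left]
    right_out += [tok for j, tok in enumerate(right) if j not in used_right]
    return left_out, right_out


def aligned_name_score(left: str, right: str) -> float:
    """Overall similarity computed on the aligned (re-ordered) names."""
    left_tokens, right_tokens = align_tokens(tokenize(left), tokenize(right))
    return similarity(" ".join(left_tokens), " ".join(right_tokens))
```

With this sketch, aligned_name_score("smith, john", "john smith") treats the two names as identical, while a plain string comparison of the unaligned forms would not. The function is symmetric in spirit (neither side is treated as "the query"), which is the property the dedupe use case needs.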

Specific stuff for de-dupe

  • Regression features seem to work best when they're fully independent. So we have to be careful with things like having features both for INNs and for identifiers in general, which react to the same signal in the data on some level. I also think that overlap in "signal" is what makes our DOB matching so bad.
  • We should test whether having a countries_overlap feature as well as a countries_disjoint feature is a good idea.
  • Birth date/start date matching in reg-v1 is deeply broken. We probably want both negative and positive features here, too, each at year-only and day precision.
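
As a rough illustration of the paired positive/negative features discussed above, here is a sketch. Only countries_overlap and countries_disjoint are names from this issue; the dob_* function names and helpers are hypothetical, and the code assumes dates arrive as ISO-style strings that may be truncated to a year ("1970") or carry full day precision ("1970-03-12").

```python
from typing import Set


def _overlap(left: Set[str], right: Set[str]) -> float:
    """Positive signal: 1.0 if the two sides share at least one value."""
    return 1.0 if left & right else 0.0


def _disjoint(left: Set[str], right: Set[str]) -> float:
    """Negative signal: 1.0 only if both sides have values and none are shared."""
    if not left or not right:
        return 0.0  # missing data is not evidence of a mismatch
    return 0.0 if left & right else 1.0


def countries_overlap(left: Set[str], right: Set[str]) -> float:
    return _overlap(left, right)


def countries_disjoint(left: Set[str], right: Set[str]) -> float:
    return _disjoint(left, right)


def _years(dates: Set[str]) -> Set[str]:
    """Year component of every date that has at least a year."""
    return {d[:4] for d in dates if len(d) >= 4}


def _days(dates: Set[str]) -> Set[str]:
    """Only dates with full day precision (YYYY-MM-DD)."""
    return {d[:10] for d in dates if len(d) >= 10}


def dob_year_match(left: Set[str], right: Set[str]) -> float:
    return _overlap(_years(left), _years(right))


def dob_year_mismatch(left: Set[str], right: Set[str]) -> float:
    return _disjoint(_years(left), _years(right))


def dob_day_match(left: Set[str], right: Set[str]) -> float:
    return _overlap(_days(left), _days(right))


def dob_day_mismatch(left: Set[str], right: Set[str]) -> float:
    return _disjoint(_days(left), _days(right))
```

The point of the split is that "no data" and "conflicting data" produce different feature vectors: both features are 0.0 when one side is empty, while a real conflict sets only the negative one. Note that a day-level match always implies a year-level match, so the two positive date features are correlated; whether that hurts the regression is exactly the independence question raised above.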

Final thoughts

This all needs to be fast. lol.
