Matcher improvements / regression-v3 #163

pudo · 2024-07-25T15:31:03Z

Overview

We probably want to retire regression-v2 relatively quickly: it's never been used much and doesn't produce amazing results.
Addresses are an area where we are really not doing well. I don't even know what shape a plan would take there: serious address parsing/normalisation would require something like libpostal, but that's a prohibitively large dependency to adopt. Perhaps we can get some marginal gains from partial matching based on token alignment (see below).

Back-porting things from logic-v1

We've done a lot of work on logic-v1 that isn't yet being employed by the regression matchers, in particular around identifiers and name matching. But there's a caveat: some of the logic-v1 matchers are now "asymmetric", meaning they handle the query and match candidate arguments differently, so using them for dedupe doesn't make sense.

We've got a lot of "name alignment" in the logic-v1 name matchers. This means that when comparing "smith, john" and "john smith", the tokens are first re-ordered using pairwise string distance between them, and then an overall string distance is computed on the aligned names (cf. nomenklatura.matching.compare.names:_align_name_parts).
We have a bunch of matchers for specific, strongly-typed identifiers (like INNs, OGRNs, SWIFT BICs, ISINs etc.) in nomenklatura.matching.compare.identifiers. It would be fun to see if these want to be regression features.
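
To make the name alignment idea concrete, here is a minimal sketch, not the actual `_align_name_parts` implementation. It assumes a greedy best-pair-first alignment and uses `difflib.SequenceMatcher.ratio()` as a stand-in similarity measure instead of a true edit distance. The function names `tokenize`, `align_tokens` and `aligned_name_score` are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def tokenize(name: str) -> List[str]:
    """Lower-case and split on non-word characters (hypothetical normaliser)."""
    return [t for t in re.split(r"\W+", name.lower()) if t]


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not a real edit distance."""
    return SequenceMatcher(None, a, b).ratio()


def align_tokens(query: List[str], candidate: List[str]) -> Tuple[List[str], List[str]]:
    """Greedily pair the most similar tokens; unpaired tokens are appended last."""
    pairs = sorted(
        (
            (similarity(q, c), qi, ci)
            for qi, q in enumerate(query)
            for ci, c in enumerate(candidate)
        ),
        reverse=True,
    )
    used_q, used_c = set(), set()
    left: List[str] = []
    right: List[str] = []
    for _score, qi, ci in pairs:
        if qi in used_q or ci in used_c:
            continue
        used_q.add(qi)
        used_c.add(ci)
        left.append(query[qi])
        right.append(candidate[ci])
    left.extend(q for i, q in enumerate(query) if i not in used_q)
    right.extend(c for i, c in enumerate(candidate) if i not in used_c)
    return left, right


def aligned_name_score(a: str, b: str) -> float:
    """Overall similarity of the two names after token alignment."""
    left, right = align_tokens(tokenize(a), tokenize(b))
    return similarity(" ".join(left), " ".join(right))
```

With this sketch, `aligned_name_score("smith, john", "john smith")` scores the two names as identical, whereas comparing the raw token order would not. The same token alignment could in principle be tried on address strings, though that is untested here.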

Specific stuff for de-dupe

Regression features seem to be working best when they're fully independent. So we have to be careful with stuff like having features both for INNs and for identifiers in general, that sort of react to the same signal in the data on some level. I also think that overlap in "signal" is what makes our DOB matching be really crap.
We should test whether having both a countries_overlap and a countries_disjoint feature is a good idea.
Birthdates/start dates matching in reg-v1 is deeply broken. We probably want to have both negative and positive features here, too, each at year-only and at day precision.
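
As a rough illustration of the paired positive/negative features discussed above, the sketch below treats each feature as a function over sets of values. Everything here is an assumption for illustration: the helper names, the feature names in `date_features`, and the convention that dates arrive as ISO-prefix strings such as "1980" or "1980-05-17". None of this is taken from the existing regression matchers.

```python
from typing import Dict, Iterable, Set


def overlap(left: Set[str], right: Set[str]) -> float:
    """Positive feature: both sides have values and share at least one."""
    return 1.0 if left and right and not left.isdisjoint(right) else 0.0


def disjoint(left: Set[str], right: Set[str]) -> float:
    """Negative feature: both sides have values and share none."""
    return 1.0 if left and right and left.isdisjoint(right) else 0.0


def _years(dates: Iterable[str]) -> Set[str]:
    """Year prefix of every date that has at least a year."""
    return {d[:4] for d in dates if len(d) >= 4}


def _days(dates: Iterable[str]) -> Set[str]:
    """Only the dates given at full day precision (YYYY-MM-DD)."""
    return {d for d in dates if len(d) == 10}


def date_features(left: Iterable[str], right: Iterable[str]) -> Dict[str, float]:
    """Hypothetical year-level and day-level match/mismatch features."""
    left, right = list(left), list(right)
    ly, ry = _years(left), _years(right)
    ld, rd = _days(left), _days(right)
    return {
        "dob_year_match": overlap(ly, ry),
        "dob_year_mismatch": disjoint(ly, ry),
        "dob_day_match": overlap(ld, rd),
        "dob_day_mismatch": disjoint(ld, rd),
    }
```

The point of having two features per signal is that missing data leaves both at 0.0, so "no information" is distinguishable from "known mismatch". For countries, `overlap({"de"}, set())` and `disjoint({"de"}, set())` are both 0.0. Note that a day-level match implies a year-level match, so these four date features are not fully independent of each other.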

Final thoughts

This all needs to be fast. lol.

pudo added the enhancement label Jul 25, 2024

pudo assigned jbothma Jul 25, 2024
