Enormously incorrect m probability; how can I troubleshoot? #1434

jmacak-at-dl · 2023-07-12T19:41:56Z

My apologies for being new and inexperienced; I'm going to try to give as much relevant information as possible, but I think I'm going to have to rely on some back-and-forth from all of you to get to the heart of the issue I'm dealing with.

To begin, my task is to link two files with the following columns:

FirstName, MiddleInitial, LastName, Suffix,
Address, City, State, Zip,
Email, Phone, Dob

My issue seems to come from the parameter estimation phase. The LastName jaro_winkler_similarity >= 0.9 comparison level was given an extremely negative match weight (-610):

If I zoom out, you can see the scale of the estimation was huge:

The Last Name column doesn't seem to have a significant skew in the data (none of the other columns do, either):

So I'm not sure whether there's an issue with my code, with the blocking rules I chose, or with the underlying data. The last time I ran splink on this dataset, I used different blocking rules for training the m probabilities and ended up with a model that wasn't fully trained but also didn't have this error, so it seems like the cause might be my blocking rules (and, I guess, the data returned by those blocking rules). For what it's worth, that previous run did return 9M rows that seemed to be clustered together well.

Here's my code; the parameter estimation begins on line 93 (the "# Estimating model parameters" section). Is there other information I can provide that might help?

import duckdb
import pandas as pd
import altair as alt
import duckdb
from splink.duckdb.linker import DuckDBLinker

# import and setup
con = duckdb.connect(database='ailments_test', read_only=False)
con.execute("CREATE TABLE may23 AS SELECT * FROM './may.csv'")
con.execute("CREATE TABLE june23 AS SELECT * FROM './june.csv'")

match_cols = ['EMAIL','FIRST_NAME','LAST_NAME','ADDRESS','CITY','STATE','ZIP','PHONE','IP','DOB','SOURCE','ROW_ID','SUFFIX','MIDDLE_INITIAL']

query = "CREATE OR REPLACE TEMPORARY TABLE tMay AS SELECT {} FROM may23 USING SAMPLE 10%".format(", ".join(match_cols))
con.execute(query)

query = "CREATE OR REPLACE TEMPORARY TABLE tJune AS SELECT {} FROM june23 USING SAMPLE 10%".format(", ".join(match_cols))
con.execute(query)

settings = {
    'unique_id_column_name': 'ROW_ID',
    'link_type': 'link_only'
}

linker = DuckDBLinker(['tMay','tJune'], settings, connection=con)

# Exploratory analysis
linker.missingness_chart()
linker.profile_columns(match_cols)

# Blocking
def print_blocking_comparison_count(blocking_rule):
    count = linker.count_num_comparisons_from_blocking_rule(blocking_rule)
    print(f"Number of comparisons generated by '{blocking_rule}': {count:,.0f}")

blocking_first_name_address = 'l.FIRST_NAME = r.FIRST_NAME and l.ADDRESS = r.ADDRESS'
blocking_last_name_address = 'l.LAST_NAME = r.LAST_NAME and l.ADDRESS = r.ADDRESS'
blocking_email = 'l.EMAIL = r.EMAIL'
blocking_phone = 'l.PHONE = r.PHONE'
blocking_address = 'l.ADDRESS = r.ADDRESS'

blocking_rule_list = [
    blocking_first_name_address,
    blocking_last_name_address,
    blocking_email,
    blocking_phone,
    blocking_address
]

for rule in blocking_rule_list:
    print_blocking_comparison_count(rule)

linker.cumulative_num_comparisons_from_blocking_rules_chart(blocking_rule_list)

# Defining comparisons
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl

first_name_comparison = ctl.name_comparison('FIRST_NAME')
middle_initial_comparison = cl.exact_match('MIDDLE_INITIAL')
last_name_comparison = ctl.name_comparison('LAST_NAME')
name_combo_comparison = ctl.forename_surname_comparison('FIRST_NAME','LAST_NAME')
address_comparison = cl.jaro_winkler_at_thresholds('ADDRESS')
phone_comparison = cl.jaro_winkler_at_thresholds('PHONE')
email_comparison = ctl.email_comparison('EMAIL')
dob_comparison = ctl.date_comparison('DOB',
                                     cast_strings_to_date=True,
                                     date_format="%Y/%m/%d",
                                     invalid_dates_as_null=True)

comparison_list = [
    first_name_comparison,
    last_name_comparison,
    phone_comparison,
    dob_comparison,
    address_comparison,
    email_comparison,
    name_combo_comparison,
    middle_initial_comparison
]

settings = {
    'unique_id_column_name': 'ROW_ID',
    'link_type': 'link_only',
    'comparisons': comparison_list,
    'blocking_rules_to_generate_predictions': blocking_rule_list,
    'retain_matching_columns': True,
    'retain_intermediate_calculation_columns': True
}

linker = DuckDBLinker(['tMay','tJune'], settings, connection=con)

# Estimating model parameters
deterministic_rules = [
    'l.FIRST_NAME = r.FIRST_NAME and l.ADDRESS = r.ADDRESS',
    'l.LAST_NAME = r.LAST_NAME and l.ADDRESS = r.ADDRESS',
    'l.FIRST_NAME = r.FIRST_NAME and l.LAST_NAME = r.LAST_NAME',
    'l.email = r.email'
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7) # recall = What percentage of true matches fit the rules above?
linker.estimate_u_using_random_sampling(max_pairs=5e6)

training_blocking_first_name_last_name = ('l.FIRST_NAME = r.FIRST_NAME and l.LAST_NAME = r.LAST_NAME')
training_session_first_name_last_name = linker.estimate_parameters_using_expectation_maximisation(training_blocking_first_name_last_name)

training_blocking_phone = 'l.PHONE = r.PHONE'
training_session_phone = linker.estimate_parameters_using_expectation_maximisation(training_blocking_phone)

training_blocking_first_name_address = 'l.FIRST_NAME = r.FIRST_NAME and l.ADDRESS = r.ADDRESS'
training_session_first_name_address = linker.estimate_parameters_using_expectation_maximisation(training_blocking_first_name_address)

linker.match_weights_chart()
linker.m_u_parameters_chart()

Answered by NickCrews

NickCrews · 2023-07-14T18:34:05Z

I'm looking at the chart you posted, and it looks to me like the number of record comparisons that fall into the LastName jaro_winkler_similarity >= 0.9 level is very small, both for true matches and for non-matches. With so few comparisons, a handful of outliers can dominate the estimate. For example, if only one comparison among the true matches lands in this level but 10 comparisons among the non-matches do, the model concludes that seeing this level is strongly indicative of a non-match. If you had more records, this ratio might not be as skewed.
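A minimal sketch of that effect, assuming the usual definition of a match weight as log2(m / u), where m is the share of true-match comparisons in a level and u is the share of non-match comparisons in it. The totals of 1,000 comparisons per class and the collapsed m value are hypothetical numbers chosen only to mirror the example above; they are not taken from the dataset in this thread.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight of a comparison level: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


# Hypothetical counts: 1 of 1,000 true-match comparisons and
# 10 of 1,000 non-match comparisons fall into the sparse level.
u_sparse = 10 / 1_000
weight_sparse = match_weight(1 / 1_000, u_sparse)

# A single extra true-match comparison in the level shifts the weight
# by a full unit, which shows how fragile the estimate is.
weight_one_more = match_weight(2 / 1_000, u_sparse)

# If training drives m toward zero, the weight diverges to a huge
# negative number (hypothetical m, for illustration only).
weight_collapsed = match_weight(1e-185, u_sparse)

print(round(weight_sparse, 2), round(weight_one_more, 2), round(weight_collapsed, 1))
```

The point of the sketch is only that a level with very few comparisons has a weight that swings widely with each individual comparison, and that a weight in the hundreds implies an m or u estimate that has collapsed to almost zero.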

Take a look at the table of comparisons that comes out of .predict(). Filter it so that you only look at comparisons where, e.g., gamma_last_name == 3 (use the right column name for your data, and the level might not be 3). How many comparisons are there? Do they look sane?
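One way to run that check, sketched in plain Python on a toy stand-in for the prediction table. The column names `gamma_LAST_NAME`, `LAST_NAME_l`, and `LAST_NAME_r`, the level value 2, and the toy rows are assumptions for illustration. With Splink 3 the real rows can probably be obtained with `linker.predict().as_record_dict()`, but that call should be checked against the installed version.

```python
from collections import Counter


def count_by_level(rows, gamma_col):
    """Count pairwise comparisons per value of a gamma (comparison level) column."""
    return Counter(row[gamma_col] for row in rows)


def rows_at_level(rows, gamma_col, level):
    """Return only the comparisons that fall into the given level."""
    return [row for row in rows if row[gamma_col] == level]


# Toy stand-in for the prediction output (hypothetical rows).
# Assumption: real rows might come from linker.predict().as_record_dict().
rows = [
    {"LAST_NAME_l": "Maguire", "LAST_NAME_r": "Maguire-King", "gamma_LAST_NAME": 2},
    {"LAST_NAME_l": "Lamb", "LAST_NAME_r": "Lambert", "gamma_LAST_NAME": 2},
    {"LAST_NAME_l": "Chatters", "LAST_NAME_r": "Chavers", "gamma_LAST_NAME": 2},
    {"LAST_NAME_l": "Smith", "LAST_NAME_r": "Smith", "gamma_LAST_NAME": 4},
    {"LAST_NAME_l": "Smith", "LAST_NAME_r": "Jones", "gamma_LAST_NAME": 0},
]

level_counts = count_by_level(rows, "gamma_LAST_NAME")
suspect_rows = rows_at_level(rows, "gamma_LAST_NAME", 2)

# How many comparisons sit in each level, and what do the sparse ones look like?
for level, count in sorted(level_counts.items()):
    print(f"gamma_LAST_NAME == {level}: {count} comparisons")
for row in suspect_rows:
    print(row["LAST_NAME_l"], "vs", row["LAST_NAME_r"])
```

A level that holds only a few hundred comparisons out of millions is a candidate for the small-sample problem described above.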

In general, if the number of records in each level is too small, I try to adjust the specificity of each rule so that the comparisons are more evenly distributed between levels. For instance, I might just remove the LastName jaro_winkler_similarity >= 0.9 level, and its comparisons will then get absorbed into the LastName jaro_winkler_similarity >= 0.8 level. I guess this habit isn't really based on hard evidence, but my intuition is that it gives the model more discriminating power. With "super-levels" like your "all other comparisons", which seem to match both a large portion of matches and a large portion of non-matches, the model isn't really able to draw much of a conclusion from seeing that level. You want to make each condition more targeted, so that non-matches fall in one level and matches fall in another, and the model can say "oh, when I see this level, it actually is good evidence this is a match".
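A rough sketch of what absorbing one level into another does to the numbers, assuming match weights are log2(m / u). The (m, u) values per level are hypothetical, and `merge_levels` is an illustrative helper, not a Splink function. In Splink itself the change would be made by adjusting the thresholds of the comparison, not by editing trained parameters.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight of a comparison level: log2 of the Bayes factor m / u."""
    return math.log2(m / u)


def merge_levels(levels, sparse, target):
    """Pool the sparse level into the target level by summing their m and u."""
    merged = {name: mu for name, mu in levels.items() if name != sparse}
    m_sparse, u_sparse = levels[sparse]
    m_target, u_target = levels[target]
    merged[target] = (m_target + m_sparse, u_target + u_sparse)
    return merged


# Hypothetical (m, u) probabilities per level; each column sums to 1.
levels = {
    "exact": (0.90, 0.01),
    "jw>=0.9": (0.001, 0.002),  # sparse level with an unstable estimate
    "jw>=0.8": (0.049, 0.008),
    "else": (0.05, 0.98),
}

merged = merge_levels(levels, sparse="jw>=0.9", target="jw>=0.8")

for name, (m, u) in merged.items():
    print(f"{name}: m={m:.3f} u={u:.3f} weight={match_weight(m, u):.2f}")
```

After the merge, the pooled level rests on more comparisons, so its weight is less sensitive to a few unusual pairs.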

2 replies

RobinL Jul 14, 2023
Maintainer

One additional point is that you shouldn't have both the full name comparison and the first name and last name comparison, as that will double count the information in the name. It could also interfere with parameter training, although I suspect the reason for the extreme values is the small sample issue that Nick describes above.

In general, I would recommend starting with a less complex model (fewer comparison levels) and building it up.
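A small sketch of why double counting matters, assuming the model adds the partial match weights of all comparisons to a prior weight and converts the total to a probability as 2^w / (1 + 2^w). All weights below are hypothetical.

```python
def match_probability(prior_weight, partial_weights):
    """Convert a prior weight plus partial match weights into a probability."""
    total_weight = prior_weight + sum(partial_weights)
    bayes_factor = 2 ** total_weight
    return bayes_factor / (1 + bayes_factor)


# Hypothetical weights for a pair that agrees on first and last name only.
prior_weight = -10.0
first_name_weight = 4.0
last_name_weight = 5.0
# A combined name comparison carries the same evidence a second time.
name_combo_weight = first_name_weight + last_name_weight

p_counted_once = match_probability(prior_weight, [first_name_weight, last_name_weight])
p_counted_twice = match_probability(
    prior_weight, [first_name_weight, last_name_weight, name_combo_weight]
)

print(f"name evidence counted once:  {p_counted_once:.3f}")
print(f"name evidence counted twice: {p_counted_twice:.3f}")
```

With the same underlying agreement, the second configuration scores the pair far higher, because the comparisons are treated as independent pieces of evidence even though they repeat the same information.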

jmacak-at-dl Jul 17, 2023
Author

Thank you for your responses! The small sample size sounds right, and as I'm running this script on different sets of data, I can see that sometimes it appears, and sometimes it doesn't. I'd expect that to be the case since I'm using random sampling to create these datasets.

I don't see anything strange about the records where gamma_LAST_NAME == 2, which is the Jaro-Winkler >= 0.9 level. It's what you would expect: names land in this level because of hyphenation (Maguire vs Maguire-King), substrings (Lamb vs Lambert), or general similarity (Chatters vs Chavers). There are 228 records in the table, and they seem to be pretty evenly split between looks-like-a-match and definitely-not-a-match.

But I'm satisfied enough to say "it's a small sample issue" and move on from there, with the advice the two of you provided. Thank you!
