
Enormously incorrect m probability; how can I troubleshoot? #1434

Answered by NickCrews
jmacak-at-dl asked this question in Q&A
I'm looking at

and it looks to me like the number of records that fall into the LastName jaro_winkler_similarity >= 0.9 level is very small, both among true matches and among non-matches. With so few observations, the estimate is very sensitive to outliers. For example, if only one record among the true matches falls into this level but 10 records among the non-matches do, the model concludes that seeing this level is strongly indicative of a non-match. If you had more records, this ratio might not be as skewed.
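To make the small-sample effect concrete, here is a minimal sketch of the arithmetic. The totals of 1,000 matches and 1,000 non-matches are made-up numbers for illustration only, not figures from this model, and bayes_factor is a hypothetical helper rather than part of any library. It treats m as the share of true matches that land in the level and u as the share of non-matches that do, with the match weight taken as log2(m / u).

```python
import math


def bayes_factor(matches_at_level, total_matches, non_matches_at_level, total_non_matches):
    """Return m / u for one comparison level, estimated from raw counts."""
    m = matches_at_level / total_matches          # P(level | true match)
    u = non_matches_at_level / total_non_matches  # P(level | non-match)
    return m / u


# Hypothetical totals (assumption): 1,000 true matches and 1,000 non-matches.
# Counts at the level follow the example above: 1 match, 10 non-matches.
skewed = bayes_factor(1, 1000, 10, 1000)

# Same level, but with just two more true matches falling into it.
nudged = bayes_factor(3, 1000, 10, 1000)

print(f"1 match at level:   Bayes factor {skewed:.2f}, match weight {math.log2(skewed):.2f}")
print(f"3 matches at level: Bayes factor {nudged:.2f}, match weight {math.log2(nudged):.2f}")
```

With counts this small, two extra records are enough to triple the Bayes factor, which is why a level with only a handful of observations can end up with a surprising m probability.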

Take a look at the table of comparisons that comes out of .predict(). Filter it so that you only see comparisons where, e.g., gamma_last_name == 3 (except use the right column name, and it might not be …
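The filtering step could be sketched roughly as follows. This uses a small hand-written list of rows as a stand-in for the real .predict() output, so every column name and value here (gamma_last_name, last_name_l, last_name_r, match_probability, and the level number 3) is an assumption to be replaced with whatever your own table contains; rows_at_level is a hypothetical helper.

```python
# Toy stand-in for the pairwise comparison table that comes out of .predict().
# All column names and values are hypothetical; check your own output for the real ones.
comparisons = [
    {"last_name_l": "Smith", "last_name_r": "Smyth", "gamma_last_name": 3, "match_probability": 0.02},
    {"last_name_l": "Jones", "last_name_r": "Jones", "gamma_last_name": 4, "match_probability": 0.97},
    {"last_name_l": "Brown", "last_name_r": "Browne", "gamma_last_name": 3, "match_probability": 0.01},
    {"last_name_l": "Clark", "last_name_r": "Lewis", "gamma_last_name": 0, "match_probability": 0.00},
]


def rows_at_level(rows, gamma_column, level):
    """Keep only the comparisons that fell into the given comparison level."""
    return [row for row in rows if row[gamma_column] == level]


# Inspect every pair that landed in the level under suspicion.
at_level = rows_at_level(comparisons, "gamma_last_name", 3)

print(f"{len(at_level)} comparisons at this level")
for row in at_level:
    print(row["last_name_l"], "|", row["last_name_r"], "|", row["match_probability"])
```

If the predictions are held in a pandas DataFrame instead, the same filter is a boolean mask such as df[df["gamma_last_name"] == 3]. Reading through the surviving pairs by eye should show how many there are and whether they look like genuine matches.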

Replies: 1 comment 2 replies

@RobinL

@jmacak-at-dl

Answer selected by jmacak-at-dl
3 participants