"strict" blocking #2507

noah-dbc · 2024-11-12T13:07:40Z

I am new to this framework and excited to get started. I have followed the tutorial and have a question. Apologies if it has already been asked and answered; I was unable to find it.
In our data, we have a field where we know that if two records have different values for that field, they should not be marked as duplicates. If they have the same value, however, that alone is not sufficient to conclude that they are duplicates. Is there a way to express that in blocking rules, or should it be done another way?

Kind regards,
/Noah

RobinL · 2024-11-12T14:13:07Z

Maintainer

This would typically be done with match weights. You could fix them for that particular field, see:
#2379

Basically, you want a very strong negative match weight for the 'does not match' case.
Note that when you fix the values, you don't need to worry about them summing to one.

So in your case:

import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000


settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.CustomComparison(
            comparison_levels=[
                cll.NullLevel("email"),
                cll.ExactMatchLevel("email").configure(
                    # Fixing the values for the exact-match level is optional
                    m_probability=0.99,
                    u_probability=0.01,
                    fix_m_probability=True,
                    fix_u_probability=True,
                ),
                cll.ElseLevel().configure(
                    # These values need to be fixed: they give the strong
                    # negative match weight. Move m_probability even closer
                    # to 0 for an even stronger negative match weight.
                    m_probability=1e-7,
                    u_probability=1.00,
                    fix_m_probability=True,
                    fix_u_probability=True,
                ),
            ],
            output_column_name="email",
        ),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    max_iterations=2,
)

linker = Linker(df, settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))

linker.visualisations.match_weights_chart()
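
To get a feel for how strong the fixed ElseLevel weight is, here is a rough standalone sketch (plain Python, no Splink calls). It assumes the usual Fellegi-Sunter arithmetic: a level's match weight is log2(m/u), and the weights are added to the prior log-odds to get a match probability. The helper names, the prior of 0.01, and the weights of 8, 8 and 6 for the other comparisons are made up for illustration, and term frequency adjustments are ignored.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight of a comparison level: log2 of the Bayes factor m/u."""
    return math.log2(m / u)


def match_probability(prior: float, weights: list[float]) -> float:
    """Add match weights to the prior log2 odds and convert to a probability."""
    total_weight = math.log2(prior / (1 - prior)) + sum(weights)
    bayes_factor = 2**total_weight
    return bayes_factor / (1 + bayes_factor)


# Values taken from the settings above
else_weight = match_weight(1e-7, 1.0)     # email differs: about -23
exact_weight = match_weight(0.99, 0.01)   # email matches: about +6.6

# Hypothetical weights for first_name, surname and dob all agreeing
other_weights = [8.0, 8.0, 6.0]
prior = 0.01  # hypothetical probability that two random records match

p_without_veto = match_probability(prior, other_weights)
p_with_veto = match_probability(prior, other_weights + [else_weight])

print(f"ElseLevel weight:        {else_weight:.2f}")
print(f"ExactMatchLevel weight:  {exact_weight:.2f}")
print(f"p(match), email ignored: {p_without_veto:.5f}")
print(f"p(match), email differs: {p_with_veto:.5f}")
```

With these illustrative numbers, a differing email outweighs agreement on all three other fields, which is the behaviour asked for. A matching email only adds a moderate positive weight, so it is not sufficient on its own. If the other comparisons can add up to more than about 23 in your model, lower m_probability further.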
