Improve compare two records #2498

RobinL · 2024-11-07T14:50:20Z

This PR:

- Substantially improves the performance of compare_two_records
- Adds new features:

Example code demoing new features here

I have left the realtime module private for now, i.e. it's essentially 'beta functionality'. I think we should start using this ourselves internally for a bit to make sure we're happy with it before putting it in the public API.

- When using compare_two_records, you previously had to precompute term frequency tables. You can now pass term frequency adjustments explicitly in the input data, e.g. a column called tf_city with a hardcoded value.
- When using compare_two_records, you previously could ONLY pass two records as dictionaries. You can now pass typed dataframes, e.g. a DuckDBPyRelation if it's a DuckDB linker. This fixes a bug whereby it was previously not possible to run compare_two_records with complex data types, e.g. a date type column.
- It's now possible to pass more than one record. If the user provides several records, scores for the cartesian product are computed.
- compare_two_records previously did not produce a found_by_blocking_rules column. This is now (optionally) output.
- compare_two_records previously was slow for two reasons:
  - It ran an excessive amount of code, including blocking code that is unnecessary.
  - The time taken to generate the SQL was longer than the time taken to execute it.
- I've now included beta functionality for running compare_two_records without a linker, which will provisionally be at splink.internals.realtime.compare_records. This 'flavour' of the function is deliberately designed for high performance (e.g. Core Person Record). It assumes (i.e. does not double check) that all data is present in the input records, INCLUDING term frequency adjustments. It memoizes generated SQL code so that it's extremely fast (0.001 seconds for a pairwise prediction, which is at least 10x faster; previously most of the time taken was generating code).
- When these functions are called, the retain_intermediate_calculation_columns and retain_matching_columns settings are always set to True irrespective of the settings in the main settings object. This means users can e.g. run predict using efficient settings, but generate a handful of waterfalls using this function.
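To illustrate the memoization idea, here is a minimal sketch of caching generated SQL so that repeated calls skip the code generation step. This is not the implementation in this PR: `generate_comparison_sql`, the table aliases and the column names are hypothetical, and the real SQL is far more involved. It only shows the general pattern of keying a cache on the inputs that determine the SQL.

```python
from functools import lru_cache

# Counter used only to show how often the (expensive) generation step runs.
sql_generation_calls = 0


@lru_cache(maxsize=None)
def generate_comparison_sql(comparison_columns):
    """Hypothetical stand-in for SQL generation.

    comparison_columns must be hashable (a tuple) so it can act as a cache key.
    """
    global sql_generation_calls
    sql_generation_calls += 1
    select_exprs = ", ".join(
        f"l.{col} = r.{col} AS match_{col}" for col in comparison_columns
    )
    return (
        f"SELECT {select_exprs} "
        "FROM records_left AS l CROSS JOIN records_right AS r"
    )


# The first call generates the SQL; the second is served from the cache.
sql_first = generate_comparison_sql(("first_name", "city"))
sql_second = generate_comparison_sql(("first_name", "city"))
```

Because the SQL depends only on the settings and not on the record values, the same string can be reused for every subsequent pair of records.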

Supersedes #2426

RobinL · 2024-11-12T09:03:12Z

splink/internals/linker_components/inference.py

            df_records_right,
            pipeline,
            in_tablename="__splink__compare_two_records_right_with_tf",
            out_tablename="__splink__compare_two_records_right_with_tf_uid_fix",
            uid_str="_right",
        )

-        sqls = block_using_rules_sqls(


If we're using cartesian blocking, we don't need to run any complex blocking code.

In addition, this code creates and materialises a list of pairwise IDs, which it uses for the join. This is unnecessary in the context of a handful of records.
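As a rough illustration of why a plain cross join is enough here, the sketch below uses the standard library's sqlite3 rather than Splink's actual SQL. The table names, column names and the `cartesian_pairs` helper are made up for the example. Every left record is paired with every right record directly, with no intermediate table of ID pairs.

```python
import sqlite3


def cartesian_pairs(left_records, right_records):
    """Compare every left record with every right record via a CROSS JOIN."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE records_left (unique_id INTEGER, first_name TEXT)")
    con.execute("CREATE TABLE records_right (unique_id INTEGER, first_name TEXT)")
    con.executemany("INSERT INTO records_left VALUES (?, ?)", left_records)
    con.executemany("INSERT INTO records_right VALUES (?, ?)", right_records)
    sql = """
        SELECT
            l.unique_id AS unique_id_l,
            r.unique_id AS unique_id_r,
            l.first_name = r.first_name AS first_name_match
        FROM records_left AS l
        CROSS JOIN records_right AS r
        ORDER BY l.unique_id, r.unique_id
    """
    rows = con.execute(sql).fetchall()
    con.close()
    return rows


# 2 left records x 3 right records gives 6 pairwise comparisons.
pairs = cartesian_pairs(
    [(1, "robin"), (2, "andy")],
    [(10, "robin"), (11, "sam"), (12, "andy")],
)
```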

RobinL · 2024-11-12T09:04:09Z

splink/internals/term_frequencies.py


    This is needed e.g. when using linker.compare_two_records
    or linker.inference.find_matches_to_new_records in which the user provides
    new records which need tf adjustments computed
    """
+    tf_cols_already_populated = []


The user can now include tf columns in the input data. If they do, we don't recompute them and use whatever the user has provided.
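A minimal sketch of that check, assuming tf columns follow the `tf_<column>` naming used in the PR description (e.g. tf_city). The `split_tf_columns` helper and the example record are hypothetical and only show the idea, not the code in term_frequencies.py.

```python
def split_tf_columns(input_columns, tf_adjusted_columns):
    """Separate columns whose tf values the user supplied from those to compute.

    input_columns: column names present in the user's input records.
    tf_adjusted_columns: columns that the model applies tf adjustments to.
    """
    tf_cols_already_populated = []
    tf_cols_to_compute = []
    for col in tf_adjusted_columns:
        if f"tf_{col}" in input_columns:
            # The user has provided the value, so use it as-is.
            tf_cols_already_populated.append(col)
        else:
            tf_cols_to_compute.append(col)
    return tf_cols_already_populated, tf_cols_to_compute


# Hypothetical input record: tf_city is hardcoded, tf_first_name is absent.
record = {"unique_id": 1, "first_name": "robin", "city": "London", "tf_city": 0.2}
populated, to_compute = split_tf_columns(record.keys(), ["first_name", "city"])
```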

ADBond

Great addition, all seems reasonable 👍

ADBond · 2024-11-12T17:20:46Z

splink/internals/realtime.py

+    settings_obj._retain_matching_columns = retain_matching_columns
+    settings_obj._retain_intermediate_calculation_columns = (
+        retain_intermediate_calculation_columns
+    )


We probably don't need this in this case, as we only instantiate the Settings in this function

RobinL and others added 18 commits November 7, 2024 14:47

improve compare two records

ee31efd

works with tf columns

511a644

add test of compare two records

0bc58f4

maintain compat with previous code

704c68e

add real time

0df455b

first attempt at realtime

ff6b7c5

remove double call

c32d2f9

caching seems to work

ada32cc

test realtime

20345f0

test with a pd merge

7ab2241

3.8 support

20a39f6

3.8 support

0bed158

date types in sqlite don't work

e91d0da

add with datetypes

1650a2b

allow found by blocking rules

a495811

hardcode values so they work across backends

f128790

fix mypy issues

69049a1

make the new realtime functions private

8553393

RobinL changed the title ~~(WIP) improve compare two records~~ Improve compare two records Nov 12, 2024

RobinL commented Nov 12, 2024

RobinL requested a review from ADBond November 12, 2024 09:16

RobinL and others added 3 commits November 12, 2024 09:41

fix mypy issues

e045354

improve caching implementation

ef682b1

mypy

eddbdb6

ADBond approved these changes Nov 12, 2024

don't need to remember settings since they're not saved

5e9a69b

RobinL merged commit fff3433 into master Nov 13, 2024
25 checks passed

RobinL deleted the improve_compare_two_records branch November 13, 2024 17:32

RobinL mentioned this pull request Dec 14, 2024

[BUG] compare_two_records fails in Spark if some values are None #2423

Closed
