Improve compare two records #2498

Merged
RobinL merged 22 commits into master from improve_compare_two_records
Nov 13, 2024

Conversation

RobinL
Member

@RobinL RobinL commented Nov 7, 2024

This PR:

  • Substantially improves the performance of compare_two_records
  • Adds new features:

Example code demoing new features here

I have left the realtime module private for now, i.e. it's essentially 'beta functionality'. I think we should use it ourselves internally for a while to make sure we're happy with it before putting it in the public API.

  • When using compare_two_records, you previously had to precompute term frequency tables. You can now explicitly pass term frequency adjustments in the input data, e.g. a column called tf_city with a hardcoded value.

  • When using compare_two_records, you previously could ONLY pass two records as dictionaries. You can now pass typed dataframes, e.g. a DuckDBPyRelation if it’s a DuckDB linker. This fixes a bug whereby it was previously not possible to run compare_two_records with complex data types, e.g. a date type column.

    • It's now possible to pass more than one record. If the user provides several records, scores for the cartesian product are computed.
  • compare_two_records previously did not produce a found_by_blocking_rules column; this is now (optionally) output.

  • compare_two_records mode previously was slow for two reasons:

    • it ran an excessive amount of code, including blocking code that is unnecessary
    • The time taken to generate the SQL was longer than the time taken to execute it.
  • I’ve now included beta functionality for running compare_two_records without a linker, which will provisionally live at splink.internals.realtime.compare_records. This ‘flavour’ of the function is deliberately designed for high performance (i.e. Core Person Record). It assumes (i.e. does not double check) that all data is present in the input records, INCLUDING term frequency adjustments. It memoizes the generated SQL so that it’s extremely fast: about 0.001 seconds for a pairwise prediction, which is at least 10x faster than before, when most of the time was spent generating code.

  • When these functions are called, the retain_intermediate_calculation_columns and retain_matching_columns settings are always set to True irrespective of the settings in the main settings object. This means users can e.g. run predict using efficient settings, but generate a handful of waterfalls using this function
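The memoization idea mentioned in the list above can be illustrated with a minimal sketch. This is not Splink's implementation: build_comparison_sql, the gamma_ column aliases and the table names records_left and records_right are hypothetical, and the sketch only assumes that the generated SQL depends solely on the set of compared columns.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def build_comparison_sql(columns: Tuple[str, ...]) -> str:
    """Hypothetical stand-in for an expensive SQL-generation step."""
    # One exact-match flag per compared column.
    select_parts = [f"(l.{col} = r.{col}) AS gamma_{col}" for col in columns]
    return (
        "SELECT "
        + ", ".join(select_parts)
        + " FROM records_left AS l CROSS JOIN records_right AS r"
    )


# The first call generates the SQL; the second is served from the cache.
first = build_comparison_sql(("first_name", "city"))
second = build_comparison_sql(("first_name", "city"))
cache_stats = build_comparison_sql.cache_info()
```

Because the SQL text is cached, repeated pairwise predictions only pay for execution, which matches the observation that code generation used to dominate the runtime.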

Supersedes #2426

@RobinL RobinL changed the title from "(WIP) improve compare two records" to "Improve compare two records" Nov 12, 2024
df_records_right,
pipeline,
in_tablename="__splink__compare_two_records_right_with_tf",
out_tablename="__splink__compare_two_records_right_with_tf_uid_fix",
uid_str="_right",
)

sqls = block_using_rules_sqls(
Member Author

If we're using cartesian blocking, we don't need to run any complex blocking code.

In addition, this code creates and materialises a list of pairwise IDs, which it uses for the join. This is unnecessary in the context of a handful of records.
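As a rough illustration of the point above, a cartesian comparison of a handful of records needs only a CROSS JOIN, with no intermediate table of ID pairs. The sketch below uses the standard-library sqlite3 module rather than Splink or DuckDB, and the function name, table names and columns are hypothetical.

```python
import sqlite3


def compare_all_pairs(left_rows, right_rows):
    """Compare every left record with every right record via a CROSS JOIN."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE records_left (unique_id, first_name)")
    con.execute("CREATE TABLE records_right (unique_id, first_name)")
    con.executemany("INSERT INTO records_left VALUES (?, ?)", left_rows)
    con.executemany("INSERT INTO records_right VALUES (?, ?)", right_rows)
    # No list of pairwise IDs is materialised; the join itself
    # produces the cartesian product.
    rows = con.execute(
        """
        SELECT l.unique_id AS unique_id_l,
               r.unique_id AS unique_id_r,
               l.first_name = r.first_name AS exact_match
        FROM records_left AS l
        CROSS JOIN records_right AS r
        ORDER BY l.unique_id, r.unique_id
        """
    ).fetchall()
    con.close()
    return rows


pairs = compare_all_pairs(
    [(1, "ann"), (2, "bob")],
    [(10, "ann"), (11, "bob"), (12, "cat")],
)
```

With two records on the left and three on the right, the query returns six scored pairs, one per element of the cartesian product.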


This is needed e.g. when using linker.compare_two_records
or linker.inference.find_matches_to_new_records in which the user provides
new records which need tf adjustments computed
"""
tf_cols_already_populated = []
Member Author

The user can now include tf columns in the input data. If they do, we do not recompute them and instead use whatever the user has provided.
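The behaviour described in this comment can be sketched as follows. This is a simplified illustration on plain dictionaries, not the code in this PR: add_missing_tf_columns and the tf_lookups structure are hypothetical, and only the tf_ column prefix (as in tf_city) comes from the description above.

```python
def add_missing_tf_columns(record, tf_lookups):
    """Fill in tf_<col> values only where the caller has not supplied them."""
    out = dict(record)  # copy so the caller's record is not mutated
    for col, lookup in tf_lookups.items():
        tf_col = f"tf_{col}"
        if tf_col in out:
            # Already populated by the user: keep the value as given.
            continue
        # Otherwise look up the term frequency; None if the value is unseen.
        out[tf_col] = lookup.get(out.get(col))
    return out


tf_lookups = {"city": {"London": 0.2, "Leeds": 0.01}}

supplied = add_missing_tf_columns({"city": "London", "tf_city": 0.5}, tf_lookups)
computed = add_missing_tf_columns({"city": "Leeds"}, tf_lookups)
unseen = add_missing_tf_columns({"city": "York"}, tf_lookups)
```

Here the first record keeps its user-supplied tf_city, the second has it looked up, and the third gets None because its city is absent from the hypothetical lookup.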

@RobinL RobinL requested a review from ADBond November 12, 2024 09:16
Contributor

@ADBond ADBond left a comment

Great addition, all seems reasonable 👍

Comment on lines 146 to 149
settings_obj._retain_matching_columns = retain_matching_columns
settings_obj._retain_intermediate_calculation_columns = (
    retain_intermediate_calculation_columns
)
Contributor

We probably don't need this in this case, as we only instantiate the Settings in this function

@RobinL RobinL merged commit fff3433 into master Nov 13, 2024
25 checks passed
@RobinL RobinL deleted the improve_compare_two_records branch November 13, 2024 17:32