Improve compare two records #2498
df_records_right,
pipeline,
in_tablename="__splink__compare_two_records_right_with_tf",
out_tablename="__splink__compare_two_records_right_with_tf_uid_fix",
uid_str="_right",
)

sqls = block_using_rules_sqls(
If we're using cartesian blocking, we don't need to run any complex blocking code.
In addition, this code creates and materialises a list of pairwise IDs, which it uses for the join. This is unnecessary in the context of a handful of records.
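As a rough illustration of this point (not the code in this PR): when every left record is compared with every right record, a plain cross join produces the pairs directly, with no intermediate table of pairwise IDs. The sketch uses the standard-library `sqlite3` module rather than one of Splink's backends, and the table names `records_left` and `records_right` and their columns are hypothetical.

```python
import sqlite3

# In-memory database with two tiny, made-up input tables.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE records_left (unique_id INTEGER, first_name TEXT);
    CREATE TABLE records_right (unique_id INTEGER, first_name TEXT);
    INSERT INTO records_left VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO records_right VALUES (10, 'Anne'), (11, 'Rob'), (12, 'Bob');
    """
)

# Cartesian blocking: every left record is paired with every right record,
# so the comparison table is simply a cross join of the two inputs.
pairs = con.execute(
    """
    SELECT l.unique_id  AS unique_id_l,
           r.unique_id  AS unique_id_r,
           l.first_name AS first_name_l,
           r.first_name AS first_name_r
    FROM records_left AS l
    CROSS JOIN records_right AS r
    ORDER BY unique_id_l, unique_id_r
    """
).fetchall()
```

With two left records and three right records this yields six candidate pairs, which is cheap enough that no separate pair-ID table is needed.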
This is needed e.g. when using linker.compare_two_records
or linker.inference.find_matches_to_new_records in which the user provides
new records which need tf adjustments computed
"""
tf_cols_already_populated = []
The user can now include tf columns in the input data. If they do, we do not recompute them, and use whatever the user has provided.
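A minimal sketch of that check, assuming the tf columns follow the `tf_<column>` naming used elsewhere in this PR. The helper `split_tf_columns` and the sample record are hypothetical and only illustrate the idea behind `tf_cols_already_populated`; they are not the implementation in the diff.

```python
def split_tf_columns(input_columns, tf_column_names):
    """Partition the required tf columns into those the user already
    supplied and those that still need to be computed."""
    supplied = set(input_columns)
    already_populated = [c for c in tf_column_names if c in supplied]
    to_compute = [c for c in tf_column_names if c not in supplied]
    return already_populated, to_compute


# Hypothetical input record: tf_city is hardcoded, tf_first_name is absent.
record = {"unique_id": 1, "first_name": "Ann", "city": "Leeds", "tf_city": 0.02}

already, missing = split_tf_columns(record.keys(), ["tf_first_name", "tf_city"])
# `already` lists columns to pass through untouched;
# `missing` lists columns that would still need a tf lookup.
```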
Great addition, all seems reasonable 👍
splink/internals/realtime.py
Outdated
settings_obj._retain_matching_columns = retain_matching_columns
settings_obj._retain_intermediate_calculation_columns = (
    retain_intermediate_calculation_columns
)
We probably don't need this in this case, as we only instantiate the Settings in this function.
This PR improves `compare_two_records`.
Example code demoing new features here
I have left the realtime module private for now, i.e. essentially it's 'beta functionality'. I think we should start using this ourselves internally for a bit to make sure we're happy with it, before putting it in the public API.
- When using `compare_two_records`, you previously had to precompute term frequency tables. You can now explicitly pass in term frequency adjustments in the input data, e.g. a column called `tf_city` with a hardcoded value.
- When using `compare_two_records`, you previously could ONLY pass two records as dictionaries. You can now pass typed dataframes, e.g. a `DuckDBPyRelation` if it's a DuckDB linker. This fixes a bug whereby it was previously not possible to run `compare_two_records` with complex data types, e.g. a date type column.
- `compare_two_records` previously did not produce a `found_by_blocking_rules` column; this is now (optionally) output.
- `compare_two_records` mode previously was slow for two reasons:
- I've now included beta functionality for running `compare_two_records` without a linker, which will provisionally be at `splink.internals.realtime.compare_records`. This 'flavour' of the function is deliberately designed for high performance (i.e. Core Person Record). It assumes (i.e. does not double check) that all data is present in the input records, INCLUDING term frequency adjustments. It memoizes generated SQL code so that it's extremely fast (0.001 seconds for a pairwise prediction, which is at least 10x faster; previously most of the time taken was generating code).
- When these functions are called, the `retain_intermediate_calculation_columns` and `retain_matching_columns` settings are always set to `True`, irrespective of the settings in the main settings object. This means users can e.g. run predict using efficient settings, but generate a handful of waterfalls using this function.

Supersedes #2426
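To illustrate why memoizing generated SQL helps, here is a minimal sketch using `functools.lru_cache`. It is an assumption about the general technique, not the caching code in `splink.internals.realtime`: the function `build_comparison_sql`, its arguments, and the table name `input_pairs` are all hypothetical stand-ins.

```python
from functools import lru_cache

# Counts how often the (pretend) expensive generation step actually runs.
generation_calls = {"n": 0}


@lru_cache(maxsize=None)
def build_comparison_sql(settings_key: str, input_columns: tuple) -> str:
    """Hypothetical stand-in for SQL generation, cached on its inputs.

    Arguments must be hashable, hence a tuple of column names rather
    than a list.
    """
    generation_calls["n"] += 1
    cols = ", ".join(input_columns)
    return f"SELECT {cols} FROM input_pairs /* settings: {settings_key} */"


columns = ("first_name_l", "first_name_r", "tf_city_l", "tf_city_r")

# The first call generates the SQL; the second is served from the cache.
sql_first = build_comparison_sql("model_v1", columns)
sql_second = build_comparison_sql("model_v1", columns)
```

When the same settings and input columns recur across calls, only the first call pays the generation cost, which matches the observation above that code generation dominated the time for a pairwise prediction.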