Cluster metrics #1677
Conversation
I'll defer to @ADBond for detailed feedback, but from my point of view this is a good start to making metrics available, facilitating the hopefully trivial next step of adding chosen metrics to edge/cluster tables on request so they are available to cluster studio etc.
Thank you @samnlindsay and @ADBond, have addressed both your comments and the method still seems to be working as intended 🤞. Also made a small name change to an arg of the SQL-generating function.
Thanks for making those changes @zslade! Just one small issue left about dealing with default unique_id columns, but after that is addressed, happy for you to go ahead and merge this 👍
Co-authored-by: ADBond <[email protected]>
    self,
    df_predict: SplinkDataFrame,
    df_clustered: SplinkDataFrame,
    threshold_match_probability: float = None,
I think you get an error if this is set to None so perhaps it shouldn't have a default argument?
Reprex:
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        exact_match_rule("surname"),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}

linker = DuckDBLinker(df, settings)
# linker.profile_columns(
#     ["first_name", "surname", "first_name || surname", "concat(city, first_name)"]
# )
linker.estimate_u_using_random_sampling(target_rows=1e6)

blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

df_predict = linker.predict()
df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
linker._compute_cluster_metrics(df_predict, df_clustered)
Thanks for spotting this! Have removed the default. Adding this and other fixes to a new PR here.
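For reference, a minimal sketch of what the signature looks like with the default removed. This is illustrative only: the SplinkDataFrame import path is assumed, and the method body is elided.

from splink.splink_dataframe import SplinkDataFrame  # import path assumed

def _compute_cluster_metrics(
    self,
    df_predict: SplinkDataFrame,
    df_clustered: SplinkDataFrame,
    threshold_match_probability: float,  # now required, so it can no longer silently be None
):
    ...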
    count(*) AS n_nodes,
    sum(e.count_edges) AS n_edges
FROM {clusters_table} AS c
LEFT JOIN __splink__count_edges e ON c.{_unique_id_col} = e.{unique_id_col_l}
I think this join needs to account for the source dataset
Reprex:
import pandas as pd
from IPython.display import display

from splink.duckdb.duckdb_comparison_library import (
    exact_match,
)
from splink.duckdb.duckdb_linker import DuckDBLinker

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}

df_1 = [
    {"unique_id": 1, "first_name": "Tom", "surname": "Fox", "dob": "1980-01-01"},
    {"unique_id": 2, "first_name": "Amy", "surname": "Lee", "dob": "1980-01-01"},
]
df_2 = [
    {"unique_id": 1, "first_name": "Bob", "surname": "Ray", "dob": "1999-09-22"},
    {"unique_id": 2, "first_name": "Amy", "surname": "Lee", "dob": "1980-01-01"},
]
df_1 = pd.DataFrame(df_1)
df_2 = pd.DataFrame(df_2)

linker = DuckDBLinker(
    [df_1, df_2], settings, input_table_aliases=["df_left", "df_right"]
)

df_predict = linker.predict()
display(df_predict.as_pandas_dataframe())

df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
display(df_clustered.as_pandas_dataframe().sort_values("cluster_id"))

linker.debug_mode = True
linker._compute_cluster_metrics(df_predict, df_clustered, 0.9).as_pandas_dataframe()
There are a few Splink functions that may help with solving this. Maybe _unique_id_input_columns, or possibly here:

splink/unique_id_concat.py, line 1 (at a9f5424):
CONCAT_SEPARATOR = "-__-"
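To illustrate the suggestion, a hedged sketch of a source-dataset-aware join, reusing CONCAT_SEPARATOR. The table and column names here (clusters, source_dataset, unique_id, __splink__count_edges) are placeholders standing in for Splink's internals, not the actual implementation.

# Sketch only: build join keys that include the source dataset, so nodes from
# df_left and df_right that share the same unique_id are no longer conflated.
CONCAT_SEPARATOR = "-__-"  # as in splink/unique_id_concat.py

node_key = f"concat(c.source_dataset, '{CONCAT_SEPARATOR}', c.unique_id)"
edge_key = f"concat(e.source_dataset_l, '{CONCAT_SEPARATOR}', e.unique_id_l)"

sql = f"""
SELECT
    count(*) AS n_nodes,
    sum(e.count_edges) AS n_edges
FROM clusters AS c
LEFT JOIN __splink__count_edges AS e
    ON {node_key} = {edge_key}
"""
print(sql)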
Thank you. I think I've got it working correctly with the changes made here 🙏
Type of PR
Is your Pull Request linked to an existing Issue or Pull Request?
Related to #1538, #1001, #539
Give a brief description for the solution you have provided
1. New method _compute_cluster_metrics added to the linker for computing cluster metrics, placed alongside cluster_pairwise_predictions_at_threshold as closely related, with a metrics option for the user to select which metrics they wish to be computed, e.g. "default", "all", or a custom list ["size", "is_bridge"] (see the usage sketch below).
2. New script cluster_metrics.py added to contain the functions for generating the SQL for computing the different metrics.
3. Test added for _compute_cluster_metrics.
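For illustration, a usage sketch pieced together from the reprexes above; the exact signature may differ in the merged version.

# Assumes a configured linker, as in the reprexes earlier in this thread.
df_predict = linker.predict()
df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)

# Compute per-cluster metrics (e.g. n_nodes, n_edges) at the same threshold
cluster_metrics = linker._compute_cluster_metrics(
    df_predict, df_clustered, threshold_match_probability=0.9
)
cluster_metrics.as_pandas_dataframe()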
Thank you @ADBond and @ThomasHepworth for your help and input
PR Checklist