
Cluster metrics #1677

Merged: zslade merged 14 commits into master from cluster_metrics on Nov 8, 2023

Conversation

@zslade (Contributor) commented Oct 27, 2023:

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Related to #1538, #1001, #539

Give a brief description for the solution you have provided

1. New method added to linker for computing cluster metrics.

  • I wasn't sure of the best location, so I have placed it below cluster_pairwise_predictions_at_threshold since the two are closely related.
  • Once more metrics are added, the idea is to include a metrics option so the user can select which metrics they wish to have computed, e.g. "default", "all", or a custom list ["size", "is_bridge"] (see the sketch after this list).

2. New script cluster_metrics.py added to contain functions for generating the SQL for computing different metrics.

3. Test added for _compute_cluster_metrics

  • Test passes
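
For illustration, a minimal sketch of how such a `metrics` option might resolve to a concrete list of metric names (a hypothetical helper; `size` and `is_bridge` are the example metrics from the list above, not an existing Splink API):

```python
# Hypothetical sketch, not actual Splink code: resolve the user's `metrics`
# option ("default", "all", or a custom list) to a list of metric names.
DEFAULT_METRICS = ["size"]
ALL_METRICS = ["size", "is_bridge"]

def _resolve_metrics(metrics="default"):
    if metrics == "default":
        return DEFAULT_METRICS
    if metrics == "all":
        return ALL_METRICS
    return list(metrics)  # custom selection, e.g. ["size", "is_bridge"]
```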

Thank you @ADBond and @ThomasHepworth for your help and input

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zslade zslade marked this pull request as ready for review October 31, 2023 17:05
@zslade zslade requested review from samnlindsay and ADBond October 31, 2023 17:05
Two review comments on splink/cluster_metrics.py (outdated, resolved)
@samnlindsay (Contributor) commented:

I'll defer to @ADBond for useful feedback, but from my point of view this is a good start to making metrics available. It should make it a hopefully trivial next step to add chosen metrics to the edge/cluster tables on request, so they are available to cluster studio etc.

@zslade (Contributor, Author) commented Nov 7, 2023:

Thank you @samnlindsay and @ADBond

@ADBond, I have addressed both your comments and the method still seems to be working as intended 🤞.

I also made a small name change to an arg of the SQL-generating function, renaming it to _unique_id_col, as I think it's clearer and more in keeping with how things are written in the rest of the codebase.

@ADBond (Contributor) left a comment:

Thanks for making those changes @zslade! Just one small issue left about dealing with default unique_id columns, but once that is addressed, happy for you to go ahead and merge this 👍

Review comment on splink/linker.py (outdated, resolved)
@zslade zslade merged commit 442810b into master Nov 8, 2023
8 checks passed
@zslade zslade deleted the cluster_metrics branch November 8, 2023 16:18
Post-merge review on the diff:

```python
self,
df_predict: SplinkDataFrame,
df_clustered: SplinkDataFrame,
threshold_match_probability: float = None,
```
@RobinL (Member) commented:

I think you get an error if this is set to None, so perhaps it shouldn't have a default argument?

Reprex:

```python
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        exact_match_rule("surname"),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings)

# linker.profile_columns(
#     ["first_name", "surname", "first_name || surname", "concat(city, first_name)"]
# )


linker.estimate_u_using_random_sampling(target_rows=1e6)


blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


df_predict = linker.predict()
df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
linker._compute_cluster_metrics(df_predict, df_clustered)
```

@zslade (Contributor, Author) replied:

Thanks for spotting this! I have removed the default. I'm adding this and other fixes in a new PR here.
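
For reference, a minimal sketch of the fix as described: the signature from the diff above with the default removed, so the threshold must be passed explicitly (the `...` body stands in for the method's implementation):

```python
from splink.splink_dataframe import SplinkDataFrame

def _compute_cluster_metrics(
    self,
    df_predict: SplinkDataFrame,
    df_clustered: SplinkDataFrame,
    threshold_match_probability: float,  # no default: passing None raised an error
):
    ...
```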

```sql
count(*) AS n_nodes,
sum(e.count_edges) AS n_edges
FROM {clusters_table} AS c
LEFT JOIN __splink__count_edges e ON c.{_unique_id_col} = e.{unique_id_col_l}
```
@RobinL (Member) commented:

I think this join needs to account for the source dataset: in the reprex below, the two input tables share unique_id values, so joining on the unique id alone conflates records from different source datasets.

Reprex:

```python
import pandas as pd
from IPython.display import display

from splink.duckdb.duckdb_comparison_library import (
    exact_match,
)
from splink.duckdb.duckdb_linker import DuckDBLinker

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}


df_1 = [
    {"unique_id": 1, "first_name": "Tom", "surname": "Fox", "dob": "1980-01-01"},
    {"unique_id": 2, "first_name": "Amy", "surname": "Lee", "dob": "1980-01-01"},
]


df_2 = [
    {"unique_id": 1, "first_name": "Bob", "surname": "Ray", "dob": "1999-09-22"},
    {"unique_id": 2, "first_name": "Amy", "surname": "Lee", "dob": "1980-01-01"},
]

df_1 = pd.DataFrame(df_1)
df_2 = pd.DataFrame(df_2)

linker = DuckDBLinker(
    [df_1, df_2], settings, input_table_aliases=["df_left", "df_right"]
)


df_predict = linker.predict()
display(df_predict.as_pandas_dataframe())


df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
display(df_clustered.as_pandas_dataframe().sort_values("cluster_id"))

linker.debug_mode = True
linker._compute_cluster_metrics(df_predict, df_clustered, 0.9).as_pandas_dataframe()
```

@RobinL (Member) commented Nov 22, 2023:

[screenshot of the erroneous output omitted]

@RobinL (Member) commented Nov 22, 2023:

There are a few Splink functions that may help with solving this. Maybe _unique_id_input_columns, or possibly here:

```python
CONCAT_SEPARATOR = "-__-"
```
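
One possible shape for the fix, sketched in the style of the SQL-generating functions in cluster_metrics.py: join on the source dataset as well as the unique id. The source_dataset / source_dataset_l column names, the helper name, and the surrounding SELECT are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch only: column names (source_dataset, source_dataset_l),
# the helper name, and the SELECT shape are assumed, not taken from this PR.
def _size_density_sql(clusters_table, _unique_id_col, unique_id_col_l):
    return f"""
    SELECT
        c.cluster_id,
        count(*) AS n_nodes,
        sum(e.count_edges) AS n_edges
    FROM {clusters_table} AS c
    LEFT JOIN __splink__count_edges e
        ON c.source_dataset = e.source_dataset_l
        AND c.{_unique_id_col} = e.{unique_id_col_l}
    GROUP BY c.cluster_id
    """
```

Alternatively, the two columns could be concatenated into a single join key, which appears to be what the CONCAT_SEPARATOR constant above is for.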

@zslade (Contributor, Author) replied:

Thank you. I think I've got it working correctly with the changes made here 🙏

@zslade zslade mentioned this pull request Nov 23, 2023
@zslade zslade mentioned this pull request Mar 5, 2024