
Cluster metrics #1677

Merged: zslade merged 14 commits into master from cluster_metrics on Nov 8, 2023

Conversation

@zslade (Contributor) commented Oct 27, 2023:

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Related to #1538, #1001, #539

Give a brief description for the solution you have provided

1. New method added to linker for computing cluster metrics.

  • I wasn't sure of the best location, so I have placed it below cluster_pairwise_predictions_at_threshold since the two are closely related.
  • Once more metrics are added, the idea is to include a metrics option so the user can select which metrics they wish to have computed, e.g. "default", "all", or a custom list ["size", "is_bridge"] (see the sketch after this list).

2. New script cluster_metrics.py added to contain functions for generating the SQL for computing different metrics.

3. Test added for _compute_cluster_metrics

  • Test passes
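
For illustration, a minimal sketch of how such a `metrics` option might resolve to a concrete list of metric names (a hypothetical helper; `size` and `is_bridge` are the example metrics from the list above, not an existing Splink API):

```python
# Hypothetical sketch, not actual Splink code: resolve the user's `metrics`
# option ("default", "all", or a custom list) to a list of metric names.
DEFAULT_METRICS = ["size"]
ALL_METRICS = ["size", "is_bridge"]

def _resolve_metrics(metrics="default"):
    if metrics == "default":
        return DEFAULT_METRICS
    if metrics == "all":
        return ALL_METRICS
    return list(metrics)  # custom selection, e.g. ["size", "is_bridge"]
```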

Thank you @ADBond and @ThomasHepworth for your help and input

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zslade zslade marked this pull request as ready for review October 31, 2023 17:05
@zslade zslade requested review from samnlindsay and ADBond October 31, 2023 17:05
Two review comments on splink/cluster_metrics.py (outdated, resolved)
@samnlindsay (Contributor) commented:

I'll defer to @ADBond for useful feedback, but from my point of view this is a good start to making metrics available. It should make it a hopefully trivial next step to add chosen metrics to the edge/cluster tables on request, so they are available to cluster studio etc.

@zslade (Contributor, Author) commented Nov 7, 2023:

Thank you @samnlindsay and @ADBond

@ADBond, I have addressed both your comments and the method still seems to be working as intended 🤞.

I also made a small name change to an arg of the SQL-generating function, renaming it to _unique_id_col, as I think it's clearer and more in keeping with how things are written in the rest of the codebase.

@ADBond (Contributor) left a comment:

Thanks for making those changes @zslade! Just one small issue left about dealing with default unique_id columns, but once that is addressed, happy for you to go ahead and merge this 👍

Review comment on splink/linker.py (outdated, resolved)
@zslade zslade merged commit 442810b into master Nov 8, 2023
8 checks passed
@zslade zslade deleted the cluster_metrics branch November 8, 2023 16:18
Post-merge review on the diff:

```python
self,
df_predict: SplinkDataFrame,
df_clustered: SplinkDataFrame,
threshold_match_probability: float = None,
```
@RobinL (Member) commented:

I think you get an error if this is set to None, so perhaps it shouldn't have a default argument?

Reprex:

```python
from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        exact_match_rule("surname"),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
    "max_iterations": 10,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings)

# linker.profile_columns(
#     ["first_name", "surname", "first_name || surname", "concat(city, first_name)"]
# )


linker.estimate_u_using_random_sampling(target_rows=1e6)


blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)


df_predict = linker.predict()
df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
linker._compute_cluster_metrics(df_predict, df_clustered)
```

@zslade (Contributor, Author) replied:

Thanks for spotting this! I have removed the default. I'm adding this and other fixes in a new PR here.
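
For reference, a minimal sketch of the fix as described: the signature from the diff above with the default removed, so the threshold must be passed explicitly (the `...` body stands in for the method's implementation):

```python
from splink.splink_dataframe import SplinkDataFrame

def _compute_cluster_metrics(
    self,
    df_predict: SplinkDataFrame,
    df_clustered: SplinkDataFrame,
    threshold_match_probability: float,  # no default: passing None raised an error
):
    ...
```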

```sql
count(*) AS n_nodes,
sum(e.count_edges) AS n_edges
FROM {clusters_table} AS c
LEFT JOIN __splink__count_edges e ON c.{_unique_id_col} = e.{unique_id_col_l}
```
@RobinL (Member) commented:

I think this join needs to account for the source dataset: in the reprex below, the two input tables share unique_id values, so joining on the unique id alone conflates records from different source datasets.

Reprex:

```python
import pandas as pd
from IPython.display import display

from splink.duckdb.duckdb_comparison_library import (
    exact_match,
)
from splink.duckdb.duckdb_linker import DuckDBLinker

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "link_only",
    "comparisons": [
        exact_match("first_name"),
        exact_match("surname"),
        exact_match("dob"),
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
}


df_1 = [
    {"unique_id": 1, "first_name": "Tom", "surname": "Fox", "dob": "1980-01-01"},
    {"unique_id": 2, "first_name": "Amy", "surname": "Lee", "dob": "1980-01-01"},
]


df_2 = [
    {"unique_id": 1, "first_name": "Bob", "surname": "Ray", "dob": "1999-09-22"},
    {"unique_id": 2, "first_name": "Amy", "surname": "Lee", "dob": "1980-01-01"},
]

df_1 = pd.DataFrame(df_1)
df_2 = pd.DataFrame(df_2)

linker = DuckDBLinker(
    [df_1, df_2], settings, input_table_aliases=["df_left", "df_right"]
)


df_predict = linker.predict()
display(df_predict.as_pandas_dataframe())


df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
display(df_clustered.as_pandas_dataframe().sort_values("cluster_id"))

linker.debug_mode = True
linker._compute_cluster_metrics(df_predict, df_clustered, 0.9).as_pandas_dataframe()
```

@RobinL (Member) commented Nov 22, 2023:

[screenshot of the erroneous output omitted]

@RobinL (Member) commented Nov 22, 2023:

There are a few Splink functions that may help with solving this. Maybe _unique_id_input_columns, or possibly here:

```python
CONCAT_SEPARATOR = "-__-"
```
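
One possible shape for the fix, sketched in the style of the SQL-generating functions in cluster_metrics.py: join on the source dataset as well as the unique id. The source_dataset / source_dataset_l column names, the helper name, and the surrounding SELECT are assumptions for illustration, not code from this PR:

```python
# Hypothetical sketch only: column names (source_dataset, source_dataset_l),
# the helper name, and the SELECT shape are assumed, not taken from this PR.
def _size_density_sql(clusters_table, _unique_id_col, unique_id_col_l):
    return f"""
    SELECT
        c.cluster_id,
        count(*) AS n_nodes,
        sum(e.count_edges) AS n_edges
    FROM {clusters_table} AS c
    LEFT JOIN __splink__count_edges e
        ON c.source_dataset = e.source_dataset_l
        AND c.{_unique_id_col} = e.{unique_id_col_l}
    GROUP BY c.cluster_id
    """
```

Alternatively, the two columns could be concatenated into a single join key, which appears to be what the CONCAT_SEPARATOR constant above is for.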

@zslade (Contributor, Author) replied:

Thank you. I think I've got it working correctly with the changes made here 🙏

@zslade zslade mentioned this pull request Nov 23, 2023
@zslade zslade mentioned this pull request Mar 5, 2024