return data class instead of dictionary #1887

zslade · 2024-01-25T14:08:54Z

Type of PR

BUG
FEAT
MAINT
DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Give a brief description for the solution you have provided

Following user feedback, the _compute_graph_metrics() method has been updated to return a data class, instead of a dictionary of splink dataframes.

This provides a more familiar API syntax, e.g. compute_graph_metrics.nodes
A repr description has been included to give a useful print out to the user
Can be unpacked like a tuple (instructions included in the repr)
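
For readers skimming the thread, here is a minimal sketch of the dictionary-to-data-class pattern described above. It is not the code in this PR: the class name `GraphMetricsSketch` and the repr wording are hypothetical, and plain strings stand in for Splink dataframes.

```python
from dataclasses import dataclass
from typing import Any


# repr=False so the hand-written __repr__ below is clearly the one in use.
@dataclass(repr=False)
class GraphMetricsSketch:
    """Hypothetical stand-in for the data class returned by the method."""

    nodes: Any  # assumption: these would be Splink dataframes in practice
    edges: Any
    clusters: Any

    def __repr__(self) -> str:
        # Illustrative wording only, not the repr added in this PR.
        return "Graph metrics: access via .nodes, .edges and .clusters"


# Placeholder values instead of real Splink dataframes.
metrics = GraphMetricsSketch(nodes="node_df", edges="edge_df", clusters="cluster_df")

# Attribute access replaces dictionary-style lookup such as metrics["nodes"].
print(metrics.nodes)
print(metrics)
```

The point of the sketch is only the access style: named attributes are discoverable through autocomplete, and the repr gives the user a hint about what the object holds.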

I have tested on the fake_1000 dataset. To reproduce:

    from splink.datasets import splink_datasets
    from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
    from splink.duckdb.comparison_library import (
        exact_match,
        levenshtein_at_thresholds,
    )
    from splink.duckdb.linker import DuckDBLinker
    
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    df = splink_datasets.fake_1000
    
    settings = {
        "probability_two_random_records_match": 0.01,
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": [
            block_on(["first_name"]),
            exact_match_rule("surname"),
        ],
        "comparisons": [
            levenshtein_at_thresholds("first_name", 2),
            exact_match("surname"),
            exact_match("dob"),
            exact_match("city", term_frequency_adjustments=True),
            exact_match("email"),
        ],
        "retain_intermediate_calculation_columns": True,
        "additional_columns_to_retain": ["cluster"],
        "max_iterations": 10,
        "em_convergence": 0.01,
    }
    
    
    linker = DuckDBLinker(df, settings)
    
    linker.estimate_u_using_random_sampling(target_rows=1e6)
    
    
    blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
    
    
    blocking_rule = "l.dob = r.dob"
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
    
    
    df_predict = linker.predict()
    df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
    
    df_graph_metrics = linker._compute_graph_metrics(
        df_predict=df_predict, df_clustered=df_clustered, threshold_match_probability=0.9
    )
    
    node_metrics, edge_metrics, cluster_metrics = (
        df_graph_metrics.nodes,
        df_graph_metrics.edges,
        df_graph_metrics.clusters,
    )

    node_metrics.as_pandas_dataframe()

PR Checklist

Added documentation for changes
Added feature to example notebooks or tutorial (if appropriate)
Added tests (if appropriate)
Updated CHANGELOG.md (if appropriate)
Made changes based off the latest version of Splink
Run the linter

ADBond

Looks good - definitely think this is a bit nicer to use.

Couple of comments, but happy for you to decide what to do with them. If you do decide to implement the iteration, I'll be happy to have another look over afterwards.

ADBond · 2024-01-29T10:34:47Z

splink/cluster_metrics.py

+```node_metrics, edge_metrics, cluster_metrics = (
+    df_graph_metrics.nodes,
+    df_graph_metrics.edges,
+    df_graph_metrics.clusters,
+)```


I think that the suggestion was to be able to unpack directly, node_metrics, edge_metrics, cluster_metrics = df_graph_metrics, or more likely in practice:

node_metrics, edge_metrics, cluster_metrics = linker.compute_graph_metrics(...)

so that you can immediately bypass the class itself if you want. Might want to check with Tom though if that's what he meant.

I'm personally not massively fussed about this feature though, so happy to leave if it's going to be too fiddly

I'm happy to leave for now and implement at a later date if the user need becomes apparent :)
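
If direct unpacking is picked up later, one possible approach is to define `__iter__` on the data class. The sketch below is an assumption about how that could look, not something implemented in this PR; `UnpackableMetricsSketch` is a hypothetical name and the string values are placeholders for Splink dataframes.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class UnpackableMetricsSketch:
    """Hypothetical data class that supports tuple-style unpacking."""

    nodes: Any
    edges: Any
    clusters: Any

    def __iter__(self) -> Iterator[Any]:
        # Yield the fields in a fixed order so that
        # `a, b, c = instance` assigns nodes, edges, clusters.
        yield self.nodes
        yield self.edges
        yield self.clusters


# Unpack directly, bypassing the class itself.
node_metrics, edge_metrics, cluster_metrics = UnpackableMetricsSketch(
    nodes="node_df", edges="edge_df", clusters="cluster_df"
)
```

The trade-off is that the unpacking order becomes part of the API, so adding or reordering fields later would break callers who unpack positionally.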

ADBond · 2024-01-29T10:37:50Z

splink/cluster_metrics.py

+        return """
+A data class of Splink dataframes containing metrics for nodes, edges and clusters.
+
+Access dataframes via attributes:
+`compute_graph_metrics.nodes` for node metrics,
+`compute_graph_metrics.edges` for edge metrics and


Just personal preference but I always find it weird having multi-line strings that mess with the indentation like this - I think that using implicit continuation and explicit newline "\n" characters reads clearer (see for example SplinkDataFrame.__repr__).

But not a big deal at all, so am fine if I'm in the minority on this

This is really useful feedback! It looks icky to me too so have updated 👍
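
To make the two styles concrete, here is a small comparison using only the opening lines of the repr text quoted in the diff. The function names are hypothetical, and this is a sketch of the style point rather than the final repr in the PR.

```python
def repr_triple_quoted() -> str:
    # The text has to start at column 0, which breaks the visual
    # indentation of the surrounding code.
    return """
A data class of Splink dataframes containing metrics for nodes, edges and clusters.

Access dataframes via attributes:
"""


def repr_implicit_continuation() -> str:
    # Adjacent string literals are concatenated by Python, so the code
    # stays indented and every newline is explicit.
    return (
        "\n"
        "A data class of Splink dataframes containing metrics for "
        "nodes, edges and clusters.\n"
        "\n"
        "Access dataframes via attributes:\n"
    )


# Both styles build the same string.
same_output = repr_triple_quoted() == repr_implicit_continuation()
print(same_output)
```

Both functions return identical text, so the choice is purely about how the source reads.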

zslade added 8 commits January 25, 2024 14:08

return data class instead of dictionary

a0ee952

add repr

0bee649

impose keyword args

3bdffed

lint

fbdeb6f

add description for unpacking

ac5f417

update tests

e043d3e

fix keyword arg in tests

11b22de

fix test

979723a

zslade marked this pull request as ready for review January 25, 2024 16:04

zslade requested a review from ADBond January 25, 2024 16:04

ADBond approved these changes Jan 29, 2024

zslade added 3 commits January 29, 2024 14:44

Merge branch 'master' into metrics_dataclass

d5991aa

improve repr

bb5e581

lint

a0d91cd

zslade merged commit 47f7d20 into master Jan 29, 2024
10 checks passed

zslade deleted the metrics_dataclass branch January 29, 2024 15:20

zslade mentioned this pull request Mar 5, 2024

Make graph metrics public #2027

Merged

10 tasks
