return data class instead of dictionary #1887

Merged: 11 commits merged into master from metrics_dataclass on Jan 29, 2024

Conversation

@zslade zslade commented Jan 25, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

#1677 and #1806

Give a brief description for the solution you have provided

Following user feedback, the `_compute_graph_metrics()` method has been updated to return a data class instead of a dictionary of Splink dataframes.

  • This provides a more familiar API syntax, e.g. compute_graph_metrics.nodes
  • A repr description has been included to give a useful print out to the user
  • Can be unpacked like a tuple (instructions included in the repr)
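
As a rough illustration of the pattern described in the bullets above, here is a minimal sketch of a data class with attribute access and a custom repr. The class name `GraphMetricsSketch` and the string placeholders are hypothetical and are not taken from the PR; in Splink the fields would hold Splink dataframes.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class GraphMetricsSketch:  # hypothetical name, for illustration only
    nodes: Any  # stand-in for a Splink dataframe of node metrics
    edges: Any  # stand-in for a Splink dataframe of edge metrics
    clusters: Any  # stand-in for a Splink dataframe of cluster metrics

    def __repr__(self) -> str:
        # A repr defined in the class body is kept by @dataclass,
        # so the user sees this guidance instead of the field dump.
        return (
            "Graph metrics for nodes, edges and clusters.\n"
            "Access dataframes via the attributes "
            ".nodes, .edges and .clusters"
        )


# Placeholder strings are used here instead of real dataframes.
results = GraphMetricsSketch(nodes="node_df", edges="edge_df", clusters="cluster_df")
print(results.nodes)
print(results)
```

Attribute access replaces dictionary lookups such as `results["nodes"]`, and printing the object shows the guidance text from `__repr__`.
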
I have tested on the fake_1000 dataset. To reproduce:
    from splink.datasets import splink_datasets
    from splink.duckdb.blocking_rule_library import block_on, exact_match_rule
    from splink.duckdb.comparison_library import (
        exact_match,
        levenshtein_at_thresholds,
    )
    from splink.duckdb.linker import DuckDBLinker
    
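    import ssl
    
    # Workaround for SSL certificate errors (e.g. when the dataset is downloaded).
    # This disables certificate verification, so use it for local testing only.
    ssl._create_default_https_context = ssl._create_unverified_context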
    
    df = splink_datasets.fake_1000
    
    settings = {
        "probability_two_random_records_match": 0.01,
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": [
            block_on(["first_name"]),
            exact_match_rule("surname"),
        ],
        "comparisons": [
            levenshtein_at_thresholds("first_name", 2),
            exact_match("surname"),
            exact_match("dob"),
            exact_match("city", term_frequency_adjustments=True),
            exact_match("email"),
        ],
        "retain_intermediate_calculation_columns": True,
        "additional_columns_to_retain": ["cluster"],
        "max_iterations": 10,
        "em_convergence": 0.01,
    }
    
    
    linker = DuckDBLinker(df, settings)
    
    linker.estimate_u_using_random_sampling(target_rows=1e6)
    
    
    blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
    
    
    blocking_rule = "l.dob = r.dob"
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
    
    
    df_predict = linker.predict()
    df_clustered = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)
    
    df_graph_metrics = linker._compute_graph_metrics(
        df_predict=df_predict, df_clustered=df_clustered, threshold_match_probability=0.9
    )
    
    node_metrics, edge_metrics, cluster_metrics = (
        df_graph_metrics.nodes,
        df_graph_metrics.edges,
        df_graph_metrics.clusters,
    )

    node_metrics.as_pandas_dataframe()

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zslade zslade marked this pull request as ready for review January 25, 2024 16:04
@zslade zslade requested a review from ADBond January 25, 2024 16:04
@ADBond ADBond left a comment

Looks good - definitely think this is a bit nicer to use.

A couple of comments, but I'm happy for you to decide what to do with them. If you do decide to implement the iteration, I'll be happy to have another look afterwards.

Comment on lines 159 to 163
```
node_metrics, edge_metrics, cluster_metrics = (
    df_graph_metrics.nodes,
    df_graph_metrics.edges,
    df_graph_metrics.clusters,
)
```
Contributor

I think that the suggestion was to be able to unpack directly, `node_metrics, edge_metrics, cluster_metrics = df_graph_metrics`, or more likely in practice:

node_metrics, edge_metrics, cluster_metrics = linker.compute_graph_metrics(...)

so that you can immediately bypass the class itself if you want. Might want to check with Tom though if that's what he meant.

I'm personally not massively fussed about this feature though, so happy to leave if it's going to be too fiddly

Contributor Author

I'm happy to leave for now and implement at a later date if the user need becomes apparent :)
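
For reference, the direct unpacking discussed in this thread could be supported by making the data class iterable. The following is only a sketch of one possible approach, not what was merged: the class name `UnpackableMetrics` and the stub function are hypothetical, and strings stand in for Splink dataframes.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class UnpackableMetrics:  # hypothetical name, not part of Splink
    nodes: Any
    edges: Any
    clusters: Any

    def __iter__(self) -> Iterator[Any]:
        # Yield the fields in a fixed order so that tuple-style
        # unpacking always gives nodes, edges, clusters.
        yield self.nodes
        yield self.edges
        yield self.clusters


def compute_metrics_stub() -> UnpackableMetrics:
    # Hypothetical stand-in for a method that returns the data class.
    return UnpackableMetrics("node_df", "edge_df", "cluster_df")


# Unpack directly from the call, bypassing the class itself.
node_metrics, edge_metrics, cluster_metrics = compute_metrics_stub()
```

The trade-off is that the unpacking order becomes part of the public interface, so adding or reordering fields later would break callers that unpack positionally.
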

Comment on lines 150 to 155
return """
A data class of Splink dataframes containing metrics for nodes, edges and clusters.

Access dataframes via attributes:
`compute_graph_metrics.nodes` for node metrics,
`compute_graph_metrics.edges` for edge metrics and
Contributor

Just personal preference but I always find it weird having multi-line strings that mess with the indentation like this - I think that using implicit continuation and explicit newline "\n" characters reads clearer (see for example SplinkDataFrame.__repr__).

But not a big deal at all, so am fine if I'm in the minority on this

Contributor Author

This is really useful feedback! It looks icky to me too so have updated 👍
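
To illustrate the style point in this thread, here is a small sketch contrasting a triple-quoted string with implicit continuation plus explicit `"\n"` characters. Both classes and their text are hypothetical examples, not the code from the PR.

```python
class TripleQuotedRepr:  # hypothetical example class
    def __repr__(self) -> str:
        # The source indentation and the leading/trailing newlines
        # all become part of the returned string.
        return """
        First line
        Second line
        """


class ImplicitContinuationRepr:  # hypothetical example class
    def __repr__(self) -> str:
        # Adjacent string literals are concatenated at compile time,
        # so the code stays indented while the text does not.
        return (
            "First line\n"
            "Second line"
        )


print(repr(TripleQuotedRepr()))
print(repr(ImplicitContinuationRepr()))
```

With the triple-quoted version, either the output carries the code's indentation or the string has to be dedented in the source, which is the indentation awkwardness raised in the review comment.
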

@zslade zslade merged commit 47f7d20 into master Jan 29, 2024
10 checks passed
@zslade zslade deleted the metrics_dataclass branch January 29, 2024 15:20
@zslade zslade mentioned this pull request Mar 5, 2024
10 tasks