Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to _compute_cluster_metrics #1763

Merged
merged 15 commits into from
Dec 4, 2023
Merged

Fixes to _compute_cluster_metrics #1763

merged 15 commits into from
Dec 4, 2023

Conversation

zslade
Copy link
Contributor

@zslade zslade commented Nov 23, 2023

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Addressing Robin's comments on this PR for adding cluster metrics computation functionality to linker #1677

Give a brief description for the solution you have provided

  • Removed default argument threshold_match_probability: float = None
  • Updated the sql so that it works for link jobs
  • Made some other minor tweaks to align better with how things are set up in current codebase e.g. moved accessing unique ids to inside _size_density_sql() and added comments

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@zslade zslade mentioned this pull request Nov 23, 2023
9 tasks
@zslade zslade requested review from RobinL and ADBond and removed request for RobinL November 27, 2023 13:27
@zslade zslade marked this pull request as ready for review November 27, 2023 13:32
Copy link
Member

@RobinL RobinL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

Please could we have a test that it now works for the link_only case. Feel free to use the reprex from my example, or something different, whatever you prefer.

One other comment - should we be setting num edges to 0 rather than NaN in the case of a single-node cluster. I'm not sure density is defined, so a NaN type value is fine, but edges should probably be 0? (Sorry if you've had this discussion before)

I think you could use COALESCE(n_edges, 0) AS N_EDGES in the final sql statement for this

splink/cluster_metrics.py Outdated Show resolved Hide resolved
@zslade
Copy link
Contributor Author

zslade commented Nov 29, 2023

Thanks for your comment Robin!

Have set n_edges to 0 for single nodes and added a test for link_only too.

Do you think it would be worth casting the n_edges column from float to int as well? Technically n_edges should be an integer anyway but perhaps this would be better for performance as well? I think the SQL would then no longer be backend agnostic however (e.g. INT for spark and INTEGER for duckdb)

splink/cluster_metrics.py Outdated Show resolved Hide resolved
Copy link
Member

@RobinL RobinL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice work - one minor change in a comment then this should be good to go

If you hit the 'update branch' button below, then pull to your local machine, it'll mean this PR is then up to date with latest master

image

@zslade zslade requested a review from RobinL December 4, 2023 10:55
Copy link
Member

@RobinL RobinL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@zslade zslade merged commit 9075a67 into master Dec 4, 2023
10 checks passed
@zslade zslade deleted the cluster_metrics_updates branch December 4, 2023 12:41
@zslade zslade mentioned this pull request Mar 5, 2024
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants