856 profile null column #1339

Merged
sama-ds merged 16 commits into master from 856_profile_null_column
Nov 8, 2023

Conversation

sama-ds
Contributor

@sama-ds sama-ds commented Jun 16, 2023


name: '856_profile_null_column'
about: '856_profile_null_column'
title: '856_profile_null_column'
assignees: 'sama-ds'

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

#856

Give a brief description for the solution you have provided

Previously, passing a null column to profile columns caused an error. This PR checks whether a column is null and, if so, emits a warning that the chart for that column could not be produced, then returns the remaining charts.
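
A minimal sketch of the behaviour described above, not the actual Splink implementation: `profile_non_null_columns` and `build_chart` are hypothetical names, and the data is a plain dict of lists rather than a Splink table. Columns that contain only nulls are skipped with a logged warning, and charts for the remaining columns are still returned.

```python
import logging

logger = logging.getLogger(__name__)


def build_chart(column_name, values):
    # Hypothetical stand-in for real chart construction: count value frequencies.
    counts = {}
    for value in values:
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return {"title": column_name, "counts": counts}


def profile_non_null_columns(table):
    """Return one chart per column, skipping columns that contain only nulls."""
    charts = []
    for column_name, values in table.items():
        if all(value is None for value in values):
            # Warn and move on instead of raising, so other charts still build.
            logger.warning(
                "Warning: No charts produced for %s as the column only contains "
                "null values.",
                column_name,
            )
            continue
        charts.append(build_chart(column_name, values))
    return charts


table = {"test_1": ["a", "b", "a"], "test_2": [None, None, None]}
charts = profile_non_null_columns(table)
```

Here `charts` holds a single entry for `test_1`, and the warning for `test_2` goes to the module logger rather than interrupting the call.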

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks at tutorials in splink_demos (if appropriate)
  • Added tests (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

…sed and produce a warning message as opposed to an error, and permitting the remainder of the charts generating.
@sama-ds sama-ds self-assigned this Jun 16, 2023
@github-actions
Contributor

github-actions bot commented Jun 16, 2023

Test: test_2_rounds_1k_duckdb

Percentage change: -29.0%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 849 | 2022-07-12 | 18:40:05 | 1.89098 | 1.87463 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1785 | 2023-06-28 | 12:08:00 | 1.33457 | 1.3318 | (detached head) | 2649942 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | 2649942 |

Test: test_2_rounds_1k_sqlite

Percentage change: -15.8%

| | date | time | stats_mean | stats_min | commit_info_branch | commit_info_id | machine_info_cpu_brand_raw | machine_info_cpu_hz_actual_friendly | commit_hash |
|---|---|---|---|---|---|---|---|---|---|
| 851 | 2022-07-12 | 18:40:05 | 4.32179 | 4.25898 | splink3 | c334bb9 | Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz | 2.7934 GHz | c334bb9 |
| 1787 | 2023-06-28 | 12:08:00 | 3.59414 | 3.58509 | (detached head) | 2649942 | Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz | 2.5939 GHz | 2649942 |


…g_rules to provide a dataframe."

This reverts commit 13a2925.

Reverting changes that were intended for another branch
@sama-ds sama-ds changed the title 856 profile null column WIP: 856 profile null column Jun 16, 2023
@sama-ds sama-ds changed the title WIP: 856 profile null column 856 profile null column Jun 16, 2023
@ThomasHepworth
Contributor

Is this ready for review?

@sama-ds sama-ds requested a review from ThomasHepworth June 28, 2023 11:17
captured_logs = caplog.text

assert (
"Warning: No charts produced for test_2 as the column only contains null values."
Contributor

Make this a multi-line string to satisfy the linter (or add a `# noqa` mark if you're feeling particularly lazy).

Comment on lines 258 to 260
outer_spec = deepcopy(_outer_chart_spec_freq)

outer_spec["vconcat"] = inner_charts
Contributor

At the moment, we get an odd output if the user tries to profile a single empty column...
[Image: Screenshot 2023-06-29 at 13 41 14]

Perhaps it would make sense to check whether there are any charts to output, and only return a result if there are?

if inner_charts:
    ...
    return vegalite_or_json(outer_spec)

Contributor Author

I'm not sure I agree that this is an odd error message. It needs to return something so the user knows it's their error, and it's either that or something more general, e.g. "No charts produced due to missing data". I think the specificity aids debugging if you've done it by accident.

Contributor

The message isn't what's odd about it. In the special case where there's only one null column (or multiple columns that are all null), the function still attempts to produce an empty chart after printing the error message.

Contributor

Yeah, apologies, I clearly reviewed this fairly quickly last time.

My comment was more directed towards the blank chart that is being generated, as Sam L mentioned above.

Contributor

@ThomasHepworth ThomasHepworth left a comment

This is great, thanks!

Some minor comments and then I'll approve.

@ThomasHepworth
Contributor

@sama-ds it's just the blank chart creation that needs a minor adjustment and then I will approve.

@@ -59,6 +59,7 @@ def test_distance_function_comparison():
assert sum(df_pred[f"gamma_{col}"] == gamma_val) == expected_count


@mark_with_dialects_excluding()
Contributor

If you want to make this backend agnostic, you need to add some extra goodies.

See the construction of the following test:

helper = test_helpers[dialect]

For the backend-agnostic logic to work, you need to pick out a specific backend helper, grab the relevant comparison library (cl), and also supply **helper.extra_linker_args() to the linker object:

linker = helper.Linker(df, settings, **helper.extra_linker_args())

If you want to do this, I'd recommend doing it in a separate PR. Then you won't need to worry about rebasing from master.
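
To illustrate the pattern only: the sketch below uses stand-in classes, not Splink's real test helpers. `FakeLinker`, `FakeHelper`, `build_linker`, and the `connection` argument are all hypothetical; only the names `test_helpers`, `helper.Linker`, and `helper.extra_linker_args()` come from the comment above.

```python
class FakeLinker:
    # Stand-in for a backend-specific linker; it only records its arguments.
    def __init__(self, df, settings, **kwargs):
        self.df = df
        self.settings = settings
        self.kwargs = kwargs


class FakeHelper:
    # Stand-in for one backend's test helper.
    Linker = FakeLinker

    def __init__(self, extra_args):
        self._extra_args = extra_args

    def extra_linker_args(self):
        # Return a copy so callers cannot mutate the helper's own dict.
        return dict(self._extra_args)


# Hypothetical registry keyed by dialect name.
test_helpers = {
    "duckdb": FakeHelper({}),
    "sqlite": FakeHelper({"connection": "placeholder"}),
}


def build_linker(dialect, df, settings):
    # Pick the helper for this dialect, then forward its extra arguments.
    helper = test_helpers[dialect]
    return helper.Linker(df, settings, **helper.extra_linker_args())
```

The test body never names a concrete backend, so the same test can run once per dialect.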

Contributor Author

I believe I changed this in error; thanks for spotting it. This is not relevant to this PR.

@ThomasHepworth
Contributor

Could you please run the linter?

Comment on lines 195 to 196
assert (
"Warning: No charts produced for test_2 as the column only contains null values."
Contributor

This should work and satisfies the linter.

Suggested change
- assert (
- "Warning: No charts produced for test_2 as the column only contains null values."
+ assert (
+ "Warning: No charts produced for test_2 as the column only contains "
+ "null values."
+ )
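
The suggestion relies on Python joining adjacent string literals at compile time, so the split message is identical to the original one. A quick check, with variable names chosen only for illustration:

```python
# The original message on one (over-long) line.
single_line = (
    "Warning: No charts produced for test_2 as the column only contains null values."
)

# The same message split across two adjacent literals; note the trailing
# space at the end of the first literal, which keeps the words separated.
split_lines = (
    "Warning: No charts produced for test_2 as the column only contains "
    "null values."
)

messages_match = single_line == split_lines
```

Because the two forms compare equal, the assertion against the captured logs behaves the same either way.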

Contributor

@ThomasHepworth ThomasHepworth left a comment

Looks good, thanks!

@sama-ds sama-ds merged commit 4b3d365 into master Nov 8, 2023
@sama-ds sama-ds deleted the 856_profile_null_column branch November 8, 2023 14:48