Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

856 profile null column #1339

Merged
merged 16 commits into from
Nov 8, 2023
8 changes: 1 addition & 7 deletions splink/analyse_blocking.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,6 @@
from .blocking import _sql_gen_where_condition, block_using_rules_sql
from .misc import calculate_cartesian, calculate_reduction_ratio

import pandas as pd

# https://stackoverflow.com/questions/39740632/python-type-hinting-without-cyclic-imports
if TYPE_CHECKING:
from .linker import Linker
Expand Down Expand Up @@ -40,7 +38,6 @@ def cumulative_comparisons_generated_by_blocking_rules(
linker: Linker,
blocking_rules,
output_chart=True,
return_dataframe=False
):
# Deepcopy our original linker so we can safely adjust our settings.
# This is particularly important to ensure we don't overwrite our
Expand Down Expand Up @@ -141,7 +138,4 @@ def cumulative_comparisons_generated_by_blocking_rules(

linker._analyse_blocking_mode = False

if return_dataframe:
return pd.DataFrame(br_comparisons)
else :
return br_comparisons
return br_comparisons
1 change: 1 addition & 0 deletions splink/exceptions.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
import warnings


# base class for any type of custom exception
class SplinkException(Exception):
pass
Expand Down
38 changes: 25 additions & 13 deletions splink/profile_data.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,12 @@
import logging
import re
from copy import deepcopy

from .charts import load_chart_definition, vegalite_or_json
from .misc import ensure_is_list

logger = logging.getLogger(__name__)


def _group_name(cols_or_expr):
cols_or_expr = re.sub(r"[^0-9a-zA-Z_]", " ", cols_or_expr)
Expand Down Expand Up @@ -230,19 +233,28 @@ def profile_columns(linker, column_expressions, top_n=10, bottom_n=10):
percentile_rows = [
p for p in percentile_rows_all if p["group_name"] == _group_name(expression)
]
percentile_rows = _add_100_percentile_to_df_percentiles(percentile_rows)
top_n_rows = [
p for p in top_n_rows_all if p["group_name"] == _group_name(expression)
]
bottom_n_rows = [
p for p in bottom_n_rows_all if p["group_name"] == _group_name(expression)
]
# remove concat blank from expression title
expression = expression.replace(", ' '", "")
inner_chart = _get_inner_chart_spec_freq(
percentile_rows, top_n_rows, bottom_n_rows, expression
)
inner_charts.append(inner_chart)
if percentile_rows == []:
ThomasHepworth marked this conversation as resolved.
Show resolved Hide resolved
logger.warning(
"Warning: No charts produced for "
f"{expression}"
" as the column only contains null values."
)
else:
percentile_rows = _add_100_percentile_to_df_percentiles(percentile_rows)
top_n_rows = [
p for p in top_n_rows_all if p["group_name"] == _group_name(expression)
]
bottom_n_rows = [
p
for p in bottom_n_rows_all
if p["group_name"] == _group_name(expression)
]
# remove concat blank from expression title
expression = expression.replace(", ' '", "")
inner_chart = _get_inner_chart_spec_freq(
percentile_rows, top_n_rows, bottom_n_rows, expression
)
inner_charts.append(inner_chart)
outer_spec = deepcopy(_outer_chart_spec_freq)

outer_spec["vconcat"] = inner_charts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, we get an odd output if the user tries to profile a single empty column...
Screenshot 2023-06-29 at 13 41 14

Perhaps it would make sense to check if there are any charts to output and then only return a result if there is?

if inner_charts:
...
    return vegalite_or_json(outer_spec)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree with this being an odd error message- this needs to return something so the user knows it's their error, and it's that or something more general - i.e. "No charts produced due to missing data", but I think the specificity aids in de-bugging this if you've done it by accident.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message isn't what's odd about it. In the special case that there's only one null column (or multiple columns that are all null) then the function still attempts to produce an empty chart after printing the error message.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, apologies, I clearly reviewed this fairly quickly last time.

My comment was more directed towards the blank chart that is being generated, as Sam L mentioned above.

Expand Down
20 changes: 20 additions & 0 deletions tests/test_profile_data.py
Original file line number Diff line number Diff line change
Expand Up @@ -176,3 +176,23 @@ def test_profile_using_spark(df_spark):
)

assert len(generate_raw_profile_dataset([["first_name", "blank"]], linker)) == 0


def test_profile_null_columns(caplog):
ThomasHepworth marked this conversation as resolved.
Show resolved Hide resolved

df = pd.DataFrame(
[
{"unique_id": 1, "test_1": 1, "test_2": None},
]
)

linker = DuckDBLinker(df)

linker.profile_columns(["test_1", "test_2"])
ThomasHepworth marked this conversation as resolved.
Show resolved Hide resolved

captured_logs = caplog.text

assert (
"Warning: No charts produced for test_2 as the column only contains null values."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this a multi-line string to satisfy the linter (or add a #noqa mark if you're feeling particularly lazy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work and satisfies the linter.

Suggested change
assert (
"Warning: No charts produced for test_2 as the column only contains null values."
assert (
"Warning: No charts produced for test_2 as the column only contains "
"null values."
)

in captured_logs
)