[BUG] Modify `concatenate_columns` ignore_empty output #1166

Fu-Jie · 2022-09-08T15:40:56Z

PR Description

Please describe the changes proposed in the pull request:

Modify the `concatenate_columns` `ignore_empty` output (pd.NA, pd.NaT, None, np.nan => "")
Modify the `concatenate_columns` test case

This PR resolves #1164.

PR Checklist

Please ensure that you have done the following:

PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.md.

Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

codecov · 2022-09-08T16:29:18Z

Codecov Report

Merging #1166 (66ea4c0) into dev (68b8bb0) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1166      +/-   ##
==========================================
- Coverage   98.04%   98.01%   -0.03%     
==========================================
  Files          76       76              
  Lines        3524     3525       +1     
==========================================
  Hits         3455     3455              
- Misses         69       70       +1

Zeroto521

Thanks for fixing this problem.

I think that with ignore_empty=True we could simply fill the NaN values.

    df[new_column_name] = (
        df[column_names].fillna("").astype(str).agg(sep.join, axis=1)
        if ignore_empty
        else df[column_names].astype(str).agg(sep.join, axis=1)
    )

Also, it's better to fill first and then change the type.
After converting to strings, it's hard to say how many different null-like strings there are.

df.fillna("").astype(str)

df.astype(str).replace(["NaT", "nan", "<NA>"], "")

AUTHORS.md

janitor/functions/concatenate_columns.py

ericmjl

@Fu-Jie thank you for your contribution here! I'm noticing that the change will likely be a breaking change, i.e. it modifies the old expected behaviour of the function. Can we ensure that the suggested changes are toggleable via function arguments?

ericmjl · 2022-09-11T21:12:35Z

janitor/functions/concatenate_columns.py

+        df[column_names]
+        .astype(str)
+        .replace(["NaT", "nan", "<NA>"], "")
+        .agg(sep.join, axis=1)


We might want an argument here to toggle between old and new behaviours. Would you be open to doing so?
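A toggle of this kind could look roughly like the following. This is a minimal sketch, not the janitor API: the function name `concat_columns_sketch` and the flag `blank_nulls` are hypothetical, and it uses `Series.map` rather than the `astype(str).replace(...)` chain above so that it does not depend on how a given pandas version stringifies missing values.

```python
import numpy as np
import pandas as pd


def concat_columns_sketch(df, column_names, sep="-", blank_nulls=False):
    """Hypothetical stand-in for the function under review.

    blank_nulls=False keeps the old output (nulls show up as text such as "nan").
    blank_nulls=True turns real null values into empty strings before joining.
    """

    def to_text(value):
        if blank_nulls and pd.isna(value):
            return ""
        return str(value)

    # Convert column by column so integer columns are not upcast to float.
    as_text = df[column_names].apply(lambda col: col.map(to_text))
    return as_text.agg(sep.join, axis=1)


df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [1, 2, 3]})
print(concat_columns_sketch(df, ["a", "b"]).tolist())
# ['1.0-1', '2.0-2', 'nan-3']
print(concat_columns_sketch(df, ["a", "b"], blank_nulls=True).tolist())
# ['1.0-1', '2.0-2', '-3']
```

Defaulting the flag to the old behaviour would keep existing callers unaffected, which seems to be the concern raised here.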

I'm sorry, the translation software I used may not have described this clearly.
My idea for ignoring null values comes from Excel's TEXTJOIN function, whose behaviour I feel is closer to how this is used in practice:
https://support.microsoft.com/en-us/office/textjoin-function-357b449a-ec91-49d0-80c3-0e8fc845691c

@ericmjl
My understanding is that the following behaviour is more reasonable, depending on whether null values are ignored.

Pseudo-code

    split = ','
    if ignore_empty = False then 1,2,pd.NA -> 1,2,
    if ignore_empty = True then 1,2,pd.NA -> 1,2
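The pseudo-code can be written as runnable Python. This is only a sketch of the TEXTJOIN-style semantics described here; the helper name `join_row` is hypothetical and is not part of janitor.

```python
import numpy as np
import pandas as pd


def join_row(values, sep=",", ignore_empty=True):
    """Join one row of scalar values (hypothetical helper).

    ignore_empty=True  drops null values, and their separators, entirely.
    ignore_empty=False keeps an empty slot for every null value.
    """
    if ignore_empty:
        parts = [str(v) for v in values if not pd.isna(v)]
    else:
        parts = ["" if pd.isna(v) else str(v) for v in values]
    return sep.join(parts)


row = [1, 2, pd.NA]
print(join_row(row, ignore_empty=False))  # 1,2,
print(join_row(row, ignore_empty=True))   # 1,2
```

`pd.isna` treats `None`, `np.nan`, `pd.NA` and `pd.NaT` alike, so no list of null-like strings is needed.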

Fu-Jie · 2022-09-13T03:04:07Z

Also, it's better to fill first and then change the type. After converting to strings, it's hard to say how many different null-like strings there are.
df.fillna("").astype(str)

df.astype(str).replace(["NaT", "nan", "<NA>"], "")

@Zeroto521 I tested None, pd.NA, pd.NaT, and np.nan; there should be nothing else.

Fu-Jie · 2022-09-13T03:24:21Z

Thanks for fixing this problem.

I think that with ignore_empty=True we could simply fill the NaN values.

    df[new_column_name] = (
        df[column_names].fillna("").astype(str).agg(sep.join, axis=1)
        if ignore_empty
        else df[column_names].astype(str).agg(sep.join, axis=1)
    )

@Zeroto521
At first I also thought of this solution, but there is a problem:
calling fillna first and then astype cannot handle columns of type int or float.

Zeroto521 · 2022-09-13T06:30:30Z

import numpy as np
import pandas as pd


def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)


def astype_fillna(df, sep="-"):
    return df.astype(str).replace(["NaT", "nan", "<NA>"], "").agg(sep.join, axis=1)

# Normal case.
# Both work here, but `astype_fillna` would also need to replace `"None"` and `"NaN"`.
pd.DataFrame(
    {
        "a": ["string", 1, 1.5, np.nan],
        "b": ["another_string", 0, pd.NA, None],
    }
).pipe(fillna_astype)
# 0    string-another_string
# 1                      1-0
# 2                     1.5-
# 3                        -
# dtype: object

pd.DataFrame(
    {
        "a": ["string", 1, 1.5, np.nan],
        "b": ["another_string", 0, pd.NA, None],
    }
).pipe(astype_fillna)
# 0    string-another_string
# 1                      1-0
# 2                     1.5-
# 3                    -None
# dtype: object


# Special case: `astype_fillna` fails here.
# We only want to fill real NA values, not strings that merely look like them.
pd.DataFrame(
    {
        "a": ["string", np.nan, pd.NA, None],
        "b": ["another_string", "nan", "<NA>", "None"],
    }
).pipe(fillna_astype)
# 0    string-another_string
# 1                     -nan
# 2                    -<NA>
# 3                    -None
# dtype: object

pd.DataFrame(
    {
        "a": ["string", np.nan, pd.NA, None],
        "b": ["another_string", "nan", "<NA>", "None"],
    }
).pipe(astype_fillna)
# 0    string-another_string
# 1                        -  # wrong
# 2                        -  # wrong
# 3                None-None  # wrong
# dtype: object

Fu-Jie · 2022-09-13T08:03:44Z

@Zeroto521

normal case

This could be a pandas version issue (I am on 1.3.5): in my environment both None and np.nan become "nan" after astype(str). The list of NA strings should be extended.

def astype_fillna(df, sep="-"):
    return df.astype(str).replace(["NaT", "nan", "<NA>", "None"], "").agg(sep.join, axis=1)

about fillna astype float or int issue

import pandas as pd


def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)


pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    },
    dtype=pd.Float32Dtype(),
).pipe(fillna_astype)
# TypeError: <U1 cannot be converted to a FloatingDtype

special case

In my opinion, what should be handled is the actual null values in the column, not text that merely spells out a null value (such as the string "nan").

Zeroto521 · 2022-09-13T08:30:15Z

I don't think that is a good example against fillna_astype.
Once you specify the dtype of the data, some methods can no longer be used. That is what a dtype means.

import pandas as pd


def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)


pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    },
    dtype=pd.Float32Dtype(),
).pipe(fillna_astype)
# TypeError: <U1 cannot be converted to a FloatingDtype

Fu-Jie · 2022-09-13T08:35:32Z

I don't think that is a good example against fillna_astype. Once you specify the dtype of the data, some methods can no longer be used. That is what a dtype means.

import pandas as pd


def fillna_astype(df, sep="-"):
    return df.fillna("").astype(str).agg(sep.join, axis=1)


pd.DataFrame(
    {
        "b": [1, 0, pd.NA, 3],
    },
    dtype=pd.Float32Dtype(),
).pipe(fillna_astype)
# TypeError: <U1 cannot be converted to a FloatingDtype

This situation can occur with an int column returned by read_sql (parsed with the pandas defaults) or with read_excel when dtype='int' is specified.
Apart from this problem, I also think it would be better to use fillna first.

use astype("string")
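One way to read that note is sketched below, assuming pandas 1.3 or later. Converting to the nullable string dtype first turns every real null into a missing value, so `fillna("")` no longer has to write a string into a float or int column, and literal text such as "nan" is left alone. The column names and data are made up for illustration, and this is not necessarily the code that was committed.

```python
import pandas as pd


def string_then_fillna(df, sep="-"):
    # astype("string") keeps real nulls as missing values instead of
    # turning them into text, so fillna("") only touches real nulls.
    return df.astype("string").fillna("").agg(sep.join, axis=1)


df = pd.DataFrame(
    {
        # Nullable float column, as in the TypeError example above.
        "a": pd.array([1, 0, pd.NA, 3], dtype="Float32"),
        # Real null (None) next to strings that only look like nulls.
        "b": ["x", "nan", None, "<NA>"],
    }
)
result = string_then_fillna(df)
print(result.tolist())
```

In this sketch the third row, which is null in both columns, collapses to just the separator, while the literal "nan" and "<NA>" strings in the other rows are preserved.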

samukweku · 2022-09-20T12:56:38Z

Any thoughts on the progress of this PR @thatlittleboy @ericmjl @Zeroto521 ? @Fu-Jie kindly rebase so that this PR is updated to the latest

thatlittleboy

I think we're close! We just need to clear up some inconsistencies with the docstrings.

janitor/functions/concatenate_columns.py

thatlittleboy · 2022-09-20T16:35:05Z

tests/functions/test_concatenate_columns.py

@@ -28,7 +28,7 @@ def test_concatenate_columns_null_values(missingdata_df):
        new_column_name="index",
        ignore_empty=True,
    )
-    expected_values = ["1.0-1", "2.0-2", "nan-3"] * 3
+    expected_values = ["1.0-1", "2.0-2", "3"] * 3


Please also update the docstrings for the test.

thatlittleboy · 2022-09-20T16:37:40Z

tests/functions/test_concatenate_columns.py

@@ -28,7 +28,7 @@ def test_concatenate_columns_null_values(missingdata_df):
        new_column_name="index",
        ignore_empty=True,
    )
-    expected_values = ["1.0-1", "2.0-2", "nan-3"] * 3
+    expected_values = ["1.0-1", "2.0-2", "3"] * 3
    assert expected_values == df["index"].tolist()




I also think it might be worth writing a test merging a custom dataframe with a float column (NaN), a datetime column (NaT) and a string column (None/NA?).

And assert the expected output accordingly.

Then, mention this PR or the attached issue in the test docstring as well, please.
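A test along these lines might look like the sketch below. Because the final janitor behaviour is not shown in this thread, the test exercises a hypothetical stand-in, `concat_ignore_empty`, that drops null values before joining (matching the updated `expected_values` above); the column names and data are also illustrative.

```python
import numpy as np
import pandas as pd


def concat_ignore_empty(df, column_names, sep="-"):
    # Hypothetical stand-in, NOT the janitor implementation:
    # join the non-null values of each row.
    return df[column_names].apply(
        lambda row: sep.join(str(v) for v in row if not pd.isna(v)),
        axis=1,
    )


def test_concatenate_mixed_null_types():
    """Null values of several dtypes are skipped (PR #1166, issue #1164)."""
    df = pd.DataFrame(
        {
            "float_col": [1.5, np.nan, 3.0],                                  # NaN
            "date_col": pd.to_datetime(["2022-09-20", "2022-09-21", None]),  # NaT
            "str_col": ["a", None, pd.NA],                                    # None / NA
        }
    )
    result = concat_ignore_empty(df, ["float_col", "date_col", "str_col"])
    expected = ["1.5-2022-09-20 00:00:00-a", "2022-09-21 00:00:00", "3.0"]
    assert result.tolist() == expected
```

In the real test suite the stand-in would be replaced by a call to `concatenate_columns` with `ignore_empty=True`, and the expected strings adjusted to whatever formatting that function produces.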

Fu-Jie added 2 commits September 8, 2022 15:16

Modify ignore_empty output

41025d8

Modify ignore_empty output

d74a829

Fu-Jie changed the title ~~Modify concatenate_column ignore_empty output~~ [BUG] Modify concatenate_column ignore_empty output Sep 8, 2022

Fu-Jie changed the title ~~[BUG] Modify concatenate_column ignore_empty output~~ Modify concatenate_column ignore_empty output Sep 8, 2022

Fu-Jie changed the title ~~Modify concatenate_column ignore_empty output~~ [ENH] Modify concatenate_column ignore_empty output Sep 8, 2022

Fu-Jie changed the title ~~[ENH] Modify concatenate_column ignore_empty output~~ [BUG] Modify concatenate_column ignore_empty output Sep 8, 2022

solve doc format

f012b9d

Zeroto521 requested changes Sep 11, 2022

ericmjl requested changes Sep 11, 2022

Fu-Jie and others added 2 commits September 13, 2022 10:21

Update AUTHORS.md

adaaf36

Co-authored-by: 40% <[email protected]>

Update janitor/functions/concatenate_columns.py

30c1f3c

Co-authored-by: 40% <[email protected]>

Fu-Jie changed the title ~~[BUG] Modify concatenate_column ignore_empty output~~ [BUG] Modify concatenate_columns ignore_empty output Sep 13, 2022

Update concatenate_columns.py

a25d983

use astype("string")

thatlittleboy requested changes Sep 20, 2022

ericmjl and others added 4 commits November 3, 2022 09:12

Merge branch 'dev' into solve-ignore-empty

1423129

[pre-commit.ci] auto fixes from pre-commit.com hooks

08fe78c

for more information, see https://pre-commit.ci

Merge branch 'dev' into solve-ignore-empty

a29a463

Update CHANGELOG.md

66ea4c0

Zeroto521 marked this pull request as draft November 28, 2022 13:29
