[WIP] Mutate function #1229

samukweku · 2023-01-09T02:48:05Z

PR Description

Please describe the changes proposed in the pull request:

replace transform_column and transform_columns functions
transformation is with a tuple
support for multiple column transformations
support for transformations in the presence of a grouping
inspiration from dplyr's across and rdatatable's SD

At its core, it is nothing more than an elaborate for loop. All the hard work is passed on to Pandas. user should only use this if pd.assign is not enough

Examples:

   rater_id  being_rated_id  combine_id  avg_jump  avg_run  avg_swim category
0       100             200      100200         3        3         2    heats
1       100             200      100200         4        4         1    heats
2       101             200      101200         1        1         2   finals
3       101             200      101200         2        3         2   finals
4       102             201      102201         3        2         3    heats
5       103             202      103202         4        4         4   finals

# transform with a tuple
df.mutate(("avg*", "mean", "{_col}_2"), by=['combine_id', 'category'])
   avg_jump  avg_run  avg_swim  combine_id category  avg_jump_2  avg_run_2  avg_swim_2
0         3        3         2      100200    heats         3.5        3.5         1.5
1         4        4         1      100200    heats         3.5        3.5         1.5
2         1        1         2      101200   finals         1.5        2.0         2.0
3         2        3         2      101200   finals         1.5        2.0         2.0
4         3        2         3      102201    heats         3.0        2.0         3.0
5         4        4         4      103202   finals         4.0        4.0         4.0

This PR resolves #1226 and #1178.

PR Checklist

Please ensure that you have done the following:

PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.

If you're not on the contributors list, add yourself to AUTHORS.md.

Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

ericmjl · 2023-01-09T02:54:34Z

🚀 Deployed on https://deploy-preview-1229--pyjanitor.netlify.app

codecov · 2023-01-09T03:50:29Z

Codecov Report

Merging #1229 (0ae2ff1) into dev (9593074) will increase coverage by 15.30%.
The diff coverage is 100.00%.

❗ Current head 0ae2ff1 differs from pull request most recent head 5aff13b. Consider uploading reports for the commit 5aff13b to get more accurate results

@@             Coverage Diff             @@
##              dev    #1229       +/-   ##
===========================================
+ Coverage   82.51%   97.82%   +15.30%     
===========================================
  Files          78       79        +1     
  Lines        3770     3814       +44     
===========================================
+ Hits         3111     3731      +620     
+ Misses        659       83      -576

ericmjl

I'm good to go with the changes here, @samukweku. Thank you for taking the time and effort to put in this PR for the mutate function!

Can I check, if we go down this path with the mutate function, is there anything stopping us from replicating the dplyr API here? When studying the API design implemented in this PR, I noticed that it deviated from dplyr substantially, so I wanted to make sure this difference was intentional.

There are some comments that I hope we can address. Once you've addressed them, I am happy to merge!

By the way, for the future, would you be okay keeping the PR compact? There were a lot of lines changed that weren't related to the mutate function that also got snuck in. I found it a bit tedious to juggle mentally. The hardest part for me was deciding whether a line change was substantially related to mutate or not. I think I could have been more efficient with the review if all changes were strictly related to mutate only here.

ericmjl · 2023-01-28T21:53:21Z

janitor/functions/mutate.py

+    The computation should return a 1-D array like object
+    that is the same length as `df` or a scalar
+    that can be broadcasted to the same length as `df`.


@samukweku I had to do a double-take on this sentence. I think it may be improved by indicating how this knowledge affects the user's use of the mutate API. You may want to add in a "for example..." to help here. What do you think?

ericmjl · 2023-01-28T21:54:16Z

tests/functions/test_mutate.py

+
+@pytest.mark.functions
+def test_empty_args(dataframe):
+    """Test output if args is empty"""


Suggested change

"""Test output if args is empty"""

"""Ensure that dataframe is not mutated if args is empty"""

ericmjl · 2023-01-28T21:55:52Z

janitor/functions/mutate.py

+    that is the same length as `df` or a scalar
+    that can be broadcasted to the same length as `df`.
+
+    The argument should be of the form `(columns, func, names_glue)`;


Suggested change

The argument should be of the form `(columns, func, names_glue)`;

The argument should be of the form `(columns, mutation_func, names_glue)`;

I changed the name here b/c I thought it would be clearer. However, this is mostly cosmetic (IMO), and it does break the consistency in how we use func to denote a function, so it's okay to not accept this suggestion.

ericmjl · 2023-01-28T22:22:39Z

janitor/functions/mutate.py

+    [`select_columns`][janitor.functions.select.select_columns]
+    syntax for flexibility.
+    The function `func` should be a string
+    (which is dispatched to `pd.Series.transform`),


If func should be a string, I think it may be worth referencing a source of truth for what strings are valid. For example, I wouldn't be able to pass in "means" but "mean" is valid.

ericmjl · 2023-01-29T03:27:56Z

janitor/functions/collapse_levels.py

@@ -72,6 +72,8 @@ class  max_speed    type
    if not isinstance(df.columns, pd.MultiIndex):
        return df

+    df = df[:]


@samukweku what does this syntax do for us here? Should it be df = df.copy() instead?

We are mutating the columns in this case, so a slice suffices. Pandas creates a new index altogether, which does not affect the original dataframe.

samukweku · 2023-01-29T04:04:43Z

my apologies on the line changes ... I'll def keep the PR more compact in future iterations ... will def review my git skills again

…1220) * names_pattern supports named_groups * improve tests * fix format in example * add another example for named groups * add details about names_pattern and named groups * add support for dictionary for names_pattern * changelog * changelog * Update CHANGELOG.md * add test for _ as placeholder for .value * change logic for replacing _ with .value * format example output * Version added * Update pivot.py * update docs based on feedback Co-authored-by: Eric Ma <[email protected]>

* minor fix for drop_constant_columns * minor fix for get_dupes * minor fix for collapse_levels, primarily for speed * fix test fails * fix test fails * vectorise collapse_levels some more for performance sake * allow for mutation * leave collapse_levels as is * Update test_collapse_levels.py * Update test_collapse_levels.py * Update test_collapse_levels.py * restor collapse_levels to before * shortcut if all entries are strings in a list in a select call * use get_indexer_for for lists that contain only strings in select * make more robust by checking on scalar, instead of just strings * improve comments * rebase * more edits * remove extra check * shortcut for * * exclude api/utils from mkdocs * exclude api/utils from mkdocs * simplify move * avoid mutation in collapse_levels * make move more robust with select syntax * docs * fix docstring * replicate fill_empty in impute - reduce duplication * add tests * fix doctest * fix docstrings * defer copy in pivot_wider to pd.pivot * fix np.bool8 deprecation * simplify dtype column selection * fix warning msg output for change_type * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rebase * expose _select_index * add parameters * use get_index_labels where possible * add test for multiple columns * make column selection more robust for sequences * add test for set/dict selection * add test for move - both source and target are lists * exclude utils from docs * fix test fails --------- Co-authored-by: samuel.oranyeli <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

samukweku requested review from ericmjl, Zeroto521 and thatlittleboy January 9, 2023 02:48

samukweku self-assigned this Jan 9, 2023

samukweku changed the title ~~[ENH] Mutate function~~ [WIP] Mutate function Jan 9, 2023

samukweku changed the title ~~[WIP] Mutate function~~ [ENH] Mutate function Jan 10, 2023

samukweku force-pushed the samukweku/mutate branch from 9ceedd0 to 8c96eb5 Compare January 12, 2023 12:59

samukweku changed the title ~~[ENH] Mutate function~~ [WIP] Mutate function Jan 16, 2023

ericmjl requested changes Jan 29, 2023

View reviewed changes

samukweku marked this pull request as draft January 29, 2023 04:05

samukweku added 17 commits February 1, 2023 20:28

skeleton for mutate

1b9afd1

add test for dict

1479368

check for nested dictionary

3f4a07b

add tests for dict output

dedcf19

clearer test function names

a0ffed2

add tests for groupby

7d3adbc

tuple length check

7a40891

name check tuple

8f0681e

function check tuple

9ce7c95

parametrize where possible

84f5c5a

implement tuple func

4c62a18

rework logic

3ed98ef

rework error checks

7ae6f79

add examples to docs

ba7d132

changelog

eb1d3ad

typing changes

fb5e908

docs update

4904e89

samukweku and others added 18 commits February 1, 2023 20:32

fix doctest failure

40ee71a

fix doctest failures

dc032e0

fix more doctest failures

8969973

fix more doctest failures

e22cedb

reword docs

9540581

example for multiple by parameters

c86f35d

delete summarize.py

85f11d9

avoid multiindex columns

1ea1988

remove multiindex column restriction

fed4ac7

fix format in docs

a9dc8cf

restrict mutate to only tuples

4ce6931

dictionary for by

5f5b7e0

fix wrong name in docs

4633a55

remove irrelevant code

b68cf87

tuple length check

d3b09ad

changelog

199631e

samukweku force-pushed the samukweku/mutate branch from ad9c014 to 199631e Compare February 1, 2023 09:40

samukweku added 2 commits February 1, 2023 20:57

updates

0ae5762

updates

5aff13b

samukweku closed this Feb 22, 2023

samukweku deleted the samukweku/mutate branch February 22, 2023 20:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Mutate function #1229

[WIP] Mutate function #1229

samukweku commented Jan 9, 2023 •

edited

Loading

ericmjl commented Jan 9, 2023 •

edited

Loading

codecov bot commented Jan 9, 2023 •

edited

Loading

ericmjl left a comment

ericmjl Jan 28, 2023

ericmjl Jan 28, 2023

ericmjl Jan 28, 2023

ericmjl Jan 28, 2023

ericmjl Jan 29, 2023

samukweku Jan 29, 2023

samukweku commented Jan 29, 2023

	"""Test output if args is empty"""
	"""Ensure that dataframe is not mutated if args is empty"""

	The argument should be of the form `(columns, func, names_glue)`;
	The argument should be of the form `(columns, mutation_func, names_glue)`;

[WIP] Mutate function #1229

[WIP] Mutate function #1229

Conversation

samukweku commented Jan 9, 2023 • edited Loading

PR Description

PR Checklist

Automatic checks

Relevant Reviewers

ericmjl commented Jan 9, 2023 • edited Loading

codecov bot commented Jan 9, 2023 • edited Loading

Codecov Report

ericmjl left a comment

Choose a reason for hiding this comment

ericmjl Jan 28, 2023

Choose a reason for hiding this comment

ericmjl Jan 28, 2023

Choose a reason for hiding this comment

ericmjl Jan 28, 2023

Choose a reason for hiding this comment

ericmjl Jan 28, 2023

Choose a reason for hiding this comment

ericmjl Jan 29, 2023

Choose a reason for hiding this comment

samukweku Jan 29, 2023

Choose a reason for hiding this comment

samukweku commented Jan 29, 2023

samukweku commented Jan 9, 2023 •

edited

Loading

ericmjl commented Jan 9, 2023 •

edited

Loading

codecov bot commented Jan 9, 2023 •

edited

Loading