Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Mutate function #1229

Closed
wants to merge 37 commits into from
Closed

[WIP] Mutate function #1229

wants to merge 37 commits into from

Conversation

samukweku
Copy link
Collaborator

@samukweku samukweku commented Jan 9, 2023

PR Description

Please describe the changes proposed in the pull request:

  • replace transform_column and transform_columns functions
  • transformation is with a tuple
  • support for multiple column transformations
  • support for transformations in the presence of a grouping
  • inspiration from dplyr's across and rdatatable's SD

At its core, it is nothing more than an elaborate for loop. All the hard work is passed on to Pandas. user should only use this if pd.assign is not enough

Examples:

   rater_id  being_rated_id  combine_id  avg_jump  avg_run  avg_swim category
0       100             200      100200         3        3         2    heats
1       100             200      100200         4        4         1    heats
2       101             200      101200         1        1         2   finals
3       101             200      101200         2        3         2   finals
4       102             201      102201         3        2         3    heats
5       103             202      103202         4        4         4   finals

# transform with a tuple
df.mutate(("avg*", "mean", "{_col}_2"), by=['combine_id', 'category'])
   avg_jump  avg_run  avg_swim  combine_id category  avg_jump_2  avg_run_2  avg_swim_2
0         3        3         2      100200    heats         3.5        3.5         1.5
1         4        4         1      100200    heats         3.5        3.5         1.5
2         1        1         2      101200   finals         1.5        2.0         2.0
3         2        3         2      101200   finals         1.5        2.0         2.0
4         3        2         3      102201    heats         3.0        2.0         3.0
5         4        4         4      103202   finals         4.0        4.0         4.0

This PR resolves #1226 and #1178.

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.md.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl
Copy link
Member

ericmjl commented Jan 9, 2023

@codecov
Copy link

codecov bot commented Jan 9, 2023

Codecov Report

Merging #1229 (0ae2ff1) into dev (9593074) will increase coverage by 15.30%.
The diff coverage is 100.00%.

❗ Current head 0ae2ff1 differs from pull request most recent head 5aff13b. Consider uploading reports for the commit 5aff13b to get more accurate results

@@             Coverage Diff             @@
##              dev    #1229       +/-   ##
===========================================
+ Coverage   82.51%   97.82%   +15.30%     
===========================================
  Files          78       79        +1     
  Lines        3770     3814       +44     
===========================================
+ Hits         3111     3731      +620     
+ Misses        659       83      -576     

@samukweku samukweku changed the title [ENH] Mutate function [WIP] Mutate function Jan 9, 2023
@samukweku samukweku changed the title [WIP] Mutate function [ENH] Mutate function Jan 10, 2023
@samukweku samukweku changed the title [ENH] Mutate function [WIP] Mutate function Jan 16, 2023
Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good to go with the changes here, @samukweku. Thank you for taking the time and effort to put in this PR for the mutate function!

Can I check, if we go down this path with the mutate function, is there anything stopping us from replicating the dplyr API here? When studying the API design implemented in this PR, I noticed that it deviated from dplyr substantially, so I wanted to make sure this difference was intentional.

There are some comments that I hope we can address. Once you've addressed them, I am happy to merge!

By the way, for the future, would you be okay keeping the PR compact? There were a lot of lines changed that weren't related to the mutate function that also got snuck in. I found it a bit tedious to juggle mentally. The hardest part for me was deciding whether a line change was substantially related to mutate or not. I think I could have been more efficient with the review if all changes were strictly related to mutate only here.

Comment on lines +28 to +30
The computation should return a 1-D array like object
that is the same length as `df` or a scalar
that can be broadcasted to the same length as `df`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samukweku I had to do a double-take on this sentence. I think it may be improved by indicating how this knowledge affects the user's use of the mutate API. You may want to add in a "for example..." to help here. What do you think?


@pytest.mark.functions
def test_empty_args(dataframe):
"""Test output if args is empty"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Test output if args is empty"""
"""Ensure that dataframe is not mutated if args is empty"""

that is the same length as `df` or a scalar
that can be broadcasted to the same length as `df`.

The argument should be of the form `(columns, func, names_glue)`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The argument should be of the form `(columns, func, names_glue)`;
The argument should be of the form `(columns, mutation_func, names_glue)`;

I changed the name here b/c I thought it would be clearer. However, this is mostly cosmetic (IMO), and it does break the consistency in how we use func to denote a function, so it's okay to not accept this suggestion.

[`select_columns`][janitor.functions.select.select_columns]
syntax for flexibility.
The function `func` should be a string
(which is dispatched to `pd.Series.transform`),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If func should be a string, I think it may be worth referencing a source of truth for what strings are valid. For example, I wouldn't be able to pass in "means" but "mean" is valid.

@@ -72,6 +72,8 @@ class max_speed type
if not isinstance(df.columns, pd.MultiIndex):
return df

df = df[:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samukweku what does this syntax do for us here? Should it be df = df.copy() instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are mutating the columns in this case, so a slice suffices. Pandas creates a new index altogether, which does not affect the original dataframe.

@samukweku
Copy link
Collaborator Author

my apologies on the line changes ... I'll def keep the PR more compact in future iterations ... will def review my git skills again

@samukweku samukweku marked this pull request as draft January 29, 2023 04:05
samukweku and others added 18 commits February 1, 2023 20:32
…1220)

* names_pattern supports named_groups

* improve tests

* fix format in example

* add another example for named groups

* add details about names_pattern and named groups

* add support for dictionary for names_pattern

* changelog

* changelog

* Update CHANGELOG.md

* add test for _ as placeholder for .value

* change logic for replacing _ with .value

* format example output

* Version added

* Update pivot.py

* update docs based on feedback

Co-authored-by: Eric Ma <[email protected]>
* minor fix for drop_constant_columns

* minor fix for get_dupes

* minor fix for collapse_levels, primarily for speed

* fix test fails

* fix test fails

* vectorise collapse_levels some more for performance sake

* allow for mutation

* leave collapse_levels as is

* Update test_collapse_levels.py

* Update test_collapse_levels.py

* Update test_collapse_levels.py

* restor collapse_levels to before

* shortcut if all entries are strings in a list in a select call

* use get_indexer_for for lists that contain only strings in select

* make more robust by checking on scalar, instead of just strings

* improve comments

* rebase

* more edits

* remove extra check

* shortcut for *

* exclude api/utils from mkdocs

* exclude api/utils from mkdocs

* simplify move

* avoid mutation in collapse_levels

* make move more robust with select syntax

* docs

* fix docstring

* replicate fill_empty in impute - reduce duplication

* add tests

* fix doctest

* fix docstrings

* defer copy in pivot_wider to pd.pivot

* fix np.bool8 deprecation

* simplify dtype column selection

* fix warning msg output for change_type

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rebase

* expose _select_index

* add parameters

* use get_index_labels where possible

* add test for multiple columns

* make column selection more robust for sequences

* add test for set/dict selection

* add test for move - both source and target are lists

* exclude utils from docs

* fix test fails

---------

Co-authored-by: samuel.oranyeli <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@samukweku samukweku closed this Feb 22, 2023
@samukweku samukweku deleted the samukweku/mutate branch February 22, 2023 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mutate function
2 participants