Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test polars support #826

Merged
merged 26 commits into from
Nov 22, 2023
Merged

Test polars support #826

merged 26 commits into from
Nov 22, 2023

Conversation

TheooJ
Copy link
Contributor

@TheooJ TheooJ commented Nov 14, 2023

What is this PR addressing?
Testing polars inputs (closing issue #825)

Method

  • I didn't duplicate tests but rather used @pytest.mark.parametrize with the two modules.
  • If Polars isn’t available, functions are only testing for Pandas dataframes as discussed in TST polars support for deduplicate #785
  • When both modules have an assert_frame_equal testing function, I’ve created lists of tuples to store both methods and call the needed one in the testing functions.

Example

from pandas.testing import assert_frame_equal
from skrub._dataframe._polars import POLARS_SETUP

MODULES = [pd]  # This is used when we don't need assert_frame_equal
ASSERT_TUPLES = [(pd, assert_frame_equal)]

if POLARS_SETUP:
    import polars as pl
    from polars.testing import assert_frame_equal as assert_frame_equal_pl

    MODULES.append(pl)
    ASSERT_TUPLES.append((pl, assert_frame_equal_pl))

# This was modified to either return a pd or pl DataFrame
def get_mixed_datetime_format(px, as_array=False):
    """px should either be the Pandas or Polars module."""
    df = px.DataFrame(
        dict(
            a=[
                "2022-10-15",
                "2021-12-25",
                "2020-05-18",
                "2019-10-15 12:00:00",
            ]
        )
    )
    if as_array:
        return df.to_numpy()
    return df

# When available, this is testing both pd and pl modules
@pytest.mark.parametrize("px, assert_frame_equal_", ASSERT_TUPLES)
def test_indempotency(px, assert_frame_equal_):
    df = get_mixed_datetime_format(px)
    df_dt = to_datetime(df)
    df_dt_2 = to_datetime(df_dt)
    assert_frame_equal_(df_dt, df_dt_2)

    X_trans = DatetimeEncoder().fit_transform(df)
    X_trans_2 = DatetimeEncoder().fit_transform(df_dt)
    assert_array_equal(X_trans, X_trans_2)

Comments/Discussion

  • Some test functions specified that they returned → None, some didn’t. I removed all occurrences of these in the tests I’ve covered
  • Some functions created dataframes using pd.DataFrame([[1, 3], [2, 4]], columns=['a', 'b']). This method isn’t supported by Polars, so I’ve replaced it by pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) when applicable
  • Some dataframes that are being tested used pd.NA. I used pl.Null for the Polars equivalent
  • When the test failed in polars, i put
    def test_working_for_pd_but_not_pl(px):
        if px is pl:
            pytest.xfail(reason="Some useful message for debugging later")
    
  • The errors most commonly raised are that pl.DataFrames don’t have a reset_index() method, and that pl.DataFrame.drop() doesn't have an axis argument

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, @TheooJ, on this PR! Here are some first remarks :)

skrub/tests/test_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/tests/test_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/tests/test_deduplicate.py Outdated Show resolved Hide resolved
skrub/tests/test_gap_encoder.py Outdated Show resolved Hide resolved
skrub/tests/test_fuzzy_join.py Show resolved Hide resolved
skrub/tests/test_interpolation_join.py Show resolved Hide resolved
skrub/tests/test_interpolation_join.py Outdated Show resolved Hide resolved
skrub/tests/test_similarity_encoder.py Outdated Show resolved Hide resolved
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Théo, I think we're almost there.

Could you replace:

px.__name__ == "polars"

by a new function in skrub._dataframe._namespace.py:

def is_namespace_polars(px):
    if "polars" not in sys.modules:
        return False
    
    import polars as pl

    return px is pl

So that in your tests you would simply call:

if is_namespace_polars(px):

Let's also define a similar (but simpler) function for Pandas:

def is_namespace_pandas(px):
    return px is pd

@jeromedockes
Copy link
Member

out of curiosity in is_namespace_polars why is it necessary to do those checks rather than using __name__?

@Vincent-Maladiere
Copy link
Member

Vincent-Maladiere commented Nov 21, 2023

To stay consistent with what we already have (e.g., is_polars), mostly? Kind of a nitpick, I agree. Also, this is something we might reuse. But let's revert if you prefer, I don't have a strong opinion on this.

@jeromedockes
Copy link
Member

my question was not about using a named function vs inlining it, but rather about checking __name__ vs checking if polars is in sys.modules and then importing it and comparing with is

I was just interested to know if there is some situation you have encountered where comparing __name__ == 'polars' would not yield the expected result

@Vincent-Maladiere
Copy link
Member

I saw similar patterns in scikit-learn to avoid weird behaviors like duck-typing. Happy to reverse if this brings unnecessary complexity, though.

@jeromedockes
Copy link
Member

jeromedockes commented Nov 21, 2023 via email

@Vincent-Maladiere
Copy link
Member

I would be in favor of putting it in a private module like testing_utils and starting with the simplest implementation

Let's apply this advice using px.__name__ == "polars", @TheooJ, then we can merge.

@Vincent-Maladiere
Copy link
Member

@jeromedockes, for the sake of argument, this is also used in this exact form in the array API. The get_namespace function is used by scikit-learn. This is initially what motivated this implementation in skrub. You can ask them why they deemed this necessary.

@TheooJ
Copy link
Contributor Author

TheooJ commented Nov 21, 2023

I'm rewriting the function as

def is_namespace_polars(px):
    return px.__name__ == "polars"

in skrub/_utils.py

@Vincent-Maladiere
Copy link
Member

Please put it in a new file skrub/_dataframe/_test_utils.py

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @TheooJ !

@jeromedockes
Copy link
Member

maybe testing_utils.py so it doesn't get picked up by the test discovery rules

@jeromedockes
Copy link
Member

this is also used in this exact form in the array API

AFAICT this is also for checking a dataframe rather than a module directly?

@TheooJ
Copy link
Contributor Author

TheooJ commented Nov 21, 2023

maybe testing_utils.py so it doesn't get picked up by the test discovery rules

I think it discovers test_*.py or *_test.py files but not _test_*.py files

@jeromedockes
Copy link
Member

jeromedockes commented Nov 21, 2023 via email

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot @TheooJ ! it's super useful. and all those xfails show us what still needs to be done for polars support :)

skrub/_dataframe/_test_utils.py Outdated Show resolved Hide resolved
from polars.testing import assert_frame_equal as assert_frame_equal_pl

MODULES.append(pl)
ASSERT_TUPLES.append((pl, assert_frame_equal_pl))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's leave it for another pr but I think we should reorganize this a bit to put this kind of setup either in the _test_utils.py or in conftest.py and avoid duplicating it for each test module

@jeromedockes
Copy link
Member

@TheooJ sorry about that but we now have a small git merge conflict. could you fix that and then I will merge it

@jeromedockes
Copy link
Member

thanks a lot!

@jeromedockes jeromedockes merged commit b787db9 into skrub-data:main Nov 22, 2023
18 of 19 checks passed
@Vincent-Maladiere
Copy link
Member

AFAICT this is also for checking a dataframe rather than a module directly?

To check a module directly, they use sys.module as well:
https://github.com/data-apis/array-api-compat/blob/51d7832f048f58d47237d35fb1e7561d8ebfe259/array_api_compat/common/_helpers.py#L119

@jeromedockes jeromedockes mentioned this pull request Nov 23, 2023
@TheooJ TheooJ deleted the test_polars branch December 20, 2023 23:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants