Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC add example for Cramer V for column_associations #1186

Merged
merged 8 commits into from
Dec 7, 2024

Conversation

reshamas
Copy link
Contributor

@reshamas reshamas commented Dec 7, 2024

add example for Cramer V

Will be rendered here in documentation:
https://skrub-data.org/stable/reference/generated/skrub.column_associations.html

Towards #1181

@jeromedockes
Copy link
Member

Hi @reshamas thanks for opening this! such examples directly in the reference documentation are super useful to understand how to use a library.

Your example is great! I think >>> df.shape() needs to be replaced with >>> df.shape to fix the error. to run the doctests on your machine, which can be easier than waiting for the CI to run you can use the command pytest skrub/_column_associations.py

@jeromedockes
Copy link
Member

Also I see you add the example section to the _compute_cramer function. This one is a "private" helper function used only within the _column_associations module itself. you can tell because its name starts with an underscore, and it does not have an entry in the skrub documentation. Could we move the Examples section that you wrote to the docstring of the column_associations function here? thanks!

@GaelVaroquaux
Copy link
Member

Super good idea to have done this as a docstring example

>>> df = pd.DataFrame({f"c_{i}": rng.random(size=20)*10 for i in range(5)})
>>> df["c_str"] = [f"val {i}" for i in range(df.shape[0])]
>>> df.shape
>>> df.head()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same for this line and the one past associations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added output for associations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the output is too wide (> 80 chars):

       left_column_name  left_column_idx right_column_name  right_column_idx  cramer_v
    0               c_3                3             c_str                 5    0.8215
    1               c_1                1               c_4                 4    0.8215
    2               c_0                0               c_1                 1    0.8215
    3               c_2                2             c_str                 5    0.7551
    4               c_0                0             c_str                 5    0.7551
    5               c_0                0               c_3                 3    0.7551
    6               c_1                1               c_3                 3    0.6837
    7               c_0                0               c_4                 4    0.6837
    8               c_4                4             c_str                 5    0.6837
    9               c_3                3               c_4                 4    0.6053
    10              c_2                2               c_3                 3    0.6053
    11              c_1                1             c_str                 5    0.6053
    12              c_0                0               c_2                 2    0.6053
    13              c_2                2               c_4                 4    0.5169
    14              c_1                1               c_2                 2    0.4122

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's something to do with different configurations for different pandas versions, tests fail on the min reqs configuration 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding >>> pd.set_option('expand_frame_repr', False) seems to print it properly, however now there's a problem with the ordering of the rows, which for whatever reason changes between pandas 1.5 and pandas 2.2.

Not sure how we can get around this, @jeromedockes any idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I think this should be ok for the display:

    >>> pd.set_option('display.width', 200)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.precision', 4)

and then at the end of the example

    >>> pd.reset_option('display.width')
    >>> pd.reset_option('display.max_columns')
    >>> pd.reset_option('display.precision')

@rcap107
Copy link
Contributor

rcap107 commented Dec 7, 2024

Hi @reshamas, thanks for contributing! I added a few comments that should fix the tests.

@rcap107
Copy link
Contributor

rcap107 commented Dec 7, 2024

You might also want to run pytest skrub/tests/test_docstrings.py to check everything runs properly.

@jeromedockes
Copy link
Member

sorry for the delay I'll look at it now

@jeromedockes
Copy link
Member

I think we should have a little function in the datasets module like toy_dataframe() that we can use in the reference examples so simplify the setup code and so we get the same every time

also we probably can set pandas display options before running the doctests and perhaps reset them before and after running each module's doctests

I don't want this stuff to block this PR though, because it really helps clarify what the output of the column_associations contains. I suggest we skip this one doctest output, as we do in a few other places where ensuring the output is exactly consistent would make the example too complicated, defeating its documentation purpose.

skrub/_column_associations.py Outdated Show resolved Hide resolved
skrub/_column_associations.py Outdated Show resolved Hide resolved
skrub/_column_associations.py Outdated Show resolved Hide resolved
Comment on lines +47 to +51
Cramér's V is a measure of association between two nominal variables,
giving a value between 0 and +1 (inclusive).

* `Cramer's V <https://en.wikipedia.org/wiki/Cramér%27s_V>`_

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding the explanation and the link that's super useful!

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!! thanks a ton @reshamas . this function was moved to the public api in the last release as we realized it can be very useful on its own ... but it was lacking a proper documentation and examples! battling with pandas output formatting details across multiple versions of pandas and python is not fun but thanks a lot for pushing through :)

@jeromedockes jeromedockes merged commit da7dd9c into skrub-data:main Dec 7, 2024
25 checks passed
@reshamas
Copy link
Contributor Author

reshamas commented Dec 7, 2024

Thank you @jeromedockes and @rcap107. It was a very good learning experience.

@rcap107
Copy link
Contributor

rcap107 commented Dec 9, 2024

Thank you @jeromedockes and @rcap107. It was a very good learning experience.

Likewise, thanks for contributing!

@jeromedockes
Copy link
Member

yep, thanks a lot! I know you had also started working on adding Pearson correlations to the output, so if you have any questions about that don't hesitate to reach us here on github or on the skrub discord server: https://discord.gg/ABaPnm7fDC

TheooJ pushed a commit that referenced this pull request Dec 11, 2024
* bump version after 0.4.0 (#1162)

* DOC Correct typo in TableReport docstring (#1168)

* Add codespell support (#1126)

* add python 3.13 (#1170)

* add sections for the next release's changelogs (#1183)

* Add a verbosity parameter to TableReport to control the printing of messages in stdout when opening summary report (#1182)

* add verbose parameter to print_display to  toggle on or off printing of progress messages when generating table report (#1188)

* DOC add example for Cramer V for column_associations (#1186)

* ENH adding alias "regression" and "classification" (#1180)

* Bump codecov/codecov-action from 5.0.7 to 5.1.1 in the actions group (#1191)

* remove use of matplotlib.rc_context (#1172)

* MAINT adapt for scikit-learn 1.6 (#1135)

* Revert "MAINT adapt for scikit-learn 1.6 (#1135)" (#1194)

This reverts commit 18af508.

* MAINT compatibility sklearn 1.6.0 (#1169)

* MAINT fix nightly builds (#1193)

* Updating contributing doc to add more detail on how to start  (#1185)

* FIX better display of object columns with multi-line repr in tablereport (#1196)

* update changelog

---------

Co-authored-by: Matt J. <[email protected]>
Co-authored-by: Yaroslav Halchenko <[email protected]>
Co-authored-by: priscilla baah <[email protected]>
Co-authored-by: Reshama Shaikh <[email protected]>
Co-authored-by: mrastgoo <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Riccardo Cappuzzo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

4 participants