DOC add example for Cramer V for column_associations #1186

Merged
merged 8 commits into from
Dec 7, 2024
Changes from 6 commits
60 changes: 55 additions & 5 deletions skrub/_column_associations.py
@@ -18,14 +18,14 @@ def column_associations(df):

The result is returned as a dataframe with columns:

- ['left_column_name', 'left_column_idx', 'right_column_name',
- 'right_column_idx', 'cramer_v']
+ `['left_column_name', 'left_column_idx', 'right_column_name',
+ 'right_column_idx', 'cramer_v']`
jeromedockes marked this conversation as resolved.

As the function is commutative, each pair of columns appears only once
- (either col_1, col_2 or col_2, col_1 but not both). The results are sorted
+ (either `col_1`, `col_2` or `col_2`, `col_1` but not both). The results are sorted
jeromedockes marked this conversation as resolved.
from most associated to least associated.

- To compute the Cramer V statistic, all columns are discretized. Numeric
+ To compute the Cramer's V statistic, all columns are discretized. Numeric
columns are binned with 10 bins. For categorical columns, only the 10 most
frequent categories are considered. In both cases, nulls are treated as a
separate category, ie a separate row in the contingency table. Thus
@@ -41,6 +41,55 @@ def column_associations(df):
-------
dataframe
The computed associations.

Notes
-----
Cramér's V is a measure of association between two nominal variables,
giving a value between 0 and +1 (inclusive).

* `Cramer's V <https://en.wikipedia.org/wiki/Cramér%27s_V>`_

Comment on lines +48 to +52
Member

Thanks for adding the explanation and the link, that's super useful!
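
For readers who want to see the statistic from the Notes section spelled out, here is a minimal sketch of Cramér's V for a single 2-D contingency table. The helper name `cramers_v` is hypothetical and this is not skrub's implementation. It assumes the table has no all-zero rows or columns and applies the textbook formula, the square root of chi-squared divided by `n * (min(rows, cols) - 1)`.

```python
import numpy as np


def cramers_v(table):
    """Cramér's V for a 2-D table of counts (hypothetical helper, not skrub's code)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence: outer product of the marginals / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    # Pearson chi-squared statistic; cells with zero expected count contribute 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (table - expected) ** 2 / expected, 0.0)
    chi2 = terms.sum()
    # Normalise by n * (min(rows, cols) - 1) so the result lies in [0, 1].
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


perfect = cramers_v([[10, 0], [0, 10]])    # each row determines the column
independent = cramers_v([[5, 5], [5, 5]])  # rows carry no information
```

The two toy tables sit at the two ends of the range mentioned in the Notes: a perfectly associated pair and a pair with no association.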

Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> import skrub
>>> pd.set_option('display.width', 200)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.precision', 4)
>>> rng = np.random.default_rng(33)
>>> df = pd.DataFrame({f"c_{i}": rng.random(size=20)*10 for i in range(5)})
>>> df["c_str"] = [f"val {i}" for i in range(df.shape[0])]
>>> df.shape
reshamas marked this conversation as resolved.
(20, 6)
>>> df.head()
Contributor

The same for this line and the one past associations

Contributor Author

added output for associations

Contributor Author

It looks like the output is too wide (> 80 chars):

       left_column_name  left_column_idx right_column_name  right_column_idx  cramer_v
    0               c_3                3             c_str                 5    0.8215
    1               c_1                1               c_4                 4    0.8215
    2               c_0                0               c_1                 1    0.8215
    3               c_2                2             c_str                 5    0.7551
    4               c_0                0             c_str                 5    0.7551
    5               c_0                0               c_3                 3    0.7551
    6               c_1                1               c_3                 3    0.6837
    7               c_0                0               c_4                 4    0.6837
    8               c_4                4             c_str                 5    0.6837
    9               c_3                3               c_4                 4    0.6053
    10              c_2                2               c_3                 3    0.6053
    11              c_1                1             c_str                 5    0.6053
    12              c_0                0               c_2                 2    0.6053
    13              c_2                2               c_4                 4    0.5169
    14              c_1                1               c_2                 2    0.4122

Contributor

It has something to do with different configurations for different pandas versions: the tests fail on the min reqs configuration 🤔

Contributor

Adding `>>> pd.set_option('expand_frame_repr', False)` seems to print it properly. However, the ordering of the rows is now a problem: for some reason it changes between pandas 1.5 and pandas 2.2.

Not sure how we can get around this, @jeromedockes any idea?
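
A note on the ordering question: several rows in the output above share the same `cramer_v` value (three rows at 0.8215, for example), so a sort on that column alone leaves the relative order of tied rows to the sort implementation. Whether that explains the difference between pandas versions is not confirmed in this thread. The sketch below only shows one possible way to make the order deterministic, by breaking ties on the column indices. The toy frame and the choice of tie-breakers are assumptions, not what the PR does, and the approach only helps if the tied values are exactly equal as floats.

```python
import pandas as pd

# Toy frame with tied scores, reusing column names from the output above.
assoc = pd.DataFrame(
    {
        "left_column_idx": [3, 0, 1, 2],
        "right_column_idx": [5, 1, 4, 5],
        "cramer_v": [0.8215, 0.8215, 0.8215, 0.7551],
    }
)

# Sort by score (descending), then break ties on the column indices so the
# row order is fully determined by the data.
ordered = assoc.sort_values(
    ["cramer_v", "left_column_idx", "right_column_idx"],
    ascending=[False, True, True],
).reset_index(drop=True)
```

With explicit tie-breakers, the three rows at 0.8215 come out ordered by `left_column_idx`, whatever order they started in.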

Member

so I think this should be ok for the display:

    >>> pd.set_option('display.width', 200)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.precision', 4)

and then at the end of the example

    >>> pd.reset_option('display.width')
    >>> pd.reset_option('display.max_columns')
    >>> pd.reset_option('display.precision')
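
As an aside, pandas also provides `pd.option_context`, which applies display options inside a `with` block and restores the previous values on exit. The sketch below uses it as an alternative to the `set_option` / `reset_option` pair suggested above. It is not what the PR uses, and the small frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.23456789, 2.3456789]})

before_precision = pd.get_option("display.precision")

# Options passed here apply only inside the block and are restored on exit.
with pd.option_context(
    "display.width", 200,
    "display.max_columns", 10,
    "display.precision", 4,
):
    inside_precision = pd.get_option("display.precision")
    rendered = repr(df)

after_precision = pd.get_option("display.precision")
```

The trade-off is that everything depending on the options has to sit inside the block, which may read less naturally in a doctest than the explicit set and reset calls.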

      c_0     c_1     c_2     c_3     c_4  c_str
0  4.4364  4.0114  6.9271  7.0970  4.8913  val 0
1  5.6849  0.7192  7.6430  4.6441  2.5116  val 1
2  9.0810  9.4011  1.9257  5.7429  6.2358  val 2
3  2.5425  2.9678  9.7801  9.9879  6.0709  val 3
4  5.8878  9.3223  5.3840  7.2006  2.1494  val 4
>>> associations = skrub.column_associations(df)
>>> associations
jeromedockes marked this conversation as resolved.
   left_column_name  left_column_idx right_column_name  right_column_idx  cramer_v
0               c_3                3             c_str                 5    0.8215
1               c_1                1               c_4                 4    0.8215
2               c_0                0               c_1                 1    0.8215
3               c_2                2             c_str                 5    0.7551
4               c_0                0             c_str                 5    0.7551
5               c_0                0               c_3                 3    0.7551
6               c_1                1               c_3                 3    0.6837
7               c_0                0               c_4                 4    0.6837
8               c_4                4             c_str                 5    0.6837
9               c_3                3               c_4                 4    0.6053
10              c_2                2               c_3                 3    0.6053
11              c_1                1             c_str                 5    0.6053
12              c_0                0               c_2                 2    0.6053
13              c_2                2               c_4                 4    0.5169
14              c_1                1               c_2                 2    0.4122
>>> pd.reset_option('display.width')
>>> pd.reset_option('display.max_columns')
>>> pd.reset_option('display.precision')
"""
return _stack_symmetric_associations(_cramer_v_matrix(df), df)
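
For reference, the discretization described in the docstring above (10 bins for numeric columns, the 10 most frequent categories for categorical ones, nulls as their own category) could be sketched roughly as follows. The helper `discretize` is hypothetical and not skrub's code. In particular, lumping the less frequent categories into an `"other"` label and using equal-width bins are assumptions, since the diff does not show how skrub handles either.

```python
import numpy as np
import pandas as pd


def discretize(column, n_bins=10, n_top=10):
    """Map a column to string labels (hypothetical helper, not skrub's code)."""
    if pd.api.types.is_numeric_dtype(column):
        # Assumption: equal-width bins over the observed range.
        binned = pd.cut(column, bins=n_bins).astype(object)
    else:
        # Assumption: categories outside the top n_top are lumped together.
        top = column.value_counts().index[:n_top]
        binned = column.where(column.isin(top), "other").astype(object)
    # Nulls get their own label, so they form a separate row or column
    # in the contingency table.
    return binned.where(column.notna(), "null").astype(str)


df = pd.DataFrame(
    {
        "num": [0.0, 1.0, 2.0, np.nan, 9.0, 10.0],
        "cat": ["a", "b", "a", None, "b", "a"],
    }
)
table = pd.crosstab(discretize(df["num"]), discretize(df["cat"]))
```

In this toy frame the fourth row is null in both columns, so it is counted in the `("null", "null")` cell of the table rather than being dropped.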

@@ -173,13 +222,14 @@ def _contingency_table(encoded):


def _compute_cramer(table, n_samples):
- """Compute the Cramer V statistic given a contingency table.
+ """Compute the Cramer's V statistic given a contingency table.

The input is the table computed by ``_contingency_table`` with shape
(n cols, n cols, n bins, n bins).

This returns the symmetric matrix with shape (n cols, n cols) where entry
i, j contains the statistic for column i x column j.

jeromedockes marked this conversation as resolved.
"""
marginal_0 = table.sum(axis=-2)
marginal_1 = table.sum(axis=-1)
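
The rest of `_compute_cramer` is collapsed in this view, so the following is only an illustrative sketch of how a stacked table of shape (n cols, n cols, n bins, n bins) and its two marginals might be used. The toy data, the loop that builds `table`, and the `expected` computation are all assumptions, not skrub's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cols, n_bins, n_samples = 3, 4, 50

# Integer-encode each column into n_bins categories (toy data).
encoded = rng.integers(0, n_bins, size=(n_samples, n_cols))

# table[i, j, a, b] counts rows where column i is in bin a and column j in bin b.
table = np.zeros((n_cols, n_cols, n_bins, n_bins))
for i in range(n_cols):
    for j in range(n_cols):
        np.add.at(table[i, j], (encoded[:, i], encoded[:, j]), 1)

marginal_0 = table.sum(axis=-2)  # sums over a: bin counts of column j
marginal_1 = table.sum(axis=-1)  # sums over b: bin counts of column i

# Expected counts under independence, broadcast over every (i, j) pair.
expected = marginal_1[..., :, None] * marginal_0[..., None, :] / n_samples
```

Summing over the last two axes keeps the leading (n cols, n cols) axes intact, which is what lets a single vectorized expression cover every pair of columns at once.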