DOC add example for Cramer V for column_associations #1186
Changes from all commits
e61b8d7
262fde6
dd43ba8
26b073e
f28299b
a28677b
8826765
b48168d
@@ -18,14 +18,15 @@ def column_associations(df):

 The result is returned as a dataframe with columns:

-['left_column_name', 'left_column_idx', 'right_column_name',
-'right_column_idx', 'cramer_v']
+``['left_column_name', 'left_column_idx', 'right_column_name',
+'right_column_idx', 'cramer_v']``

 As the function is commutative, each pair of columns appears only once
-(either col_1, col_2 or col_2, col_1 but not both). The results are sorted
+(either ``col_1``, ``col_2`` or ``col_2``, ``col_1`` but not both).
+The results are sorted
 from most associated to least associated.

-To compute the Cramer V statistic, all columns are discretized. Numeric
+To compute the Cramer's V statistic, all columns are discretized. Numeric
 columns are binned with 10 bins. For categorical columns, only the 10 most
 frequent categories are considered. In both cases, nulls are treated as a
 separate category, ie a separate row in the contingency table. Thus
@@ -41,6 +42,55 @@ def column_associations(df):
 -------
 dataframe
 The computed associations.
+
+Notes
+-----
+Cramér's V is a measure of association between two nominal variables,
+giving a value between 0 and +1 (inclusive).
+
+* `Cramer's V <https://en.wikipedia.org/wiki/Cramér%27s_V>`_
+
+Examples
+--------
+>>> import numpy as np
+>>> import pandas as pd
+>>> import skrub
+>>> pd.set_option('display.width', 200)
+>>> pd.set_option('display.max_columns', 10)
+>>> pd.set_option('display.precision', 4)
+>>> rng = np.random.default_rng(33)
+>>> df = pd.DataFrame({f"c_{i}": rng.random(size=20)*10 for i in range(5)})
+>>> df["c_str"] = [f"val {i}" for i in range(df.shape[0])]
+>>> df.shape
reshamas marked this conversation as resolved.
+(20, 6)
+>>> df.head()
The same for this line and the one past

added output for associations

It looks like the output is too wide (> 80 chars):

It's something to do with different configurations for different pandas versions, tests fail on the min reqs configuration 🤔

Adding Not sure how we can get around this, @jeromedockes any idea?

so I think this should be ok for the display:
and then at the end of the example
+c_0 c_1 c_2 c_3 c_4 c_str
+0 4.4364 4.0114 6.9271 7.0970 4.8913 val 0
+1 5.6849 0.7192 7.6430 4.6441 2.5116 val 1
+2 9.0810 9.4011 1.9257 5.7429 6.2358 val 2
+3 2.5425 2.9678 9.7801 9.9879 6.0709 val 3
+4 5.8878 9.3223 5.3840 7.2006 2.1494 val 4
+>>> associations = skrub.column_associations(df)
+>>> associations # doctest: +SKIP
+left_column_name left_column_idx right_column_name right_column_idx cramer_v
+0 c_3 3 c_str 5 0.8215
+1 c_1 1 c_4 4 0.8215
+2 c_0 0 c_1 1 0.8215
+3 c_2 2 c_str 5 0.7551
+4 c_0 0 c_str 5 0.7551
+5 c_0 0 c_3 3 0.7551
+6 c_1 1 c_3 3 0.6837
+7 c_0 0 c_4 4 0.6837
+8 c_4 4 c_str 5 0.6837
+9 c_3 3 c_4 4 0.6053
+10 c_2 2 c_3 3 0.6053
+11 c_1 1 c_str 5 0.6053
+12 c_0 0 c_2 2 0.6053
+13 c_2 2 c_4 4 0.5169
+14 c_1 1 c_2 2 0.4122
+>>> pd.reset_option('display.width')
+>>> pd.reset_option('display.max_columns')
+>>> pd.reset_option('display.precision')
 """
 return _stack_symmetric_associations(_cramer_v_matrix(df), df)

@@ -173,7 +223,7 @@ def _contingency_table(encoded):


 def _compute_cramer(table, n_samples):
-"""Compute the Cramer V statistic given a contingency table.
+"""Compute the Cramer's V statistic given a contingency table.

 The input is the table computed by ``_contingency_table`` with shape
 (n cols, n cols, n bins, n bins).
thanks for adding the explanation and the link that's super useful!
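
For readers following the Notes section, here is a minimal sketch of what Cramér's V measures for a single pair of discretized columns, using the textbook formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))). This is not skrub's implementation: `_compute_cramer` works on a 4-D table covering all column pairs at once, while the hypothetical `cramers_v` helper below handles one 2-D contingency table. Returning 0.0 for degenerate tables is an assumption made for this illustration only.

```python
import numpy as np


def cramers_v(table):
    """Cramér's V for one 2-D contingency table of counts (illustrative helper)."""
    table = np.asarray(table, dtype=float)
    # Drop empty rows and columns so that expected counts are never zero.
    table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
    n_samples = table.sum()
    min_dim = min(table.shape) - 1
    # Assumption for this sketch: a table with no samples, or with a single
    # non-empty row or column, is reported as "no association".
    if n_samples == 0 or min_dim == 0:
        return 0.0
    # Expected counts under independence: outer product of the margins / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n_samples
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n_samples * min_dim)))


perfect = cramers_v([[10, 0], [0, 10]])    # each row maps to exactly one column
independent = cramers_v([[5, 5], [5, 5]])  # rows carry no information about columns
print(perfect, independent)
```

The two calls illustrate the bounds stated in the Notes section: a perfectly associated table sits at the top of the range and an independent one at the bottom.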