Today, all categorical columns are included in the calculation of mutual information scores. This is normally fine, except when the number of unique categories in a categorical column grows very large. This was documented by @rpeck in JupyterHub's "mutual information experiments". Even after sampling down to 100,000 rows, this remains a pertinent issue.
Currently, Adjusted Mutual Information (AMI) is used when calculating the mutual information score in `_get_dependence_dict`. This works well for smaller datasets, but larger ones take considerably longer to score with it. We currently don't use Normalized Mutual Information (NMI), which scales better on larger datasets.
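As a rough illustration, the two scorers can be compared head-to-head via scikit-learn. This is a synthetic sketch: the 100,000-row sample, the 15,000-category column, and the column contents are assumptions for demonstration, not values taken from the library.

```python
# Sketch: timing AMI vs. NMI on a high-cardinality categorical column.
# The data below is synthetic and only illustrates the shape of the problem.
import time

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
n_rows = 100_000

# One "offending" column with well over 10,000 unique categories,
# and one ordinary low-cardinality column to score it against.
high_card = rng.integers(0, 15_000, size=n_rows)
low_card = rng.integers(0, 10, size=n_rows)

for name, score_fn in [
    ("adjusted", adjusted_mutual_info_score),
    ("normalized", normalized_mutual_info_score),
]:
    start = time.perf_counter()
    score = score_fn(high_card, low_card)
    elapsed = time.perf_counter() - start
    print(f"{name}: score={score:.4f}, time={elapsed:.3f}s")
```

The gap comes from AMI's chance correction, which requires computing the expected mutual information over the full contingency table; NMI skips that step, which is why it is attractive for wide or high-cardinality inputs.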
This issue is for investigating a better way to deal with datasets that have hundreds of thousands of observations and one or more categorical columns with over 10,000 unique categories.
Points to cover include:
- Developing a cutoff point for identifying such offending datasets, i.e. how many unique categories (or how many unique categories per x number of columns) before we run into a problem? A heuristic sketch follows this list.
- Investigating Normalized Mutual Information as the default scoring method for datasets identified as offending.
- Considering Normalized Mutual Information as a wholesale default replacement for Adjusted Mutual Information.
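A minimal sketch of what the first point could look like in practice, assuming a hypothetical `pick_mutual_info_method` helper and a placeholder `MAX_UNIQUE_CATEGORIES` threshold; finding the real cutoff empirically is exactly what this issue is for.

```python
# Sketch of a possible cutoff heuristic. MAX_UNIQUE_CATEGORIES is a
# placeholder taken from the 10,000-category figure in the issue text;
# the actual threshold should come from benchmarking.
import pandas as pd

MAX_UNIQUE_CATEGORIES = 10_000


def pick_mutual_info_method(df: pd.DataFrame, categorical_cols: list[str]) -> str:
    """Return 'adjusted' for small problems, 'normalized' when any
    categorical column exceeds the unique-category cutoff."""
    for col in categorical_cols:
        if df[col].nunique() > MAX_UNIQUE_CATEGORIES:
            return "normalized"
    return "adjusted"
```

A per-column check like this is cheap (one `nunique()` pass per categorical column), so it could run before scoring without meaningfully adding to runtime.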