Allow missing values in `CategoricalMatrix` #281

MartinStancsicsQC · 2023-07-18T12:14:03Z

Checklist

Added a CHANGELOG.rst entry

This PR adds support for missing values in CategoricalMatrix. The default behavior does not change (raise on missing categoricals). However, if a matrix is constructed with the new cat_missing_method parameter set to zero, then missing categorical values are allowed, and are treated as all-zero indicator rows in subsequent method calls. This new behavior is consistent with pandas.get_dummies' way of handling missing values.

Notes about performance/templating functions:

When a categorical matrix has missing values, the *_complex (formerly *_drop_first) suite of functions is used, so this feature has no performance impact in the most important case (no drop_first and no missing values).
Most of the changes on the C++/Cython side are very minor. The only bigger differences are in the _sandwich_cat_dense* and _sandwich_cat_cat* functions, which were not templated by drop_first before, but are now.
Benchmarks show that performance differences between the *_fast and *_complex functions is negligible in most cases. The only exception is CategoricalMatrix.tocsr (and subsequently CategoricalMatrix.multiply), where the _*fast function is indeed faster (by 25% and 10%, respectively, on a 10,000,000 x 5 categorical matrix). This might imply that having two versions of these functions (like the newly templated Categorical._sandwich_cat_dense* and Categorical._sandwich_cat_cat, but maybe even Categorical.sandwich, Categorical.matvec and Categorical.transpose_matvec) may add more complexity than the benefit it brings.

lbittarello · 2023-07-19T08:10:59Z

What about adding the option to treat missing values as category in their own? :)

MartinStancsicsQC · 2023-07-19T08:45:00Z

Yes, that's a great idea, and I think it only takes a few extra lines in the __init__ methods to implement. And I believe CategoricalMatrix would then have full feature parity with pandas.get_dummies. I'll add it this afternoon.

MartinStancsicsQC · 2023-07-19T11:15:48Z

I've added the option to treat missings in categorical columns as their own category. What do you think, @lbittarello?

Couple of notes:

This is implemented as a pre-processing step instead of complicating things on the Cython/C++ side. The only drawback is that we need to make a copy of categorical columns with missing values, but I don't think that is a big issue.
In contrast to pandas.get_dummies, a new category is only added if the column actually contains missing values. I think this is more reasonable for our purposes, as all-zero columns make no sense when estimating a model.
At the moment the name of the missing category is always '(MISSING)' (even when it already exists). It would be simple to make it parametrezable, if you think it's worth the extra __init__ parameter. The name of the missing category is (MISSING) by default, but parametrizable through the cat_missing_name argument. If cat_missing_name is already an existing category in the input vector and the latter also contains missing values, a ValueError is raised.

MatthiasSchmidtblaicherQC

We should clarify under which circumstances we want to have the same behaviour as pd.get_dummies (except for not expanding the dummy matrix), and where we want to deviate. Otherwise, looks great!

src/tabmat/categorical_matrix.py

MartinStancsicsQC · 2023-07-26T09:40:04Z

@MarcAntoineSchmidtQC, let me know if we should merge it into main or aim for the next major release/refactor

src/tabmat/categorical_matrix.py

MarcAntoineSchmidtQC

Amazing. The tests seem to cover a lot. I don't have anything to add here.

MarcAntoineSchmidtQC · 2023-08-11T20:14:28Z

We can merge to main right now. My understanding is that it's fully backward compatible.

* Minimal implementation (tests green) * Remove sum method and rely on np.sum * Force DenseMatrix to always be 2-dimensional * Add __repr__ and __str__ methods * Fix as_mx * Fix ufunc return value * Wrap SparseMatrix, too * Demo of how the ufunc interface can be implemented * Do not subclass csc_matrix * Demonstrate binary ufuncs for sparse * Add tocsc method * Fix type checks * Minor improvements * ufunc support for categoricals * Remove __array_ufunc__ interface * Remove numpy operator mixin * Add hstack function * Add method for unpacking underlying array * Add __matmul__ methods to SparseMatrix * Stricter and more consistent indexing * Be consistent when instantiating from 1d arrays * Add column name metadata to `tabmat` matrices (#278) * Add column name getters * Matrix names are also combined * Add names to constructors * Add indexing support for column names * Remove unnecessary code * Better default column names * Reduce code duplication * Saner defaults * Add convenient getters and setters * Fix indexing * Smarter setter for categorical matrices * Add tests * Fix subsetting with np.newaxis * Remove the walrus :( * Fix test * Fix indexing with np.ix_ * Propagate column names where it makes sense * Fix merge mistake * Add changelog entry * Matrices from formulas (#267) * Add an experimental tabmat materializer class * Nicer way of handling interactions * Have proper column names [skip ci] * Make dummy ordering consistent with pandas [skip ci] * Fix mistake in categorical interactions [skip ci] * Add formulaic to environment files Have not added to the conda recipe yet. Should probably be optional. * Add from_formula constructor * Add some tests * Add more tests * Major refactoring - simplify categorical interactions - NaNs in categoricals should be handled correctly - parity with formulaic in categorical names * Make name formatting custommizable - interaction_separator - categorical_format - intercept_name * Add formulaic to conda recipe * Implement `C()` function to convert to categoricals * Auto-convert strings to categories * Fix C() not working from materializer interface * Add the pandasmaterializer tests from formulaic * Add formulaic to setup.py deps * Implement suggestions from code review * Clean up code - Add docstrings - Add type hints - Rename some classes * Pin formulaic minimum version * Add support for architectures not supported by xsimd (#262) * Release 3.1.9 (#263) * Pre-commit autoupdate (#264) Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com> * Add params for density and cardinality thresholds * Skip python 3.6 build * Refactor to avoid circular imports * Interaction of dropped and NA is dropped * Add type hint for context * Add unit tests for interactable vectors * Add more checks * Change argument name * Make C() stateful (remember levels) * Add test for categorizer state * More correct handling of encoding categoricals * Make adding an intercept implicitly parametrizable Default is False * Add na_action parameter to constrictor * Add test for sparse numerical columns * Add option to not add the constant column * Pre-commit autoupdate (#274) * Pre-commit autoupdate (#276) Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com> * Bump pypa/gh-action-pypi-publish from 1.8.6 to 1.8.7 (#277) Bumps [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) from 1.8.6 to 1.8.7. - [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases) - [Commits](pypa/gh-action-pypi-publish@v1.8.6...v1.8.7) --- updated-dependencies: - dependency-name: pypa/gh-action-pypi-publish dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (#279) Bumps [pypa/gh-action-pypi-publish](https://github.com/pypa/gh-action-pypi-publish) from 1.8.7 to 1.8.8. - [Release notes](https://github.com/pypa/gh-action-pypi-publish/releases) - [Commits](pypa/gh-action-pypi-publish@v1.8.7...v1.8.8) --- updated-dependencies: - dependency-name: pypa/gh-action-pypi-publish dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump pypa/cibuildwheel from 2.13.1 to 2.14.1 (#280) Bumps [pypa/cibuildwheel](https://github.com/pypa/cibuildwheel) from 2.13.1 to 2.14.1. - [Release notes](https://github.com/pypa/cibuildwheel/releases) - [Changelog](https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md) - [Commits](pypa/cibuildwheel@v2.13.1...v2.14.1) --- updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Minimal implementation (tests green) * Remove sum method and rely on np.sum * Force DenseMatrix to always be 2-dimensional * Add __repr__ and __str__ methods * Fix as_mx * Fix ufunc return value * Wrap SparseMatrix, too * Demo of how the ufunc interface can be implemented * Do not subclass csc_matrix * Improve the performance of `from_pandas` in the case of low-cardinality categoricals (#275) * Improve the performance of `from_pandas` * Update changelog according to review * Add benchmark data to .gitignore (#282) * Demonstrate binary ufuncs for sparse * Add tocsc method * Fix type checks * Minor improvements * ufunc support for categoricals * Remove __array_ufunc__ interface * Remove numpy operator mixin * Add hstack function * Add method for unpacking underlying array * Add __matmul__ methods to SparseMatrix * Stricter and more consistent indexing * Be consistent when instantiating from 1d arrays * Adjust tests to work with v4 * Fix type hints * Add changelog entry * term and column names for formula-based matrices * Fix handling of formula-based names * Add tests for formula-based names --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Martin Stancsics <[email protected]> Co-authored-by: Uwe L. Korn <[email protected]> Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Apply Matthias' suggestions Co-authored-by: Matthias Schmidtblaicher <[email protected]> * Allow missing values in `CategoricalMatrix` (#281) * Add missing support to categoricals * Rename functions * Parametrize missing behavior in constructors * Return a maskedarray from recover_orig * Propagate missing_method when indexing * Add tests * Template all the things! * Privatize has_missing attribute * Add changelog entry * Add option to treat missing values as a category * Update changelog * Raise if the missing category already exists * Add tests for missing name and raise on existing * Don't skip tests (they are fast) * Apply suggestions from review * Fix indxing * Fix intercept name in formulas * Add missing cateegorical functinoality to formulas * Much cooler handlong of missing categoricals * Add changelog entry * Correctly create missing category from model_spec (#297) * pyupgrade 3.9 * make ruff and mypy happy * bump minimum formulaic version (stateful transforms) * add test case with custom cat format * pin formulaic minimum version to 0.6 (#340) * cosmetics * Raise for unseen categories when materializing from an existing `ModelSpec` (#341) * Raise error on unseen levels when materializing * Fix test for unseen categories * Add test for raising on unseen categories * Properly handle missings when checking for unseen * Expand test for unseen missings * Improve attribute name * Add comment about dropping missings in tests for new levels * consistent tense * typo * slightly improve wording * Describe breaking change * improve wording * review comments * add change from #356 * fix * set default context to None * add scope to other test, too * tiny docstring cosmetics * remove duplicate . [skip-ci] * more docstring formatting --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Matthias Schmidtblaicher <[email protected]> Co-authored-by: Uwe L. Korn <[email protected]> Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Marc-Antoine Schmidt <[email protected]> Co-authored-by: Matthias Schmidtblaicher <[email protected]> Co-authored-by: Martin Stancsics <[email protected]>

MartinStancsicsQC marked this pull request as ready for review July 18, 2023 12:14

MartinStancsicsQC requested review from tbenthompson, MarcAntoineSchmidtQC, xhochy, jtilly and lbittarello as code owners July 18, 2023 12:14

MartinStancsicsQC changed the title ~~Categorical missing~~ Allow missing values in CategoricalMatrix Jul 18, 2023

MatthiasSchmidtblaicherQC reviewed Jul 21, 2023

View reviewed changes

src/tabmat/categorical_matrix.py Outdated Show resolved Hide resolved

src/tabmat/categorical_matrix.py Outdated Show resolved Hide resolved

MartinStancsicsQC requested review from MatthiasSchmidtblaicherQC and MarcAntoineSchmidtQC and removed request for MarcAntoineSchmidtQC July 26, 2023 09:35

MatthiasSchmidtblaicherQC reviewed Jul 26, 2023

View reviewed changes

src/tabmat/categorical_matrix.py Outdated Show resolved Hide resolved

MatthiasSchmidtblaicherQC reviewed Jul 26, 2023

View reviewed changes

src/tabmat/categorical_matrix.py Outdated Show resolved Hide resolved

MartinStancsicsQC force-pushed the categorical-missing branch from 7fff8bf to 19c45be Compare July 26, 2023 11:50

MarcAntoineSchmidtQC approved these changes Aug 11, 2023

View reviewed changes

stanmart added 10 commits August 16, 2023 14:23

Add missing support to categoricals

f05f515

Rename functions

1f8d875

Parametrize missing behavior in constructors

f292f22

Return a maskedarray from recover_orig

7ddcc65

Propagate missing_method when indexing

5877a48

Add tests

fd069b1

Template all the things!

97da60b

Privatize has_missing attribute

0a8daff

Add changelog entry

cd0e752

Add option to treat missing values as a category

aae2702

stanmart added 6 commits August 16, 2023 14:28

Update changelog

0011681

Raise if the missing category already exists

e212bea

Add tests for missing name and raise on existing

5f58365

Don't skip tests (they are fast)

d64630c

Apply suggestions from review

f4e8e05

Fix indxing

2a3a250

MartinStancsicsQC force-pushed the categorical-missing branch from 19c45be to 2a3a250 Compare August 16, 2023 12:34

MartinStancsicsQC added this to the Tabmat v4 milestone Aug 16, 2023

stanmart added 3 commits August 16, 2023 15:18

Fix intercept name in formulas

c2c83b3

Add missing cateegorical functinoality to formulas

9652010

Much cooler handlong of missing categoricals

6d66026

MartinStancsicsQC changed the base branch from main to tabmat-v4 August 16, 2023 15:35

MartinStancsicsQC merged commit a830107 into tabmat-v4 Aug 17, 2023
12 checks passed

MartinStancsicsQC deleted the categorical-missing branch August 17, 2023 09:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow missing values in `CategoricalMatrix` #281

Allow missing values in `CategoricalMatrix` #281

MartinStancsicsQC commented Jul 18, 2023 •

edited

Loading

lbittarello commented Jul 19, 2023

MartinStancsicsQC commented Jul 19, 2023 •

edited

Loading

MartinStancsicsQC commented Jul 19, 2023 •

edited

Loading

MatthiasSchmidtblaicherQC left a comment

MartinStancsicsQC commented Jul 26, 2023

MarcAntoineSchmidtQC left a comment

MarcAntoineSchmidtQC commented Aug 11, 2023

Allow missing values in CategoricalMatrix #281

Allow missing values in CategoricalMatrix #281

Conversation

MartinStancsicsQC commented Jul 18, 2023 • edited Loading

lbittarello commented Jul 19, 2023

MartinStancsicsQC commented Jul 19, 2023 • edited Loading

MartinStancsicsQC commented Jul 19, 2023 • edited Loading

MatthiasSchmidtblaicherQC left a comment

Choose a reason for hiding this comment

MartinStancsicsQC commented Jul 26, 2023

MarcAntoineSchmidtQC left a comment

Choose a reason for hiding this comment

MarcAntoineSchmidtQC commented Aug 11, 2023

Allow missing values in `CategoricalMatrix` #281

Allow missing values in `CategoricalMatrix` #281

MartinStancsicsQC commented Jul 18, 2023 •

edited

Loading

MartinStancsicsQC commented Jul 19, 2023 •

edited

Loading

MartinStancsicsQC commented Jul 19, 2023 •

edited

Loading