Standardize DenseMatrix by shifting data #414

Open
mlondschien opened this issue Nov 12, 2024 · 13 comments
@mlondschien
Contributor

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n)

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
50.9 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
34.2 ms ± 602 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
352 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16
In [6]: print(np.max(np.abs(sandwich1.T - sandwich1)))
0.0

These numbers already include #408. I guess centering the data explicitly results in an additional copy.
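
For scale, a rough back-of-the-envelope on the memory cost of the two approaches above (this assumes StandardizedMatrix only adds two length-p vectors, shift and mult, on top of the wrapped matrix, as discussed further down in this thread):

import numpy as np

n, p = 5_000, 1_000
itemsize = np.dtype(np.float64).itemsize

# Original float64 matrix: n * p * 8 bytes, ~40 MB.
bytes_original = n * p * itemsize

# StandardizedMatrix wraps the existing data and adds two length-p vectors
# (shift and mult), i.e. only ~16 kB of extra memory.
bytes_wrapper_extra = 2 * p * itemsize

# Explicit (X - mean) / std materializes a second n x p array, another ~40 MB.
bytes_explicit_copy = n * p * itemsize

print(bytes_original, bytes_wrapper_extra, bytes_explicit_copy)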

@mlondschien
Contributor Author

xref Quantco/glum#872

This also happens if weights is standardized.

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
50.6 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
34.5 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
365 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16

In [6]: weights.sum()
Out[6]: np.float64(1.0)

@mlondschien
Contributor Author

To make matters worse:

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p).astype(np.float32) ** 2
   ...: stds = rng.exponential(10, p).astype(np.float32) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(np.float32) * stds + means
   ...: X = X.astype(np.float32)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(np.float32) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(np.float32)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
26.5 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
19.2 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
174 ms ± 6.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
32769.0
32769.0
8.34465e-07

@stanmart
Collaborator

Interesting, the speed difference remains even when I look at larger/longer matrices (e.g. 5_000_000 x 100). It might be worth looking into why StandardizedMatrix.sandwich is so much slower than DenseMatrix.sandwich.

I guess centering the data explicitly results in an additional copy.

This is a big deal IMO. Even if we implement standardizing by modifying the data, it should be optional and we should keep the possibility to do it without a copy.

@mlondschien
Contributor Author

It might also be worth it to investigate why StandardizedMatrix.sandwich appears to return the wrong result, independently of speed.

@jtilly
Member

jtilly commented Nov 18, 2024

It might also be worth it to investigate why StandardizedMatrix.sandwich appears to return the wrong result, independently of speed.

I don't think it (necessarily) returns the "wrong" result. Your example assumes that we're standardizing with the standard deviation, which is not quite what we do.

  • We bound the standard deviation away from zero and standardize with mult.
  • Our standard deviation implementation uses a shortcut, which is probably numerically not as stable as the one that ships with pandas / numpy, etc. That's why we're seeing differences for float32.

Again, none of this really matters, because we store the multiplier with the matrix, so we're internally consistent.

No comment on the speed differences that you observed (yet).
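
For illustration, here is a minimal sketch (made-up column, not tabmat code) of why a one-pass E[X^2] - E[X]^2 shortcut loses precision in float32 when a column has a large mean relative to its spread, assuming that is the kind of shortcut meant above:

import numpy as np

rng = np.random.default_rng(0)
# A column with a large mean and a tiny spread, stored in float32.
x = (46.23 + 0.001 * rng.uniform(size=5_000)).astype(np.float32)

# Two-pass variance: center first, then average the squared deviations.
var_two_pass = np.mean((x - x.mean()) ** 2)

# One-pass shortcut: E[X^2] - E[X]^2. Both terms are ~2137, so their true
# difference (~1e-7) is far below float32's resolution at that magnitude
# and the result is dominated by rounding noise (it can even go negative).
var_one_pass = np.mean(x * x) - np.mean(x) ** 2

print(var_two_pass, var_one_pass)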

@mlondschien
Contributor Author

If I use the col_stds from standardize and normalize with _one_over_var_inf_to_val I still get different results:

float64:

In [4]: import numpy as np
   ...: import tabmat
   ...: from tabmat.matrix_base import _one_over_var_inf_to_val
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: dtype = np.float64
   ...: means = rng.exponential(10, p).astype(dtype) ** 2
   ...: stds = rng.exponential(10, p).astype(dtype) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(dtype) * stds + means
   ...: X = X.astype(dtype)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(dtype) / n, True, True)
   ...: 
   ...: # emp_mean2 = X.mean(axis=0)
   ...: # emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean1) * _one_over_var_inf_to_val(emp_std1, 1.0)
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(dtype)
   ...: weights /= weights.sum()
   ...: 
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.0523996879511075
0.05239968795110772
6.661338147750939e-16

float32:

In [3]: import numpy as np
   ...: import tabmat
   ...: from tabmat.matrix_base import _one_over_var_inf_to_val
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: dtype = np.float32
   ...: means = rng.exponential(10, p).astype(dtype) ** 2
   ...: stds = rng.exponential(10, p).astype(dtype) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(dtype) * stds + means
   ...: X = X.astype(dtype)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(dtype) / n, True, True)
   ...: 
   ...: # emp_mean2 = X.mean(axis=0)
   ...: # emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean1) * _one_over_var_inf_to_val(emp_std1, 1.0)
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(dtype)
   ...: weights /= weights.sum()
   ...: 
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
11.368562
11.368562
1.1920929e-06

@jtilly
Member

jtilly commented Nov 18, 2024

I think I have a small scale reproducer for transpose_matvec(...):

import numpy as np
import tabmat

n = 5
p = 3
dtype = np.float32

rng = np.random.default_rng(0)
means = rng.exponential(10, p).astype(dtype) ** 2
stds = rng.exponential(10, p).astype(dtype) ** 2

X = (rng.uniform(size=(n, p)).astype(dtype) * stds + means).astype(dtype)
weights = np.full(n, 1 / n, dtype=dtype)

# What does .standardize(...) do?
stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)

# stmat.shift is equal to -out_means / col_stds (there's special treatment for zero stds)
np.testing.assert_allclose(stmat.shift, -out_means / col_stds)

# stmat.mult is equal to 1 / col_stds (there's special treatment for zero stds)
np.testing.assert_allclose(stmat.mult, 1 / col_stds)

# stmat.mat.toarray() is just the original matrix
np.testing.assert_allclose(stmat.mat.toarray(), X)

# stmat.toarray() applies shift and mult to the original matrix
np.testing.assert_allclose(stmat.toarray(), X * stmat.mult[None, :] + stmat.shift[None, :])

# let's take a transpose and multiply with some other vector
v = rng.uniform(size=n).astype(dtype)
v /= weights.sum()
np.testing.assert_allclose(stmat.transpose_matvec(v), stmat.toarray().T @ v)  # this fails

I think we're a bit too clever (or not clever enough) in transpose_matvec. Some of the terms can get rather large when standard deviations are small, which makes it easy to get imprecise (or overflow) with float32.
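
To put rough numbers on the cancellation: if the fused kernel effectively splits the result into a mult * (X[:, col] @ v) part and a shift[col] * sum(v) part (as the by-hand decomposition in the next comment suggests), both parts are of order |shift| * sum(v) even though the true result is tiny. A made-up float32 illustration with magnitudes similar to the first column here:

import numpy as np

# Hypothetical magnitudes: mean ~46.23, "std" ~0.0156, so shift ~ -2963.
shift = np.float32(-2963.5)
sum_v = np.float32(2.44)
true_result = np.float32(4e-5)           # order of the exact standardized answer

big = np.float32(-shift * sum_v)         # the mult * (X[:, 0] @ v) part, ~ +7231

# float32 spacing next to ~7e3 is ~5e-4, so anything smaller than that
# cannot survive being added to and then subtracted from terms of this size.
print(np.spacing(big))
print((big + true_result) + shift * sum_v)   # prints 0.0: the 4e-5 is lost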

@jtilly
Member

jtilly commented Nov 19, 2024

Same example as above, just without any random numbers, so it's easier to reproduce:

import numpy as np
import tabmat

dtype = np.float32  # the problem disappears with np.float64, but the first column's result differs a lot between the two dtypes

# results of the NumPy based product with float64:
# [ 0.00003993 -0.14760692 -0.05215349]
# results of the NumPy based product with float32:
# [ 0.00290794 -0.14760576 -0.05215368]

X = np.array(
    [
        [46.231056, 126.05263, 144.46439],
        [46.231224, 128.66818, 0.7667693],
        [46.231186, 104.97506, 193.8872],
        [46.230835, 130.10156, 143.88954],
        [46.230896, 116.76007, 7.5629334],
    ],
    dtype=dtype,
)
v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)

weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)

# compute transpose_matvec(...)
np.testing.assert_allclose(stmat.transpose_matvec(v), stmat.toarray().T @ v)  # fails

# compute by hand
res = np.zeros(X.shape[1], dtype=dtype)
for col in range(X.shape[1]):
    res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v

np.testing.assert_allclose(res, stmat.toarray().T @ v)  # passes

# factor out stmat.shift * sum(v)
res = stmat.shift * sum(v)
for col in range(X.shape[1]):
    res[col] += (stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v

np.testing.assert_allclose(res, stmat.toarray().T @ v)  # fails

Note how the first column of X has a small standard deviation (0.0156), so we scale it by roughly 64.

When I factor out the shifter, I get (almost) the same result as with stmat.transpose_matvec(v).

However, I'm not really sure what the "correct" result is here, because when I switch to float64, I get a result for the first column that's quite different (0.00003993 vs 0.00290794). Maybe it's not a good idea to go to float32 when the data is on such different scales?

@lbittarello
Member

Might it make sense to define a "small standard deviation" in relative rather than absolute terms?
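
For what it's worth, a sketch of what a relative cutoff could look like (purely hypothetical helper, not tabmat's current _one_over_var_inf_to_val):

import numpy as np

def one_over_std_relative(col_means, col_stds, rel_tol=1e-5):
    # Treat a column as (near-)constant when its std is tiny *relative to
    # its mean*, instead of comparing against an absolute cutoff; such
    # columns get a multiplier of 1 instead of a huge 1 / std.
    near_constant = col_stds <= rel_tol * np.abs(col_means)
    safe_stds = np.where(near_constant, 1.0, col_stds)
    return np.where(near_constant, 1.0, 1.0 / safe_stds)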

@mlondschien
Contributor Author

Where do the numpy results in your comment come from? I get

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...: 
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...: 
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...: 
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...: 
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...: 
   ...:     print(res)
   ...:     print("\n")
   ...: 
[ 0.29243308 -0.1476075  -0.05215386]
[ 0.375      -0.1476078  -0.05215359]
[ 0.29243308 -0.1476075  -0.05215386]


[ 0.25974406 -0.14760692 -0.05215349]
[ 0.25974406 -0.14760692 -0.05215349]
[ 0.25974406 -0.14760692 -0.05215349]

Different numbers, but same effect.

@jtilly
Member

jtilly commented Nov 19, 2024

Yes, I get

[ 0.00290794 -0.14760576 -0.05215368]
[ 0.00341797 -0.1476059  -0.05215335]
[ 0.00290794 -0.14760576 -0.05215368]


[ 0.00003993 -0.14760692 -0.05215349]
[ 0.00003993 -0.14760692 -0.05215349]
[ 0.00003993 -0.14760692 -0.05215349]

when I run your script (see here for a CI run with different numbers).

Why do we have the shifter / multiplier in the first place? The point is that we don't want to turn a sparse matrix into a dense matrix just because we're standardizing it. I would be open to just shifting / multiplying dense matrices and calling "normal" matrix operations on the standardized matrix.

Alternatively/additionally, float32 + data on very different scales could be one of the instances where the user is better off standardizing outside of tabmat (and forgoing sparsity).
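
To illustrate the sparsity point with a toy example (using scipy.sparse rather than tabmat, just to show why explicit centering is off the table for sparse data):

import numpy as np
import scipy.sparse as sps

X = sps.random(10_000, 100, density=0.01, format="csc", random_state=0)
means = np.asarray(X.mean(axis=0)).ravel()

print(X.nnz)                                    # ~10_000 stored entries
X_centered = sps.csc_matrix(X.toarray() - means)
print(X_centered.nnz)                           # ~1_000_000: effectively dense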

@mlondschien
Contributor Author

My results above were run on an M1 MacBook with tabmat=4.0.1. After upgrading to tabmat=4.1.0 I get

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...: 
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...: 
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...: 
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...: 
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...: 
   ...:     print(res)
   ...:     print("\n")
   ...: 
[ 0.00290794 -0.14760576 -0.05215368]
[ 0.00341797 -0.1476059  -0.05215335]
[ 0.00290794 -0.14760576 -0.05215368]


[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]

On a Linux x86 machine with tabmat=4.1.0 I get

In [1]: 
   ...: import numpy as np
   ...: import tabmat
   ...: 
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...: 
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...: 
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...: 
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...: 
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...: 
   ...:     print(res)
   ...:     print("\n")
   ...: 
[ 3.6124271e-05 -1.4760569e-01 -5.2153736e-02]
[ 4.5776367e-05 -1.4760208e-01 -5.2153349e-02]
[ 3.612428e-05 -1.476057e-01 -5.215369e-02]


[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]

I'm surprised you get different results for float64.

@jtilly
Member

jtilly commented Nov 19, 2024

I'm surprised you get different results for float64.

Yeah, never mind. I had some local changes. You can see in CI runs that the float64 results are consistent across platforms.
