Standardize DenseMatrix by shifting data #414

Open
mlondschien opened this issue Nov 12, 2024 · 13 comments
@mlondschien
Contributor

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n)

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
50.9 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
34.2 ms ± 602 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
352 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16
In [6]: print(np.max(np.abs(sandwich1.T - sandwich1)))
0.0

These numbers already include #408. I guess centering the data explicitly results in an additional copy.
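
For scale, a rough back-of-the-envelope on the memory cost of the two approaches above (this assumes StandardizedMatrix only adds two length-p vectors, shift and mult, on top of the wrapped matrix, as discussed further down in this thread):

import numpy as np

n, p = 5_000, 1_000
itemsize = np.dtype(np.float64).itemsize

# Original float64 matrix: n * p * 8 bytes, ~40 MB.
bytes_original = n * p * itemsize

# StandardizedMatrix wraps the existing data and adds two length-p vectors
# (shift and mult), i.e. only ~16 kB of extra memory.
bytes_wrapper_extra = 2 * p * itemsize

# Explicit (X - mean) / std materializes a second n x p array, another ~40 MB.
bytes_explicit_copy = n * p * itemsize

print(bytes_original, bytes_wrapper_extra, bytes_explicit_copy)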

@mlondschien
Contributor Author

xref Quantco/glum#872

This also happens if weights is standardized.

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
50.6 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
34.5 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
365 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16

In [6]: weights.sum()
Out[6]: np.float64(1.0)

@mlondschien
Contributor Author

To make matters worse:

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p).astype(np.float32) ** 2
   ...: stds = rng.exponential(10, p).astype(np.float32) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(np.float32) * stds + means
   ...: X = X.astype(np.float32)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(np.float32) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(np.float32)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
26.5 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
19.2 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
174 ms ± 6.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
32769.0
32769.0
8.34465e-07

@stanmart
Collaborator

Interesting, the speed difference remains even when I look at larger/longer matrices (e.g. 5_000_000 x 100). It might be worth looking into why StandardizedMatrix.sandwich is so much slower than DenseMatrix.sandwich.

I guess centering the data explicitly results in an additional copy.

This is a big deal IMO. Even if we implement standardizing by modifying the data, it should be optional and we should keep the possibility to do it without a copy.

@mlondschien
Contributor Author

It might also be worth it to investigate why StandardizedMatrix.sandwich appears to return the wrong result, independently of speed.

@jtilly
Member

jtilly commented Nov 18, 2024

It might also be worth it to investigate why StandardizedMatrix.sandwich appears to return the wrong result, independently of speed.

I don't think it (necessarily) returns the "wrong" result. Your example assumes that we're standardizing with the standard deviation, which is not quite what we do.

  • We bound the standard deviation away from zero and standardize with mult.
  • Our standard deviation implementation uses a shortcut, which is probably numerically not as stable as the one that ships with pandas / numpy, etc. That's why we're seeing differences for float32.

Again, none of this really matters, because we store the multiplier with the matrix, so we're internally consistent.

No comment on the speed differences that you observed (yet).
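
For illustration, here is a minimal sketch (made-up column, not tabmat code) of why a one-pass E[X^2] - E[X]^2 shortcut loses precision in float32 when a column has a large mean relative to its spread, assuming that is the kind of shortcut meant above:

import numpy as np

rng = np.random.default_rng(0)
# A column with a large mean and a tiny spread, stored in float32.
x = (46.23 + 0.001 * rng.uniform(size=5_000)).astype(np.float32)

# Two-pass variance: center first, then average the squared deviations.
var_two_pass = np.mean((x - x.mean()) ** 2)

# One-pass shortcut: E[X^2] - E[X]^2. Both terms are ~2137, so their true
# difference (~1e-7) is far below float32's resolution at that magnitude
# and the result is dominated by rounding noise (it can even go negative).
var_one_pass = np.mean(x * x) - np.mean(x) ** 2

print(var_two_pass, var_one_pass)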

@mlondschien
Contributor Author

If I use the col_stds from standardize and normalize with _one_over_var_inf_to_val I still get different results:

float64:

In [4]: import numpy as np
   ...: import tabmat
   ...: from tabmat.matrix_base import _one_over_var_inf_to_val
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: dtype = np.float64
   ...: means = rng.exponential(10, p).astype(dtype) ** 2
   ...: stds = rng.exponential(10, p).astype(dtype) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(dtype) * stds + means
   ...: X = X.astype(dtype)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(dtype) / n, True, True)
   ...: 
   ...: # emp_mean2 = X.mean(axis=0)
   ...: # emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean1) * _one_over_var_inf_to_val(emp_std1, 1.0)
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(dtype)
   ...: weights /= weights.sum()
   ...: 
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.0523996879511075
0.05239968795110772
6.661338147750939e-16

float32:

In [3]: import numpy as np
   ...: import tabmat
   ...: from tabmat.matrix_base import _one_over_var_inf_to_val
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: dtype = np.float32
   ...: means = rng.exponential(10, p).astype(dtype) ** 2
   ...: stds = rng.exponential(10, p).astype(dtype) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(dtype) * stds + means
   ...: X = X.astype(dtype)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(dtype) / n, True, True)
   ...: 
   ...: # emp_mean2 = X.mean(axis=0)
   ...: # emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean1) * _one_over_var_inf_to_val(emp_std1, 1.0)
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(dtype)
   ...: weights /= weights.sum()
   ...: 
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
11.368562
11.368562
1.1920929e-06

@jtilly
Member

jtilly commented Nov 18, 2024

I think I have a small scale reproducer for transpose_matvec(...):

import numpy as np
import tabmat

n = 5
p = 3
dtype = np.float32

rng = np.random.default_rng(0)
means = rng.exponential(10, p).astype(dtype) ** 2
stds = rng.exponential(10, p).astype(dtype) ** 2

X = (rng.uniform(size=(n, p)).astype(dtype) * stds + means).astype(dtype)
weights = np.full(n, 1 / n, dtype=dtype)

# What does .standardize(...) do?
stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)

# stmat.shift is equal to -out_means / col_stds (there's special treatment for zero stds)
np.testing.assert_allclose(stmat.shift, -out_means / col_stds)

# stmat.mult is equal to 1 / col_stds (there's special treatment for zero stds)
np.testing.assert_allclose(stmat.mult, 1 / col_stds)

# stmat.mat.toarray() is just the original matrix
np.testing.assert_allclose(stmat.mat.toarray(), X)

# stmat.toarray() applies shift and mult to the original matrix
np.testing.assert_allclose(stmat.toarray(), X * stmat.mult[None, :] + stmat.shift[None, :])

# let's take a transpose and multiply with some other vector
v = rng.uniform(size=n).astype(dtype)
v /= weights.sum()
np.testing.assert_allclose(stmat.transpose_matvec(v), stmat.toarray().T @ v)  # this fails

I think we're a bit too clever (or not clever enough) in transpose_matvec. Some of the terms can get rather large when standard deviations are small, which makes it easy to get imprecise (or overflow) with float32.
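
To put rough numbers on the cancellation: if the fused kernel effectively splits the result into a mult * (X[:, col] @ v) part and a shift[col] * sum(v) part (as the by-hand decomposition in the next comment suggests), both parts are of order |shift| * sum(v) even though the true result is tiny. A made-up float32 illustration with magnitudes similar to the first column here:

import numpy as np

# Hypothetical magnitudes: mean ~46.23, "std" ~0.0156, so shift ~ -2963.
shift = np.float32(-2963.5)
sum_v = np.float32(2.44)
true_result = np.float32(4e-5)           # order of the exact standardized answer

big = np.float32(-shift * sum_v)         # the mult * (X[:, 0] @ v) part, ~ +7231

# float32 spacing next to ~7e3 is ~5e-4, so anything smaller than that
# cannot survive being added to and then subtracted from terms of this size.
print(np.spacing(big))
print((big + true_result) + shift * sum_v)   # prints 0.0: the 4e-5 is lost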

@jtilly
Member

jtilly commented Nov 19, 2024

Same example as above, just without any random numbers, so it's easier to reproduce:

import numpy as np
import tabmat

dtype = np.float32  # the problem disappears with np.float64, but the first column's result differs a lot between the two dtypes

# results of the NumPy based product with float64:
# [ 0.00003993 -0.14760692 -0.05215349]
# results of the NumPy based product with float32:
# [ 0.00290794 -0.14760576 -0.05215368]

X = np.array(
    [
        [46.231056, 126.05263, 144.46439],
        [46.231224, 128.66818, 0.7667693],
        [46.231186, 104.97506, 193.8872],
        [46.230835, 130.10156, 143.88954],
        [46.230896, 116.76007, 7.5629334],
    ],
    dtype=dtype,
)
v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)

weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)

# compute transpose_matvec(...)
np.testing.assert_allclose(stmat.transpose_matvec(v), stmat.toarray().T @ v)  # fails

# compute by hand
res = np.zeros(X.shape[1], dtype=dtype)
for col in range(X.shape[1]):
    res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v

np.testing.assert_allclose(res, stmat.toarray().T @ v)  # passes

# factor out stmat.shift * sum(v)
res = stmat.shift * sum(v)
for col in range(X.shape[1]):
    res[col] += (stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v

np.testing.assert_allclose(res, stmat.toarray().T @ v)  # fails

Note how the first column of X has a small standard deviation (0.0156), so we scale it by roughly 64.

When I factor out the shifter, I get (almost) the same result as with stmat.transpose_matvec(v).

However, I'm not really sure what the "correct" result is here, because when I switch to float64, I get a result for the first column that's quite different (0.00003993 vs 0.00290794). Maybe it's not a good idea to go to float32 when the data is on such different scales?

@lbittarello
Member

Might it make sense to define a "small standard deviation" in relative rather than absolute terms?
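
For what it's worth, a sketch of what a relative cutoff could look like (purely hypothetical helper, not tabmat's current _one_over_var_inf_to_val):

import numpy as np

def one_over_std_relative(col_means, col_stds, rel_tol=1e-5):
    # Treat a column as (near-)constant when its std is tiny *relative to
    # its mean*, instead of comparing against an absolute cutoff; such
    # columns get a multiplier of 1 instead of a huge 1 / std.
    near_constant = col_stds <= rel_tol * np.abs(col_means)
    safe_stds = np.where(near_constant, 1.0, col_stds)
    return np.where(near_constant, 1.0, 1.0 / safe_stds)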

@mlondschien
Contributor Author

Where do the numpy results in your comment come from? I get

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...: 
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...: 
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...: 
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...: 
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...: 
   ...:     print(res)
   ...:     print("\n")
   ...: 
[ 0.29243308 -0.1476075  -0.05215386]
[ 0.375      -0.1476078  -0.05215359]
[ 0.29243308 -0.1476075  -0.05215386]


[ 0.25974406 -0.14760692 -0.05215349]
[ 0.25974406 -0.14760692 -0.05215349]
[ 0.25974406 -0.14760692 -0.05215349]

Different numbers, but same effect.

@jtilly
Member

jtilly commented Nov 19, 2024

Yes, I get

[ 0.00290794 -0.14760576 -0.05215368]
[ 0.00341797 -0.1476059  -0.05215335]
[ 0.00290794 -0.14760576 -0.05215368]


[ 0.00003993 -0.14760692 -0.05215349]
[ 0.00003993 -0.14760692 -0.05215349]
[ 0.00003993 -0.14760692 -0.05215349]

when I run your script (see here for a CI run with different numbers).

Why do we have the shifter / multiplier in the first place? The point is that we don't want to turn a sparse matrix into a dense matrix just because we're standardizing it. I would be open to just shifting / multiplying dense matrices and calling "normal" matrix operations on the standardized matrix.

Alternatively/additionally, float32 + data on very different scales could be one of the instances where the user is better off standardizing outside of tabmat (and forgoing sparsity).
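
To illustrate the sparsity point with a toy example (using scipy.sparse rather than tabmat, just to show why explicit centering is off the table for sparse data):

import numpy as np
import scipy.sparse as sps

X = sps.random(10_000, 100, density=0.01, format="csc", random_state=0)
means = np.asarray(X.mean(axis=0)).ravel()

print(X.nnz)                                    # ~10_000 stored entries
X_centered = sps.csc_matrix(X.toarray() - means)
print(X_centered.nnz)                           # ~1_000_000: effectively dense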

@mlondschien
Contributor Author

My results above were run on an M1 MacBook with tabmat=4.0.1. After upgrading to tabmat=4.1.0 I get

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...: 
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...: 
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...: 
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...: 
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...: 
   ...:     print(res)
   ...:     print("\n")
   ...: 
[ 0.00290794 -0.14760576 -0.05215368]
[ 0.00341797 -0.1476059  -0.05215335]
[ 0.00290794 -0.14760576 -0.05215368]


[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]

On a Linux x86 machine with tabmat=4.1.0 I get

In [1]: 
   ...: import numpy as np
   ...: import tabmat
   ...: 
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...: 
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...: 
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...: 
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...: 
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...: 
   ...:     print(res)
   ...:     print("\n")
   ...: 
[ 3.6124271e-05 -1.4760569e-01 -5.2153736e-02]
[ 4.5776367e-05 -1.4760208e-01 -5.2153349e-02]
[ 3.612428e-05 -1.476057e-01 -5.215369e-02]


[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]

I'm surprised you get different results for float64.

@jtilly
Member

jtilly commented Nov 19, 2024

I'm surprised you get different results for float64.

Yeah, never mind. I had some local changes. You can see in CI runs that the float64 results are consistent across platforms.
