# Standardize DenseMatrix by shifting data (#414)
xref Quantco/glum#872. This also happens if

```python
In [1]: import numpy as np
   ...: import tabmat
   ...:
   ...: n = 5_000
   ...: p = 1_000
   ...:
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...:
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...:
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...:
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...:
   ...: weights = rng.uniform(size=n)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
50.6 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
34.5 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
365 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...:
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16

In [6]: weights.sum()
Out[6]: np.float64(1.0)
```
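For reference, the sandwich product being benchmarked is `X.T @ diag(weights) @ X`. A minimal NumPy sketch (the `sandwich_dense` helper name is mine, not tabmat's implementation) shows why scaling rows beats materializing the n-by-n diagonal, which is what makes the `np.diag` variant in the timings above so slow:

```python
import numpy as np

def sandwich_dense(X, w):
    # Equivalent to X.T @ np.diag(w) @ X, but scales the rows of X by w
    # instead of building an n-by-n diagonal matrix: O(n*p) extra memory
    # and a single matrix product instead of two.
    return X.T @ (w[:, None] * X)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
w = rng.uniform(size=200)
w /= w.sum()

naive = X.T @ np.diag(w) @ X  # the formulation timed above
assert np.allclose(sandwich_dense(X, w), naive)
```

tabmat's own `sandwich` kernels are of course more elaborate than this, but the asymptotics are the same reason the naive `np.diag` product is an order of magnitude slower.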
---

To make matters worse:

```python
In [1]: import numpy as np
   ...: import tabmat
   ...:
   ...: n = 5_000
   ...: p = 1_000
   ...:
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p).astype(np.float32) ** 2
   ...: stds = rng.exponential(10, p).astype(np.float32) ** 2
   ...:
   ...: X = rng.uniform(size=(n, p)).astype(np.float32) * stds + means
   ...: X = X.astype(np.float32)
   ...:
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(np.float32) / n, True, True)
   ...:
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...:
   ...: weights = rng.uniform(size=n).astype(np.float32)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
26.5 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
19.2 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
174 ms ± 6.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...:
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
32769.0
32769.0
8.34465e-07
```
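A plausible reading of the float32 blow-up, which later comments in this thread pin down in more detail: the standardized matrix is applied with `shift = -mean / std` factored out of the reduction, so columns whose spread is tiny relative to their mean produce intermediate terms many orders of magnitude larger than the final answer, and float32 cannot resolve their difference. A self-contained illustration of that absorption effect:

```python
import numpy as np

# float32 spacing near 1e6 is 0.0625, so any contribution below half an
# ulp is absorbed entirely; float64 spacing there is ~1.2e-10 and keeps it.
big = np.float32(1_000_000.0)
tiny = np.float32(0.01)

assert big + tiny == big                              # lost in float32
assert np.float64(1_000_000.0) + 0.01 != 1_000_000.0  # kept in float64
```

When two terms of this magnitude are meant to cancel down to an O(1) answer, the answer inherits the absolute rounding error of the large scale, which is why the float32 discrepancies above are large in absolute terms while the float64 ones stay near machine epsilon.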
---

Interesting, the speed difference remains even when I look at larger/longer matrices (e.g. 5_000_000 x 100). It might be worth looking into why.
---

This is a big deal IMO. Even if we implement standardizing by modifying the data, it should be optional, and we should keep the possibility of doing it without a copy.
---

It might also be worth investigating why
---

I don't think it (necessarily) returns the "wrong" result. Your example assumes that we're standardizing with the standard deviation, which is not quite what we do.

Again, none of this really matters, because we store the multiplier with the matrix, so we're internally consistent. No comment on the speed differences you observed (yet).
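For readers without the tabmat internals in their head: a standardized matrix keeps the original data untouched and stores per-column `shift` and `mult` vectors, so `toarray()` amounts to `X * mult + shift` (the small reproducer further down verifies exactly this). A rough NumPy sketch of that representation (hypothetical class, using a plain weighted population std, which per the comment above is not exactly what tabmat computes):

```python
import numpy as np

class LazyStandardized:
    """Represent X * mult + shift column-wise without copying X."""

    def __init__(self, X, weights):
        self.X = X
        mean = weights @ X                       # weighted column means
        var = weights @ (X - mean) ** 2          # weighted column variances
        self.mult = 1.0 / np.sqrt(var)           # 1 / std
        self.shift = -mean * self.mult           # -mean / std

    def toarray(self):
        return self.X * self.mult + self.shift

    def transpose_matvec(self, v):
        # (X * mult + shift).T @ v, with shift * sum(v) factored out
        return self.mult * (self.X.T @ v) + self.shift * v.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
w = np.full(50, 1 / 50)
m = LazyStandardized(X, w)
v = rng.normal(size=50)
assert np.allclose(m.transpose_matvec(v), m.toarray().T @ v)
```

With well-scaled float64 data the factored `transpose_matvec` agrees with the dense product; the rest of the thread is about what happens when the data is ill-scaled and the dtype is float32.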
---

If I use the

```python
In [4]: import numpy as np
   ...: import tabmat
   ...: from tabmat.matrix_base import _one_over_var_inf_to_val
   ...:
   ...: n = 5_000
   ...: p = 1_000
   ...:
   ...: rng = np.random.default_rng(0)
   ...: dtype = np.float64
   ...: means = rng.exponential(10, p).astype(dtype) ** 2
   ...: stds = rng.exponential(10, p).astype(dtype) ** 2
   ...:
   ...: X = rng.uniform(size=(n, p)).astype(dtype) * stds + means
   ...: X = X.astype(dtype)
   ...:
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(dtype) / n, True, True)
   ...:
   ...: # emp_mean2 = X.mean(axis=0)
   ...: # emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean1) * _one_over_var_inf_to_val(emp_std1, 1.0)
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...:
   ...: weights = rng.uniform(size=n).astype(dtype)
   ...: weights /= weights.sum()
   ...:
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...:
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.0523996879511075
0.05239968795110772
6.661338147750939e-16

In [3]: import numpy as np
   ...: import tabmat
   ...: from tabmat.matrix_base import _one_over_var_inf_to_val
   ...:
   ...: n = 5_000
   ...: p = 1_000
   ...:
   ...: rng = np.random.default_rng(0)
   ...: dtype = np.float32
   ...: means = rng.exponential(10, p).astype(dtype) ** 2
   ...: stds = rng.exponential(10, p).astype(dtype) ** 2
   ...:
   ...: X = rng.uniform(size=(n, p)).astype(dtype) * stds + means
   ...: X = X.astype(dtype)
   ...:
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(dtype) / n, True, True)
   ...:
   ...: # emp_mean2 = X.mean(axis=0)
   ...: # emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean1) * _one_over_var_inf_to_val(emp_std1, 1.0)
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...:
   ...: weights = rng.uniform(size=n).astype(dtype)
   ...: weights /= weights.sum()
   ...:
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...:
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
11.368562
11.368562
1.1920929e-06
```
---

I think I have a small-scale reproducer for

```python
import numpy as np
import tabmat

n = 5
p = 3
dtype = np.float32

rng = np.random.default_rng(0)
means = rng.exponential(10, p).astype(dtype) ** 2
stds = rng.exponential(10, p).astype(dtype) ** 2
X = (rng.uniform(size=(n, p)).astype(dtype) * stds + means).astype(dtype)
weights = np.full(n, 1 / n, dtype=dtype)

# What does .standardize(...) do?
stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)

# stmat.shift is equal to -out_means / col_stds (there's special treatment for zero stds)
np.testing.assert_allclose(stmat.shift, -out_means / col_stds)
# stmat.mult is equal to 1 / col_stds (there's special treatment for zero stds)
np.testing.assert_allclose(stmat.mult, 1 / col_stds)
# stmat.mat.toarray() is just the original matrix
np.testing.assert_allclose(stmat.mat.toarray(), X)
# stmat.toarray() applies shift and mult to the original matrix
np.testing.assert_allclose(stmat.toarray(), X * stmat.mult[None, :] + stmat.shift[None, :])

# let's take a transpose and multiply with some other vector
v = rng.uniform(size=n).astype(dtype)
v /= weights.sum()
np.testing.assert_allclose(stmat.transpose_matvec(v), stmat.toarray().T @ v)  # this fails
```

I think we're a bit too clever (or not clever enough) in
---

Same example as above, just without any random numbers, so it's easier to reproduce:

```python
import numpy as np
import tabmat

dtype = np.float32  # problem disappears with np.float64, but first columns are very different
# results of the NumPy-based product with float64:
# [ 0.00003993 -0.14760692 -0.05215349]
# results of the NumPy-based product with float32:
# [ 0.00290794 -0.14760576 -0.05215368]
X = np.array(
    [
        [46.231056, 126.05263, 144.46439],
        [46.231224, 128.66818, 0.7667693],
        [46.231186, 104.97506, 193.8872],
        [46.230835, 130.10156, 143.88954],
        [46.230896, 116.76007, 7.5629334],
    ],
    dtype=dtype,
)
v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)

stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)

# compute transpose_matvec(...)
np.testing.assert_allclose(stmat.transpose_matvec(v), stmat.toarray().T @ v)  # fails

# compute by hand
res = np.zeros(X.shape[1], dtype=dtype)
for col in range(X.shape[1]):
    res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
np.testing.assert_allclose(res, stmat.toarray().T @ v)  # passes

# factor out stmat.shift * sum(v)
res = stmat.shift * sum(v)
for col in range(X.shape[1]):
    res[col] += (stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
np.testing.assert_allclose(res, stmat.toarray().T @ v)  # fails
```

Note how the first column of
When I factor out the shifter, I get (almost) the same result as with
However, I'm not really sure what's the "correct" result here, because when I switch to float64, I get a result for the first column that's quite different (
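The two hand-written loops above differ only in where the shift enters the reduction. A standalone float32 sketch of the same two orderings (illustrative magnitudes, not tabmat code): in double precision they agree to rounding, while in single precision the factored form carries the absolute rounding error of the huge intermediate terms.

```python
import numpy as np

def rowwise(x, v, shift, mult):
    # shift applied per element before the reduction (matches toarray().T @ v)
    return (shift + mult * x) @ v

def factored(x, v, shift, mult):
    # shift * sum(v) factored out of the reduction
    return mult * (x @ v) + shift * v.sum()

rng = np.random.default_rng(0)
x = (46.0 + 1e-4 * rng.uniform(size=1000)).astype(np.float32)  # huge mean, tiny spread
v = rng.uniform(size=1000).astype(np.float32)
mean = x.mean(dtype=np.float32)
std = x.std(dtype=np.float32)
mult = np.float32(1.0) / std
shift = -mean / std

# In double precision the two orderings agree closely.
x64, v64 = x.astype(np.float64), v.astype(np.float64)
r64 = rowwise(x64, v64, np.float64(shift), np.float64(mult))
f64 = factored(x64, v64, np.float64(shift), np.float64(mult))
assert abs(r64 - f64) < 1e-3

# In single precision the factored intermediate terms are ~1e9, where the
# float32 spacing is 64, so the two results can differ noticeably.
print(rowwise(x, v, shift, mult), factored(x, v, shift, mult))
```

This matches the pattern in the reproducer: the element-wise loop agrees with `toarray().T @ v` (same evaluation order), while the factored form drifts on the ill-scaled column.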
---

Might it make sense to define a "small standard deviation" in relative rather than absolute terms?
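A sketch of what such a relative guard could look like (hypothetical helper name and `rel_tol` value; judging by its use earlier in the thread, tabmat's `_one_over_var_inf_to_val` only substitutes a fallback where the inverse is infinite, i.e. where the std is exactly zero):

```python
import numpy as np

def safe_inverse_std(stds, means, rel_tol=1e-5):
    # Flag a column as effectively constant when its spread is tiny
    # *relative to its magnitude*, not only when it is exactly zero.
    constant = stds <= rel_tol * np.abs(means)
    inv = np.ones_like(stds, dtype=float)   # fallback multiplier of 1.0
    inv[~constant] = 1.0 / stds[~constant]
    return inv, constant

means = np.array([46.231, 120.0, 0.0])
stds = np.array([1.5e-4, 9.5, 0.0])  # column 1: huge mean, tiny spread
inv, constant = safe_inverse_std(stds, means)
print(constant)  # [ True False  True]
```

The first column here is the pathological case from the reproducer: its std is nonzero, so an absolute zero-check passes it through, but it is far too small relative to the mean for float32 to standardize safely.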
---

Where do the numpy results in your comment come from? I get

```python
In [1]: import numpy as np
   ...: import tabmat
   ...:
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...:
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...:
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...:
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...:
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...:
   ...:     print(res)
   ...:     print("\n")
   ...:
[ 0.29243308 -0.1476075  -0.05215386]
[ 0.375      -0.1476078  -0.05215359]
[ 0.29243308 -0.1476075  -0.05215386]
[ 0.25974406 -0.14760692 -0.05215349]
[ 0.25974406 -0.14760692 -0.05215349]
[ 0.25974406 -0.14760692 -0.05215349]
```

Different numbers, but same effect.
---

Yes, I get
when I run your script (see here for a CI run with different numbers).

Why do we have the shifter / multiplier in the first place? The point is that we don't want to turn a sparse matrix into a dense matrix just because we're standardizing it. I would be open to just shifting / multiplying dense matrices and calling "normal" matrix operations on the standardized matrix. Alternatively/additionally, float32 + data on very different scales could be one of the instances where the user is better off standardizing outside of tabmat (and forgoing sparsity).
---

My results above were run on an M1 MacBook with

```python
In [1]: import numpy as np
   ...: import tabmat
   ...:
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...:
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...:
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...:
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...:
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...:
   ...:     print(res)
   ...:     print("\n")
   ...:
[ 0.00290794 -0.14760576 -0.05215368]
[ 0.00341797 -0.1476059  -0.05215335]
[ 0.00290794 -0.14760576 -0.05215368]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
```

on a Linux x86 machine with

```python
In [1]: import numpy as np
   ...: import tabmat
   ...:
   ...: for dtype in [np.float32, np.float64]:
   ...:     X = np.array(
   ...:         [
   ...:             [46.231056, 126.05263, 144.46439],
   ...:             [46.231224, 128.66818, 0.7667693],
   ...:             [46.231186, 104.97506, 193.8872],
   ...:             [46.230835, 130.10156, 143.88954],
   ...:             [46.230896, 116.76007, 7.5629334],
   ...:         ],
   ...:         dtype=dtype,
   ...:     )
   ...:     v = np.array([0.12428328, 0.67062443, 0.6471895, 0.6153851, 0.38367754], dtype=dtype)
   ...:
   ...:     weights = np.full(X.shape[0], 1 / X.shape[0], dtype=dtype)
   ...:
   ...:     stmat, out_means, col_stds = tabmat.DenseMatrix(X).standardize(weights, True, True)
   ...:
   ...:     print(stmat.toarray().T @ v)
   ...:     print(stmat.transpose_matvec(v))
   ...:
   ...:     # compute by hand
   ...:     res = np.zeros(X.shape[1], dtype=dtype)
   ...:     for col in range(X.shape[1]):
   ...:         res[col] += (stmat.shift[col] + stmat.mult[col] * stmat.mat.toarray()[:, col]) @ v
   ...:
   ...:     print(res)
   ...:     print("\n")
   ...:
[ 3.6124271e-05 -1.4760569e-01 -5.2153736e-02]
[ 4.5776367e-05 -1.4760208e-01 -5.2153349e-02]
[ 3.612428e-05  -1.476057e-01  -5.215369e-02]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
[ 0.25974476 -0.14760692 -0.05215349]
```

I'm surprised you get different results for float64.
---

Yeah, never mind. I had some local changes. You can see in CI runs that the float64 results are consistent across platforms.
---

This is with #408 included. I guess centering the data explicitly results in an additional copy.