
Matrix decompositions inside CUDA kernel? #297

Open
coreqode opened this issue Aug 24, 2024 · 3 comments

@coreqode

Is it possible to do matrix decompositions of small matrices (say, 9x9 or 12x12) inside a CUDA kernel? My end goal is to run Newton's method inside the kernel, which requires the inverse of a matrix.

@coreqode coreqode added the question The issue author requires information label Aug 24, 2024
@mmacklin
Collaborator

Hi @coreqode, currently we only support small (3x3) decompositions, but more general decompositions are planned, most likely starting with symmetric ones (i.e. Cholesky), and possibly LU as well.

Would Cholesky work for your problem, or are there other types of decompositions you need at these larger sizes?
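
In the meantime, the existing 3x3 builtins can already be called per thread. A minimal sketch, assuming the output-parameter form of wp.svd3 (array names here are illustrative):

import warp as wp


@wp.kernel
def svd_3x3(mats: wp.array(dtype=wp.mat33), Us: wp.array(dtype=wp.mat33),
            sigmas: wp.array(dtype=wp.vec3), Vs: wp.array(dtype=wp.mat33)):
    tid = wp.tid()

    # Outputs are written into locally declared matrices/vectors
    U = wp.mat33()
    sigma = wp.vec3()
    V = wp.mat33()

    # Built-in 3x3 singular value decomposition, evaluated inside the kernel
    wp.svd3(mats[tid], U, sigma, V)

    Us[tid] = U
    sigmas[tid] = sigma
    Vs[tid] = V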

@gdaviet
Contributor

gdaviet commented Aug 26, 2024

Note that in the meantime it is also possible to implement said decompositions directly in Warp. For instance, here are some QR and Cholesky decomposition routines that I have been using:

from typing import Any

import warp as wp


@wp.func
def householder_qr_decomposition(A: Any):
    """
    QR decomposition of a square matrix using Householder reflections

    Returns Q and R such that Q R = A, Q orthonormal (such that QQ^T = Id), R upper triangular
    """

    x = type(A[0])()
    Q = wp.identity(n=type(x).length, dtype=A.dtype)

    zero = x.dtype(0.0)
    two = x.dtype(2.0)

    for i in range(type(x).length):
        for k in range(type(x).length):
            x[k] = wp.select(k < i, A[k, i], zero)

        alpha = wp.length(x) * wp.sign(x[i])
        x[i] += alpha
        two_over_x_sq = wp.select(alpha == zero, two / wp.length_sq(x), zero)

        A -= wp.outer(two_over_x_sq * x, x * A)
        Q -= wp.outer(Q * x, two_over_x_sq * x)

    return Q, A


@wp.func
def compute_ldlt(A: Any):
    """LDLT factorization of a symmetric matrix A
    D stored in diagonal
    L stored in lower-triangular part"""

    D = type(wp.get_diag(A))(A.dtype(0.0))

    for j in range(D.length):
        tmp = wp.cw_mul(A[j], D)

        D[j] = A[j, j] - wp.dot(tmp, A[j])

        for i in range(j + 1, D.length):
            A[i, j] = A[i, j] - wp.dot(A[i], tmp)

        for i in range(j + 1, D.length):
            A[i, j] = A[i, j] / D[j]

        A[j, j] = D[j]

    return A
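
For completeness, a hypothetical sketch of how these routines could feed the Newton step from the original question, assuming a fixed 9x9 system; the ldlt_solve helper and the type aliases are illustrative, not part of Warp:

from typing import Any

import warp as wp

# Illustrative fixed-size types for a 9x9 Newton system; adjust to your problem size
mat99 = wp.types.matrix(shape=(9, 9), dtype=wp.float32)
vec9 = wp.types.vector(length=9, dtype=wp.float32)


@wp.func
def ldlt_solve(F: Any, b: Any):
    """Solve (L D L^T) x = b given the packed factorization from compute_ldlt:
    unit lower-triangular L below the diagonal, D stored on the diagonal."""
    n = type(b).length
    y = b

    # Forward substitution: L y = b (L has an implicit unit diagonal)
    for i in range(n):
        for j in range(i):
            y[i] -= F[i, j] * y[j]

    # Diagonal solve: D z = y
    for i in range(n):
        y[i] = y[i] / F[i, i]

    # Backward substitution: L^T x = z
    for k in range(n):
        i = n - 1 - k
        for j in range(i + 1, n):
            y[i] -= F[j, i] * y[j]

    return y


@wp.kernel
def newton_step(H: wp.array(dtype=mat99), g: wp.array(dtype=vec9), dx: wp.array(dtype=vec9)):
    tid = wp.tid()

    # Factor a per-thread copy of the Hessian, then solve for the Newton direction
    F = compute_ldlt(H[tid])
    dx[tid] = -ldlt_solve(F, g[tid])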

@coreqode
Author

coreqode commented Aug 26, 2024

@mmacklin Thanks for the reply. For my use case the Hessian is mostly SPD, so Cholesky (the LDLT variant) would work. SVD would also be very helpful, since it can be used for polar decomposition and other things.

Thanks @gdaviet for sharing the code (especially the QR). I have been using Cholesky in Taichi the same way, since it doesn't support nested kernels. However, I was wondering whether nested kernels could be used here for more speedup, since some of the energy Hessians are 18x18 matrices.
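
(For illustration only: since the routines above are generic over the matrix type, an 18x18 case could presumably be handled by declaring a fixed-size type and passing it straight to compute_ldlt:)

import warp as wp

# Hypothetical fixed-size aliases for an 18x18 energy Hessian block
mat18 = wp.types.matrix(shape=(18, 18), dtype=wp.float32)
vec18 = wp.types.vector(length=18, dtype=wp.float32)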

@shi-eric shi-eric added the enhancement New feature or request label Aug 29, 2024