
Matrix decompositions inside CUDA kernel? #297

Open
coreqode opened this issue Aug 24, 2024 · 3 comments

@coreqode

Is it possible to do matrix decompositions of small matrices (say, 9x9 or 12x12) inside a CUDA kernel? My end goal is to run Newton's method inside the kernel, which requires the inverse of a matrix.

@coreqode coreqode added the question The issue author requires information label Aug 24, 2024
@mmacklin
Collaborator

Hi @coreqode, currently we only support small (3x3) decompositions, but more general decompositions are planned, most likely starting with symmetric ones (i.e. Cholesky), and possibly LU as well.

Would Cholesky work for your problem, or are there other types of decompositions you need at these larger sizes?
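
In the meantime, the existing 3x3 builtins can already be called per thread. A minimal sketch, assuming the output-parameter form of wp.svd3 (array names here are illustrative):

import warp as wp


@wp.kernel
def svd_3x3(mats: wp.array(dtype=wp.mat33), Us: wp.array(dtype=wp.mat33),
            sigmas: wp.array(dtype=wp.vec3), Vs: wp.array(dtype=wp.mat33)):
    tid = wp.tid()

    # Outputs are written into locally declared matrices/vectors
    U = wp.mat33()
    sigma = wp.vec3()
    V = wp.mat33()

    # Built-in 3x3 singular value decomposition, evaluated inside the kernel
    wp.svd3(mats[tid], U, sigma, V)

    Us[tid] = U
    sigmas[tid] = sigma
    Vs[tid] = V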

@gdaviet
Contributor

gdaviet commented Aug 26, 2024

Note that in the meantime it is also possible to implement said decompositions directly in Warp. For instance, here are some QR and Cholesky decomposition routines that I have been using:

from typing import Any

import warp as wp


@wp.func
def householder_qr_decomposition(A: Any):
    """
    QR decomposition of a square matrix using Householder reflections

    Returns Q and R such that Q R = A, Q orthonormal (such that QQ^T = Id), R upper triangular
    """

    x = type(A[0])()
    Q = wp.identity(n=type(x).length, dtype=A.dtype)

    zero = x.dtype(0.0)
    two = x.dtype(2.0)

    for i in range(type(x).length):
        for k in range(type(x).length):
            x[k] = wp.select(k < i, A[k, i], zero)

        alpha = wp.length(x) * wp.sign(x[i])
        x[i] += alpha
        two_over_x_sq = wp.select(alpha == zero, two / wp.length_sq(x), zero)

        A -= wp.outer(two_over_x_sq * x, x * A)
        Q -= wp.outer(Q * x, two_over_x_sq * x)

    return Q, A


@wp.func
def compute_ldlt(A: Any):
    """LDLT factorization of a symmetric matrix A
    D stored in diagonal
    L stored in lower-triangular part"""

    D = type(wp.get_diag(A))(A.dtype(0.0))

    for j in range(D.length):
        tmp = wp.cw_mul(A[j], D)

        D[j] = A[j, j] - wp.dot(tmp, A[j])

        for i in range(j + 1, D.length):
            A[i, j] = A[i, j] - wp.dot(A[i], tmp)

        for i in range(j + 1, D.length):
            A[i, j] = A[i, j] / D[j]

        A[j, j] = D[j]

    return A
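
For completeness, a hypothetical sketch of how these routines could feed the Newton step from the original question, assuming a fixed 9x9 system; the ldlt_solve helper and the type aliases are illustrative, not part of Warp:

from typing import Any

import warp as wp

# Illustrative fixed-size types for a 9x9 Newton system; adjust to your problem size
mat99 = wp.types.matrix(shape=(9, 9), dtype=wp.float32)
vec9 = wp.types.vector(length=9, dtype=wp.float32)


@wp.func
def ldlt_solve(F: Any, b: Any):
    """Solve (L D L^T) x = b given the packed factorization from compute_ldlt:
    unit lower-triangular L below the diagonal, D stored on the diagonal."""
    n = type(b).length
    y = b

    # Forward substitution: L y = b (L has an implicit unit diagonal)
    for i in range(n):
        for j in range(i):
            y[i] -= F[i, j] * y[j]

    # Diagonal solve: D z = y
    for i in range(n):
        y[i] = y[i] / F[i, i]

    # Backward substitution: L^T x = z
    for k in range(n):
        i = n - 1 - k
        for j in range(i + 1, n):
            y[i] -= F[j, i] * y[j]

    return y


@wp.kernel
def newton_step(H: wp.array(dtype=mat99), g: wp.array(dtype=vec9), dx: wp.array(dtype=vec9)):
    tid = wp.tid()

    # Factor a per-thread copy of the Hessian, then solve for the Newton direction
    F = compute_ldlt(H[tid])
    dx[tid] = -ldlt_solve(F, g[tid])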

@coreqode
Author

coreqode commented Aug 26, 2024

@mmacklin Thanks for the reply. For my use case the Hessian is mostly SPD, so Cholesky (the LDLT variant) would work. SVD would also be very helpful, since it can be used for polar decomposition and other things.

Thanks @gdaviet for sharing the code (especially the QR). I have been using Cholesky in Taichi the same way, since it doesn't support nested kernels. However, I was wondering whether nested kernels could be used here for more speedup, since some of the energy Hessians are 18x18 matrices.
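
(For illustration only: since the routines above are generic over the matrix type, an 18x18 case could presumably be handled by declaring a fixed-size type and passing it straight to compute_ldlt:)

import warp as wp

# Hypothetical fixed-size aliases for an 18x18 energy Hessian block
mat18 = wp.types.matrix(shape=(18, 18), dtype=wp.float32)
vec18 = wp.types.vector(length=18, dtype=wp.float32)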

@shi-eric shi-eric added the enhancement New feature or request label Aug 29, 2024