Replies: 1 comment
One thing I like about the approach I proposed here is that it opens us up to better supporting other matrix backends, such as sparse fixed effects or rectangular full-packed storage.
> [!NOTE]
> For now, I'll focus on the linear mixed model. While the generalized linear mixed model uses a `LinearMixedModel` internally for storage, re-implementing PIRLS and nAGQ on the GPU would require a quite different approach than the direct computation involved in each step of fitting the linear mixed model.

## Background
Broadly speaking, MixedModels.jl's speed on the CPU comes from extensive specialization on matrices with a particular form. Multiple dispatch is very convenient for this: we can specialize the matrix multiplication `A*B` not just for general sparsity but for the particular sparsity patterns of `A` and `B`. This is one of the advancements relative to its predecessor in R, lme4, though both rely on the same basic insight (formulating the model-fitting problem as a penalized least squares problem that sparse methods can be applied to, instead of the generalized least squares problem that software like `nlme` uses). In particular, we specialize on the sparsity patterns that occur in classical single-membership models: diagonal matrices and block-sparse matrices.[^1]
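As a toy illustration of the dispatch idea (this snippet is purely illustrative and not MixedModels.jl code), the same `mul!` call is routed to a method specialized on the structural type of its arguments:

```julia
using LinearAlgebra

A = Diagonal(rand(1000))   # the sparsity pattern is encoded in the type
B = rand(1000, 50)
C = zeros(1000, 50)

# Dispatch picks a Diagonal-specific method: a row scaling of B rather than a
# general dense multiply, with no fill-in and no wasted work on zeros.
mul!(C, A, B)
```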
## Specialized sparse storage of blocks of `L`
The specialized sparsity patterns are represented using three types:
- `Diagonal` from the LinearAlgebra stdlib
- `BlockedSparse`
- `UniformBlockDiagonal`
Various specializations of `LinearAlgebra.mul!` for these types and their adjoints are defined in `linalg.jl`. Additionally, `linalg/rankUpdate.jl` defines `rankUpdate!` methods for combinations of these types. Realistically, we could replace most calls to `rankUpdate!`, and thus the associated methods, with calls to 5-argument `mul!` and "just" provide a few additional methods of `mul!` for our specialized types. However, `rankUpdate!` predates 5-argument `mul!` by several Julia versions and we never got around to seeing whether it would be faster to swap. Finally, `linalg/logdet.jl` and `linalg/cholUnblocked.jl` provide specialized implementations of an in-place log determinant and blocked Cholesky factorization, respectively.[^2]
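To make that concrete: assuming `rankUpdate!(C, A, α, β)` computes something of the form `C := α*A*A' + β*C` (the exact signature and wrapper types in MixedModels.jl may differ), the same update can be expressed with 5-argument `mul!`:

```julia
using LinearAlgebra

A = randn(6, 3)
C = Matrix{Float64}(I, 6, 6)

# 5-argument mul! computes C := α*A*B + β*C in place; with B = A' this is
# exactly a rank-k update, i.e. the operation rankUpdate! specializes.
mul!(C, A, A', 2.0, 1.0)
C ≈ 2 * A * A' + I    # true
```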
All of these types are used to represent blocks of `L`, the lower Cholesky factor, at the heart of the parameter estimation. Incredibly, the biggest computational expense in each step of the optimization process is the call to `updateL!` -- the actual evaluation of the objective function after `updateL!` is extremely fast, even for extremely large and complex models.

> [!NOTE]
> `L` is currently stored in "fully blocked" form. You can explore the structure of blocks for a given model by examining `BlockDescription(model)`.
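For example (a sketch: the `insteval` dataset and this formula are illustrative choices, not from the original discussion), the blocked structure of a crossed-design model can be inspected with:

```julia
using MixedModels

insteval = MixedModels.dataset(:insteval)
m = fit(MixedModel, @formula(y ~ 1 + service + (1 | s) + (1 | d)), insteval)

# One row per block-row, showing the storage type of each block
# (Diagonal, BlockedSparse, UniformBlockDiagonal, dense, ...).
BlockDescription(m)
```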
However, the block structure can be simplified into three blocks. This representation is actually not that much less efficient for some things. For a crossed design with two grouping variables, there is already so much fill-in that the `L[2,2]` block is dense (and we know that `updateL!` spends most of its time there[^3]). Moreover, it can sometimes be faster to use dense matrices, even for sparse data, because of vectorization and the cost of memory access -- this holds doubly so on the GPU. In other words, it may be worthwhile examining the 3-block-L representation as the representation for use on the GPU. We have a method for converting the fully blocked representation to the 3-block representation, but `updateL!` itself still strongly assumes the fully blocked representation.
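To spell out what the 3-block view amounts to, here is a deliberately naive, fully dense sketch of the penalized least squares system (assuming `A = hcat(Z, X, y)'hcat(Z, X, y)`, `Λ` the q×q relative covariance factor for the random effects, and an identity "inflation" on the random-effects block); this is a conceptual restatement, not the actual `updateL!` algorithm:

```julia
using LinearAlgebra

# L is the lower Cholesky factor of Λfull'*A*Λfull with I added to the leading
# q×q (random-effects) block, where Λfull acts as Λ on the first q columns and
# as the identity on the fixed-effects and response columns.
function dense_updateL(A::AbstractMatrix, Λ::AbstractMatrix, q::Integer)
    n = size(A, 1)
    Λfull = cat(Λ, Matrix{eltype(Λ)}(I, n - q, n - q); dims = (1, 2))
    Ω = Λfull' * Matrix(A) * Λfull
    for i in 1:q
        Ω[i, i] += 1          # "inflate" the random-effects diagonal block
    end
    return cholesky(Symmetric(Ω, :L)).L
end
```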
## `updateL!`, `ReMat` and `FeMat`
`updateL!` updates `L` using the matrix `A` (the row-major packed lower triangle of `hcat(Z,X,y)'hcat(Z,X,y)`, an efficient way of storing essentially the cross product of the model matrices) and the various `ReMat`s, which are the sections of the random-effects model matrix generated by each random-effects term. There is a lot of specialization around the storage and multiplication of these matrices, but my suspicion is that they can't be made much more optimal by swapping to the GPU. Why? Because their only use in `updateL!` is a call to `copyscaleinflate!`, which, as far as I know, isn't the slow spot in that function. It would be good to profile calls to `updateL!` on moderately complex models to test this theory, though.
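A sketch of such a profiling run (the dataset and formula are placeholder assumptions; a genuinely "moderately complex" model would be a better target than `sleepstudy`):

```julia
using MixedModels, Profile

sleepstudy = MixedModels.dataset(:sleepstudy)
m = fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)), sleepstudy)

# Re-run the per-iteration work many times so that copyscaleinflate!,
# rankUpdate! and the blocked Cholesky show up clearly in the profile.
Profile.clear()
@profile for _ in 1:10_000
    MixedModels.updateL!(m)
end
Profile.print(; mincount = 20)
```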
Finally, one more matrix type, `FeMat`, stores the fixed-effects model matrix with the response vector (`y`) appended as an additional column. `FeMat` is a very thin wrapper and its storage supports any `AbstractMatrix`, but we currently only support constructing a classic dense `Matrix` via the formula interface. For some types of very large but very sparse matrices, it might make sense to have an option to make this storage sparse.

## But how to GPU?
Given all of this, how should we go about adding support for the GPU?

First, two observations:

- essentially all of the per-iteration computational cost is in `updateL!`
- the storage involved is already expressed in terms of `AbstractMatrix`, so GPU array types can in principle be slotted in
Constructing a model that uses GPU storage is then relatively straightforward: we just use GPU matrices instead of CPU matrices in `L`. However, there are a few issues to overcome there:

- an option for GPU storage needs to be exposed in the `LinearMixedModel` constructor (and then we need to make sure that `fit` forwards any relevant keyword arguments to the constructor)
- appropriate methods of `rankUpdate!` will need to be defined for the GPU storage types [methods of `LinearAlgebra.mul!` may also be relevant]; see the sketch after this list
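A minimal sketch of what such a method could look like, assuming CUDA.jl for the GPU arrays and assuming a `rankUpdate!(C, A, α, β)`-style signature computing `C := α*A*A' + β*C`; the argument types MixedModels.jl actually dispatches on (e.g. `Symmetric` wrappers or `BlockedSparse`) would need to be matched in a real implementation:

```julia
using CUDA, LinearAlgebra
import MixedModels

# Hypothetical GPU method: for plain dense CuMatrix blocks, 5-argument mul!
# already lowers to a CUBLAS gemm, so the method body can be very small.
function MixedModels.rankUpdate!(C::CuMatrix{T}, A::CuMatrix{T},
                                 α::Number, β::Number) where {T}
    mul!(C, A, A', α, β)
    return C
end
```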
For that last item, I strongly suspect that the 3-block-L formulation will be much more convenient for the GPU and may actually be faster in some cases on the CPU. (In particular, I suspect it will be faster on machines with very many cores and a huge amount of memory when dealing with very large models.) So the first action item:

> [!TIP]
> Expose an option to swap an entire model to the 3-block-L representation and implement an appropriate method of `updateL!` to support it.

Having done that, my next steps would be:
- define the necessary `rankUpdate!` and `mul!` methods for the GPU storage types

For the discussion, please take advantage of multiple threads for different questions/comments.
[^1]: Although this specialization enables certain speed-ups, it also means that extending support to multimembership models -- such as lmerMultimember does for lme4 -- would require rather extensive changes in order to convert things to general sparse matrices. Similarly, the boost from moving to GPU-based methods may be best obtained by swapping to a general sparse representation on the GPU. There, it may just be faster to multiply everything rather than keep the specialized branches that we have on the CPU, because GPUs are comparatively bad at branching but fast at matrix multiplication.

[^2]: The in-place evaluation makes things very fast (no allocations!), but it also creates a few problems for things like automatic differentiation. We use a gradient-free optimizer, so this isn't a problem for fitting, but it might be interesting to provide a non-in-place version for use in automatic differentiation of certain derived quantities of interest, such as approximations to the denominator degrees of freedom.

[^3]: https://embraceuncertaintybook.com/largescaleobserved.html#sec-lrgobsmods