Replies: 1 comment
One thing I like about the approach I proposed here is that it opens us up to better supporting other matrix backends, such as sparse fixed effects or rectangular full-packed storage.
> [!NOTE]
> For now, I'll focus on the linear mixed model. While the generalized linear mixed model uses a `LinearMixedModel` internally for storage, re-implementing PIRLS and nAGQ on the GPU would require a quite different approach than the direct computation involved in each step of fitting the linear mixed model.

## Background
Broadly speaking, MixedModels.jl's speed on the CPU comes from extensive specialization on matrices with a particular form. Multiple dispatch is very convenient for this: we can specialize the matrix multiplication `A*B` not just for general sparsity but for the particular sparsity patterns of `A` and `B`. This is one of the advancements relative to its predecessor in R, lme4, though both rely on the same basic insight (formulating the model-fitting problem as a penalized least squares problem that sparse methods can be applied to, instead of the generalized least squares problem that software like `nlme` uses). In particular, we specialize on the sparsity patterns that occur in classical single-membership models: diagonal matrices and block-sparse matrices.[^1]
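As a toy illustration of the dispatch idea (this snippet is purely illustrative and not MixedModels.jl code), the same `mul!` call is routed to a method specialized on the structural type of its arguments:

```julia
using LinearAlgebra

A = Diagonal(rand(1000))   # the sparsity pattern is encoded in the type
B = rand(1000, 50)
C = zeros(1000, 50)

# Dispatch picks a Diagonal-specific method: a row scaling of B rather than a
# general dense multiply, with no fill-in and no wasted work on zeros.
mul!(C, A, B)
```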
## Specialized sparse storage of blocks of `L`
The specialized sparsity patterns are represented using three types:
- `Diagonal` from the LinearAlgebra stdlib
- `BlockedSparse`
- `UniformBlockDiagonal`
Various specializations of `LinearAlgebra.mul!` for these types and their adjoints are defined in `linalg.jl`. Additionally, `linalg/rankUpdate.jl` defines `rankUpdate!` methods for combinations of these types. Realistically, we could replace most calls to `rankUpdate!`, and thus the associated methods, with calls to 5-argument `mul!` and "just" provide a few additional methods of `mul!` for our specialized types. However, `rankUpdate!` predates 5-argument `mul!` by several Julia versions and we never got around to seeing whether it would be faster to swap. Finally, `linalg/logdet.jl` and `linalg/cholUnblocked.jl` provide specialized implementations of an in-place log determinant and blocked Cholesky factorization, respectively.[^2]
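To make that concrete: assuming `rankUpdate!(C, A, α, β)` computes something of the form `C := α*A*A' + β*C` (the exact signature and wrapper types in MixedModels.jl may differ), the same update can be expressed with 5-argument `mul!`:

```julia
using LinearAlgebra

A = randn(6, 3)
C = Matrix{Float64}(I, 6, 6)

# 5-argument mul! computes C := α*A*B + β*C in place; with B = A' this is
# exactly a rank-k update, i.e. the operation rankUpdate! specializes.
mul!(C, A, A', 2.0, 1.0)
C ≈ 2 * A * A' + I    # true
```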
All of these types are used to represent blocks of `L`, the lower Cholesky factor, at the heart of the parameter estimation. Incredibly, the biggest computational expense in each step of the optimization process is the call to `updateL!` -- the actual evaluation of the objective function after `updateL!` is extremely fast, even for extremely large and complex models.

> [!NOTE]
> `L` is currently stored in "fully blocked" form. You can explore the structure of blocks for a given model by examining `BlockDescription(model)`.
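For example (a sketch: the `insteval` dataset and this formula are illustrative choices, not from the original discussion), the blocked structure of a crossed-design model can be inspected with:

```julia
using MixedModels

insteval = MixedModels.dataset(:insteval)
m = fit(MixedModel, @formula(y ~ 1 + service + (1 | s) + (1 | d)), insteval)

# One row per block-row, showing the storage type of each block
# (Diagonal, BlockedSparse, UniformBlockDiagonal, dense, ...).
BlockDescription(m)
```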
However, the block structure can be simplified into three blocks. This representation is actually not that much less efficient for some things. For a crossed design with two grouping variables, there is already so much fill-in that the `L[2,2]` block is dense (and we know that `updateL!` spends most of its time there[^3]). Moreover, it can sometimes be faster to use dense matrices, even for sparse data, because of vectorization and the cost of memory access -- this holds doubly so on the GPU. In other words, it may be worthwhile examining the 3-block-L representation as the representation for use on the GPU. We have a method for converting the fully blocked representation to the 3-block representation, but `updateL!` itself still strongly assumes the fully blocked representation.
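To spell out what the 3-block view amounts to, here is a deliberately naive, fully dense sketch of the penalized least squares system (assuming `A = hcat(Z, X, y)'hcat(Z, X, y)`, `Λ` the q×q relative covariance factor for the random effects, and an identity "inflation" on the random-effects block); this is a conceptual restatement, not the actual `updateL!` algorithm:

```julia
using LinearAlgebra

# L is the lower Cholesky factor of Λfull'*A*Λfull with I added to the leading
# q×q (random-effects) block, where Λfull acts as Λ on the first q columns and
# as the identity on the fixed-effects and response columns.
function dense_updateL(A::AbstractMatrix, Λ::AbstractMatrix, q::Integer)
    n = size(A, 1)
    Λfull = cat(Λ, Matrix{eltype(Λ)}(I, n - q, n - q); dims = (1, 2))
    Ω = Λfull' * Matrix(A) * Λfull
    for i in 1:q
        Ω[i, i] += 1          # "inflate" the random-effects diagonal block
    end
    return cholesky(Symmetric(Ω, :L)).L
end
```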
## `updateL!`, `ReMat` and `FeMat`
`updateL!` updates `L` using the matrix `A` (the row-major packed lower triangle of `hcat(Z,X,y)'hcat(Z,X,y)`, an efficient way of storing essentially the cross product of the model matrices) and the various `ReMat`s, which are the sections of the random-effects model matrix generated by each random-effects term. There is a lot of specialization around the storage and multiplication of these matrices, but my suspicion is that they can't be made much more optimal by swapping to the GPU. Why? Because their only use in `updateL!` is a call to `copyscaleinflate!`, which, as far as I know, isn't the slow spot in that function. It would be good to profile calls to `updateL!` on moderately complex models to test this theory, though.
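A sketch of such a profiling run (the dataset and formula are placeholder assumptions; a genuinely "moderately complex" model would be a better target than `sleepstudy`):

```julia
using MixedModels, Profile

sleepstudy = MixedModels.dataset(:sleepstudy)
m = fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)), sleepstudy)

# Re-run the per-iteration work many times so that copyscaleinflate!,
# rankUpdate! and the blocked Cholesky show up clearly in the profile.
Profile.clear()
@profile for _ in 1:10_000
    MixedModels.updateL!(m)
end
Profile.print(; mincount = 20)
```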
Finally, one more matrix type, `FeMat`, stores the fixed-effects model matrix with the response vector (`y`) appended as an additional column. `FeMat` is a very thin wrapper and its storage supports any `AbstractMatrix`, but we currently only support constructing a classic dense `Matrix` via the formula interface. For some types of very large but very sparse matrices, it might make sense to have an option to make this storage sparse.

## But how to GPU?
Given all of this, how should we go about adding support for the GPU?

First, two observations:

- essentially all of the per-iteration computational cost is in `updateL!`
- the storage involved is already expressed in terms of `AbstractMatrix`, so GPU array types can in principle be slotted in
Constructing a model that uses GPU storage is then relatively straightforward: we just use GPU matrices instead of CPU matrices in `L`. However, there are a few issues to overcome there:

- an option for GPU storage needs to be exposed in the `LinearMixedModel` constructor (and then we need to make sure that `fit` forwards any relevant keyword arguments to the constructor)
- appropriate methods of `rankUpdate!` will need to be defined for the GPU storage types [methods of `LinearAlgebra.mul!` may also be relevant]; see the sketch after this list
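A minimal sketch of what such a method could look like, assuming CUDA.jl for the GPU arrays and assuming a `rankUpdate!(C, A, α, β)`-style signature computing `C := α*A*A' + β*C`; the argument types MixedModels.jl actually dispatches on (e.g. `Symmetric` wrappers or `BlockedSparse`) would need to be matched in a real implementation:

```julia
using CUDA, LinearAlgebra
import MixedModels

# Hypothetical GPU method: for plain dense CuMatrix blocks, 5-argument mul!
# already lowers to a CUBLAS gemm, so the method body can be very small.
function MixedModels.rankUpdate!(C::CuMatrix{T}, A::CuMatrix{T},
                                 α::Number, β::Number) where {T}
    mul!(C, A, A', α, β)
    return C
end
```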
For that last item, I strongly suspect that the 3-block-L formulation will be much more convenient for the GPU and may actually be faster in some cases on the CPU. (In particular, I suspect it will be faster on machines with very many cores and a huge amount of memory when dealing with very large models.) So the first action item:

> [!TIP]
> Expose an option to swap an entire model to the 3-block-L representation and implement an appropriate method of `updateL!` to support it.

Having done that, my next steps would be:
- define the necessary `rankUpdate!` and `mul!` methods for the GPU storage types

For the discussion, please take advantage of multiple threads for different questions/comments.
[^1]: Although this specialization enables certain speed-ups, it also means that extending support to multimembership models -- such as lmerMultimember does for lme4 -- would require rather extensive changes in order to convert things to general sparse matrices. Similarly, the boost from moving to GPU-based methods may be best obtained by swapping to a general sparse representation on the GPU. There, it may just be faster to multiply everything rather than keep the specialized branches that we have on the CPU, because GPUs are comparatively bad at branching but fast at matrix multiplication.

[^2]: The in-place evaluation makes things very fast (no allocations!), but it also creates a few problems for things like automatic differentiation. We use a gradient-free optimizer, so this isn't a problem for fitting, but it might be interesting to provide a non-in-place version for use in automatic differentiation of certain derived quantities of interest, such as approximations to the denominator degrees of freedom.

[^3]: https://embraceuncertaintybook.com/largescaleobserved.html#sec-lrgobsmods