Optimizing `par` fast path #186
Replies: 5 comments
-
I implemented the primitive support for heartbeat tokens here: #187. Not sure how much impact it has on performance, but I didn't really see any downsides, and it was easy to implement. In extreme cases it should be able to give us a few percent improvement.
-
Following up on the discussion of allocations on the fast path. Here's the SSA2 code for the fast path of …

There is just one object allocation, corresponding to … coming from …

Here's the SSA2 code for the fast path of …

There are just two object allocations. One is the same … This makes a certain amount of sense. We can rewrite the …

Certainly, right after closure conversion, the … However, the need for the explicit closure for use in (the inlined code for) … Note that code motion can increase register pressure (by extending the live range of variables), so good code-motion algorithms try to take into account the liveness ranges of the variables contributing to the code to be moved. This is especially important in a GCed language; for example, unconditionally sinking a …
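To make the code-motion idea concrete, here is a minimal source-level sketch (not the actual SSA2 transformation; `runFast`, `promote`, and the record shape are hypothetical stand-ins): the record is only needed on the slow path, so its allocation can be sunk into the branch that promotes, leaving the fast path allocation-free.

```sml
(* Illustrative sketch only: a source-level analogue of sinking an allocation into
   the branch that needs it. runFast and promote are hypothetical stubs. *)
fun runFast f = f ()
fun promote {func, result} = (result := SOME (func ()); valOf (!result))

(* Before: the record is allocated unconditionally, even on the fast path. *)
fun parBefore (f, tokens) =
  let
    val joinData = {func = f, result = ref NONE}  (* heap allocation on every call *)
  in
    if tokens = 0 then runFast f else promote joinData
  end

(* After: the allocation is sunk into the slow-path branch that uses it, so the
   fast path (tokens = 0) allocates nothing. *)
fun parAfter (f, tokens) =
  if tokens = 0 then runFast f
  else promote {func = f, result = ref NONE}
```

As noted above, whether such sinking is actually profitable also depends on how it affects the live ranges of the values feeding the moved code.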
-
I implemented storing the PCall alternate return addresses in the `frameInfo` table.
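As a rough conceptual picture of what this changes (all names and types here are hypothetical; the real mechanism lives in the backend's frame layout and the `frameInfo` table, not in SML source): instead of each `pcall` frame storing its alternate return addresses on the call-stack, the frame stores only a frame index, and the alternates are recovered from a static per-frame-kind table.

```sml
(* Conceptual sketch only; hypothetical names, not MPL's actual representation. *)

(* Before: every pcall frame carries its alternate return addresses on the stack. *)
type pcallFrameBefore =
  {returnAddress : word, altReturn1 : word, altReturn2 : word}

(* After: the frame carries only a frame index; the alternates live in a static,
   per-frame-kind table (analogous to the frameInfo table), built once rather than
   written into every frame. *)
type pcallFrameAfter = {returnAddress : word, frameIndex : int}

val alternateReturns : {altReturn1 : word, altReturn2 : word} vector =
  Vector.fromList [{altReturn1 = 0w0, altReturn2 = 0w0}]  (* placeholder entry *)

fun alternatesOf ({frameIndex, ...} : pcallFrameAfter) =
  Vector.sub (alternateReturns, frameIndex)
```

Presumably the win is that the static table is written once, so the fast path no longer has to push the extra return addresses into each frame.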
-
I've been experimenting with an alternative approach to implementing `par`:

```sml
fun par(f, g) =
  let
    fun f'() =
      ( if currentSpareHeartbeatTokens() = 0 then
          ()
        else
          tryPromoteNow()
      ; f()
      )
  in
    pcallFork(f', g)
  end
```

This succeeds in avoiding the closure allocation on the fast path in simple codes (e.g., fully parallel `fib`), and therefore could be a simpler alternative to implementing a code-motion optimization (such as suggested in #186 (comment)).

However, a problem: with this change, the … I tried optimizing the …
-
As of today (July 2, 2024), we've made a number of improvements (see the Completed list in the original post).

I've measured the performance impact of these changes on various benchmarks from … In general, the results are really good: we see a big performance improvement on …

Improvement since v0.5 on …
-
Collecting ideas for optimizing the fast path of `ForkJoin.par`, to further improve upon automatic parallelism management (APM) and close the performance gap with manually tuned code. For example, see #184 --- this yielded a small improvement.

Quite a few ideas have already been implemented; see the Completed list below. Some benchmarking results are shown in one of the comments below (link), and the results are generally very good.
Any more ideas still to do?
Completed (last updated: July 2, 2024)

- **Reducing heap allocations** (DONE: #195). As noted here, there seems to be an opportunity for reducing the number of heap allocations on the fast path. Reducing the number of heap allocations is beneficial along multiple dimensions: …

- **Optimizing closure representations** (DONE: #193). The code generated for functions that make use of parallelism has a hidden overhead: larger closures for recursive functions, specifically to store scheduler data. Some notes: we observed in `fib` that the number of arguments to each recursive call can be as many as 15 or more; these additional arguments appear to be due to a large closure (containing scheduler data) being flattened.

- **Avoid heap allocation for universal type to store join data** (DONE: #184).

- **Move return addresses off the call-stack** (DONE: #188). At POPL, @JohnReppy mentioned an idea: we could move the additional return addresses of `pcall` out of the call-stack and instead store these in a static lookup table. For MLton/MPL specifically, we could do this using the `frameInfo` table. This likely wouldn't be terribly difficult to implement, and might yield a small improvement.

- **Primitive support for heartbeat tokens** (DONE: #187). MPL queries the number of current spare heartbeat tokens at every `pcall`. This is part of the token management algorithm described in the APM paper. To implement this, we currently make a C call to `GC_currentSpareHeartbeats` on the fast path. See here. (A rough source-level sketch of this C-call approach is shown after this list.) This could be optimized by turning `currentSpareHeartbeats` into a `_prim`, which could then be implemented by directly reading `s->spareHeartbeats` from the `GCState`. Going even further, we could consider locally caching the current number of heartbeat tokens (similar to `StackTop` and `Frontier` in the generated code), to keep track of this value in a register and make these queries extremely fast. The cost of losing a register might not be worth it, though.