Optimizing `par` fast path #186
Replies: 5 comments
-
I implemented the primitive support for heartbeat tokens here: #187. Not sure how much impact it has on performance, but I didn't really see any downsides, and it was easy to implement. In extreme cases it should be able to give us a few percent improvement.
-
Following up on the discussion of allocations on the fast path. Here's the SSA2 code for the fast path of …

There is just one object allocation, corresponding to … coming from …

Here's the SSA2 code for the fast path of …

There are just two object allocations. One is the same … This makes a certain amount of sense. We can rewrite the …

Certainly, right after closure conversion, the … However, the need for the explicit closure for use in (the inlined code for) … Note that code motion can increase register pressure (by extending the live range of variables), so good code-motion algorithms try to take into account the liveness ranges of the variables contributing to the code to be moved. This is especially important in a GCed language; for example, unconditionally sinking a …
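To make the code-motion idea concrete, here is a minimal source-level sketch (not the actual SSA2 transformation; `runFast`, `promote`, and the record shape are hypothetical stand-ins): the record is only needed on the slow path, so its allocation can be sunk into the branch that promotes, leaving the fast path allocation-free.

```sml
(* Illustrative sketch only: a source-level analogue of sinking an allocation into
   the branch that needs it. runFast and promote are hypothetical stubs. *)
fun runFast f = f ()
fun promote {func, result} = (result := SOME (func ()); valOf (!result))

(* Before: the record is allocated unconditionally, even on the fast path. *)
fun parBefore (f, tokens) =
  let
    val joinData = {func = f, result = ref NONE}  (* heap allocation on every call *)
  in
    if tokens = 0 then runFast f else promote joinData
  end

(* After: the allocation is sunk into the slow-path branch that uses it, so the
   fast path (tokens = 0) allocates nothing. *)
fun parAfter (f, tokens) =
  if tokens = 0 then runFast f
  else promote {func = f, result = ref NONE}
```

As noted above, whether such sinking is actually profitable also depends on how it affects the live ranges of the values feeding the moved code.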
-
I implemented storing the PCall alternate return addresses in the `frameInfo` table.
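As a rough conceptual picture of what this changes (all names and types here are hypothetical; the real mechanism lives in the backend's frame layout and the `frameInfo` table, not in SML source): instead of each `pcall` frame storing its alternate return addresses on the call-stack, the frame stores only a frame index, and the alternates are recovered from a static per-frame-kind table.

```sml
(* Conceptual sketch only; hypothetical names, not MPL's actual representation. *)

(* Before: every pcall frame carries its alternate return addresses on the stack. *)
type pcallFrameBefore =
  {returnAddress : word, altReturn1 : word, altReturn2 : word}

(* After: the frame carries only a frame index; the alternates live in a static,
   per-frame-kind table (analogous to the frameInfo table), built once rather than
   written into every frame. *)
type pcallFrameAfter = {returnAddress : word, frameIndex : int}

val alternateReturns : {altReturn1 : word, altReturn2 : word} vector =
  Vector.fromList [{altReturn1 = 0w0, altReturn2 = 0w0}]  (* placeholder entry *)

fun alternatesOf ({frameIndex, ...} : pcallFrameAfter) =
  Vector.sub (alternateReturns, frameIndex)
```

Presumably the win is that the static table is written once, so the fast path no longer has to push the extra return addresses into each frame.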
-
I've been experimenting with an alternative approach to implementing `par`:

```sml
fun par(f, g) =
  let
    fun f'() =
      ( if currentSpareHeartbeatTokens() = 0 then
          ()
        else
          tryPromoteNow()
      ; f()
      )
  in
    pcallFork(f', g)
  end
```

This succeeds in avoiding the closure allocation on the fast path in simple codes (e.g., fully parallel `fib`), and therefore could be a simpler alternative to implementing a code-motion optimization (such as suggested in #186 (comment)).

However, a problem: with this change, the … I tried optimizing the …
-
As of today (July 2, 2024), we've made a number of improvements (see the Completed list in the original post).

I've measured the performance impact of these changes on various benchmarks from … In general, the results are really good: we see a big performance improvement on …

Improvement since v0.5 on …
-
Collecting ideas for optimizing the fast path of `ForkJoin.par`, to further improve upon automatic parallelism management (APM) and close the performance gap with manually tuned code. For example, see #184 --- this yielded a small improvement.

Quite a few ideas have already been implemented; see the Completed list below. Some benchmarking results are shown in one of the comments below (link), and the results are generally very good.
Any more ideas still to do?
Completed (last updated: July 2, 2024)

- **Reducing heap allocations** (DONE: #195). As noted here, there seems to be an opportunity for reducing the number of heap allocations on the fast path. Reducing the number of heap allocations is beneficial along multiple dimensions: …

- **Optimizing closure representations** (DONE: #193). The code generated for functions that make use of parallelism has a hidden overhead: larger closures for recursive functions, specifically to store scheduler data. Some notes: we observed in `fib` that the number of arguments to each recursive call can be as many as 15 or more; these additional arguments appear to be due to a large closure (containing scheduler data) being flattened.

- **Avoid heap allocation for universal type to store join data** (DONE: #184).

- **Move return addresses off the call-stack** (DONE: #188). At POPL, @JohnReppy mentioned an idea: we could move the additional return addresses of `pcall` out of the call-stack and instead store these in a static lookup table. For MLton/MPL specifically, we could do this using the `frameInfo` table. This likely wouldn't be terribly difficult to implement, and might yield a small improvement.

- **Primitive support for heartbeat tokens** (DONE: #187). MPL queries the number of current spare heartbeat tokens at every `pcall`. This is part of the token management algorithm described in the APM paper. To implement this, we currently make a C call to `GC_currentSpareHeartbeats` on the fast path. See here. (A rough source-level sketch of this C-call approach is shown after this list.) This could be optimized by turning `currentSpareHeartbeats` into a `_prim`, which could then be implemented by directly reading `s->spareHeartbeats` from the `GCState`. Going even further, we could consider locally caching the current number of heartbeat tokens (similar to `StackTop` and `Frontier` in the generated code), to keep track of this value in a register and make these queries extremely fast. The cost of losing a register might not be worth it, though.