
Effects: double translation of functions and dynamic switching between direct-style and CPS code #1461

Open
wants to merge 2 commits into base: master

Conversation

OlivierNicole
Contributor

@OlivierNicole OlivierNicole commented Apr 28, 2023

This feature makes programs that use OCaml 5 effects run faster in JavaScript, by running as little continuation-passing style (CPS) code as possible. Based on an initial suggestion by @lpw25, we generate two versions of each function that may be called from inside an effect handler (according to the existing static analysis): a direct-style version and a CPS version. At runtime, the direct-style versions are used, except when entering an effect handler, in which case only CPS code is run until the outermost effect handler is exited. This approach trades program size for speed: because a number of functions are compiled in two versions, the generated programs are bigger. For this reason, the feature is opt-in, behind the --enable doubletranslate flag. This is joint work with @vouillon.
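
For intuition, here is a minimal OCaml model of the runtime behaviour described above. It is only an illustration of the idea, not the actual generated JavaScript, and the names (double, in_effect_handler, call) are invented for this sketch:

```
(* Each transformed function is conceptually compiled to a pair of bodies:
   a fast direct-style one, and a CPS one used while an effect handler is
   installed. *)
type ('a, 'b) double =
  { direct : 'a -> 'b
  ; cps : 'a -> ('b -> unit) -> unit
  }

(* Set when entering the outermost effect handler, cleared when leaving it. *)
let in_effect_handler = ref false

let call (f : ('a, 'b) double) (x : 'a) (k : 'b -> unit) : unit =
  if !in_effect_handler then f.cps x k else k (f.direct x)
```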

We encountered a design difficulty: when functions are transformed into pairs of functions, it is unclear how to deal with captured identifiers when the functions are nested. To avoid this problem, functions that must be transformed are lambda-lifted, and thus no longer have any free variables except toplevel ones.
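
To make the lifting step concrete, here is a minimal sketch with invented names (g, g_lifted): the free variable of the nested function becomes an explicit parameter, so the lifted function has no free variables and can be moved to toplevel.

```
(* Before lifting, [g] captures [a]:
     let f a =
       let g x = a + x in
       g 1 + g 2
   After lifting, the capture becomes an explicit argument: *)
let g_lifted a x = a + x

let f a = g_lifted a 1 + g_lifted a 2
```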

The transform is rather successful at preserving the performance of small, monomorphic programs.

  • I hypothesize that hamming is slower for the same reason as on current master: it uses lazy values, which are an obstacle to the global flow analysis. Edit: I am not able to reproduce a slowdown on hamming in my latest round of benchmarks.
  • A number of micro-benchmarks are somewhat faster, maybe because the static analysis performed during the CPS transform is better at finding exact-arity calls.
  • I am not sure why fft is slightly slower, however; the generated code looks very similar.

[benchmark graph]

The difference becomes negligible on large programs. CAMLboy is actually… 3 % faster with effects enabled (compared to 25 % slower previously): 520 FPS instead of 505 FPS, although the standard deviation is high at ~11 FPS, so it would be fair to say that the difference is not discernible.

ocamlc is not discernibly slower, either (compared to 10 % slower previously).

#1461 (comment) breaks down which parts of the program are actually made faster or slower, and why typical effect-using programs will be faster with this feature.

As some functions must be generated in two versions, the size of the generated code is larger (up to 76 % larger), and a few percent larger when compressed.

[code size graphs]

Compiling ocamlc is about 70 % slower; the resulting file is 64 % larger when compressed.

A caveat of this approach is that all benefits are lost as soon as an effect handler is installed. This is an issue for scheduling libraries such as Eio, as they usually work by having an effect handler installed for the program’s entire lifetime. To mitigate this, we provide Js_of_ocaml.Js.Effect.assume_no_perform : (unit -> 'a) -> 'a. Evaluating assume_no_perform f runs the direct-style version of f, i.e. the faster one. The same applies to the transitive callees of f that do not themselves install an effect handler. The programmer must ensure that these functions do not perform effects (except under a newly installed effect handler).
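
As a quick illustration of the intended use, assuming the API described above (heavy_computation is a hypothetical stand-in for effect-free work):

```
(* Sketch only: run [heavy_computation] in direct style, promising that it
   performs no effect without installing its own handler first. *)
let result =
  Js_of_ocaml.Js.Effect.assume_no_perform (fun () -> heavy_computation ())
```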

@OlivierNicole
Contributor Author

I think this PR is best reviewed as a whole, not commit by commit.

@OlivierNicole
Contributor Author

OlivierNicole commented Apr 28, 2023

It looks like these two effect handler benchmarks are slower with this PR, 18 % and 8 % slower, respectively. I need to spend some time on it to understand why.

            5.2.0      this PR
generators  5.593 s    6.651 s
chameneos   36.6 ms    39.5 ms

@kayceesrk

Chameneos runs are too short. You should increase the input size. It takes the input size as a command line argument. Something that runs for a second is more representative as it eliminates the noise due to JIT and other constant time costs.

@OlivierNicole
Contributor Author

Good point. I find that chameneos is 9.8 % slower with this PR, 3.428 s versus 3.753 s.

My theory is that effect handlers are slightly slower due to the fact that function calls in CPS code cost an additional field access (applying f.cps instead of just f). So these benchmarks that use effect handlers intensively are unfavorable. However, I expect that programs whose execution mixes more usual code with some effect handling (i.e., programs that do not spend all of their time in effect handlers) will see their performance much improved by this PR, like the non-effect-using programs above.

@kayceesrk

I agree with the reasoning and do not expect real programs to behave like generators or chameneos. The performance difference is small enough that I would consider it acceptable even for programs that heavily use effect handlers.

@kayceesrk

Btw, the original numbers aren't useful to understand the improvements brought about by this PR. For this, you need 3 variants:

  1. default
  2. --enable=effects on master
  3. --enable=effects on this PR

I'd be interested to see the difference between (2) and (3) in addition to the current numbers which show the difference between (1) and (3).

@vouillon
Member

vouillon commented May 2, 2023

My theory is that effect handlers are slightly slower due to the fact that function calls in CPS code cost an additional field access (applying f.cps instead of just f). So these benchmarks that use effect handlers intensively are unfavorable.

Note that f.cps(x1,...,xn) is a method call, which is somewhat slower than a plain function call. It might be faster to do the following instead: f.cps.call(null,x1,...,xn)

I had to do that in #1397:

(* Make sure we are performing a regular call, not a (slower)
   method call *)
match f with
| J.EAccess _ | J.EDot _ ->
    J.call (J.dot f (Utf8_string.of_string_exn "call")) (s_var "null" :: params) J.N
| _ -> J.call f params J.N

@OlivierNicole
Contributor Author

I believe that the form f.cps.call(null, x1, ..., xn) is already the one used.

Btw, the original numbers aren't useful to understand the improvements brought about by this PR. For this, you need 3 variants:

1. default

2. --enable=effects on `master`

3. --enable=effects on this PR

I'd be interested to see the difference between (2) and (3) in addition to the current numbers which show the difference between (1) and (3).

Here are the graphs showing the difference between --enable=effects on master (revision 5.2.0) and --enable=effects on this PR:

[benchmark graphs]

@kayceesrk

Thanks. The execution time improvement is smaller than what I would have expected. Is that surprising to you or does it match your expectation?

Also, it would be useful to have all the variants plotted in the same graph with direct as the baseline.

@OlivierNicole
Contributor Author

It more or less matches my expectation. My reasoning is the following: on most of these small, monomorphic benchmarks, the static analysis eliminates most CPS calls at compile time, so the dynamic switching does not change the run time much and may even slightly worsen it. On benchmarks that heavily use effect handlers, I also expect the run time to be worse: most of the time is spent in CPS code anyway, so dynamic switching only adds overhead.

I therefore expect the biggest improvements to happen on larger programs, on which the static analysis does not work as well due to higher-order functions and mutability, and which do not spend most of their time in effect handlers.

If my hypothesis is confirmed, then the question is: is this trade-off acceptable? Keeping in mind that there might be ways to improve this PR to recover more performance.

@OlivierNicole
Contributor Author

Also, it would be useful to have all the variants plotted in the same graph with direct as the baseline.

I have updated the PR message with new graphs showing all the variants.

@OlivierNicole
Contributor Author

I tried to build a benchmark that uses Domainslib, as an example of a more typical effect-using program. But the linker complains that the primitive caml_thread_initialize is missing.

I tried to add it in a new runtime/thread.js file, but it doesn’t seem to be taken into account; I’m not sure what the proper way to add a primitive is.

Also, when I build js_of_ocaml I’m getting a lot of primitive-related messages that I don’t really understand:

$ dune exec -- js_of_ocaml
Entering directory '/home/olivier/jsoo/js_of_ocaml'
warning: overriding primitive "caml_call_gen"
  old: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib.js:154
  new: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib_modern.js:151
warning: overriding primitive "caml_call_gen_cps"
  old: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib.js:159
  new: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib_modern.js:156
  (the same two warnings are repeated several more times)

@OlivierNicole
Contributor Author

I tried to build a benchmark that uses Domainslib, as an example of a more typical effect-using program. But the linker complains that the primitive caml_thread_initialize is missing.

I solved it by downgrading to lockfree 0.3.0 as suggested by @jonludlam. But the resulting program never completes. I assume that the mock parallelism provided by the runtime doesn’t suffice for using Domainslib; a “domain” is probably stuck spin-waiting forever.

@OlivierNicole OlivierNicole force-pushed the optim_effects branch 2 times, most recently from 5362ff4 to 91e352e on May 17, 2023 at 14:41
@OlivierNicole
Contributor Author

I think that this PR is ready for review. The only two problems that prevent the CI from being green are:

  • the Array.fold_left_map function being available only from OCaml 4.13. What is the policy in this case: do we add it to compiler/lib/stdlib.ml or do we avoid using it?
  • a stack overflow when running the testsuite with profile using-effects, which I have yet to investigate.

@OlivierNicole OlivierNicole marked this pull request as ready for review May 18, 2023 00:27
@hhugo
Member

hhugo commented May 18, 2023

the Array.fold_left_map function being available only from 4.13. What is the policy in this case, do we add it to compiler/lib/stdlib.ml or do we avoid using it?

Just add it to the stdlib module compiler/lib/stdlib.ml
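
For reference, a possible shim with the semantics of Array.fold_left_map from OCaml 4.13; this is only a sketch of what such an addition to compiler/lib/stdlib.ml could look like, not necessarily the code that ended up in the PR:

```
let fold_left_map f acc a =
  let n = Array.length a in
  if n = 0
  then acc, [||]
  else begin
    (* Seed the result array with the image of the first element, then fold
       over the rest while filling the array in place. *)
    let acc, y0 = f acc a.(0) in
    let r = Array.make n y0 in
    let acc = ref acc in
    for i = 1 to n - 1 do
      let acc', y = f !acc a.(i) in
      acc := acc';
      r.(i) <- y
    done;
    !acc, r
  end
```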

compiler/lib/code.ml: outdated review thread (resolved)
@hhugo
Member

hhugo commented May 22, 2023

Lambda_lifting doesn't seem to be used anymore, is that expected? Should Lambda_lifting_simple replace Lambda_lifting?

@hhugo
Member

hhugo commented May 22, 2023

From a quick look at the PR, the benefit of such a change is not clear; can you highlight examples where we see clear improvements?

@OlivierNicole
Contributor Author

Lambda_lifting doesn't seem to be used anymore, is that expected? Should Lambda_lifting_simple replace Lambda_lifting?

As I discovered just today, Lambda_lifting is still relevant to avoid generating too deeply nested functions. I just pushed a commit that reinstates the post-CPS-transform Lambda_lifting.f pass. Therefore, Lambda_lifting is now used.

There are now two lambda-lifting passes, for two different reasons. Lambda_lifting and Lambda_lifting_simple do rather different things. The latter simply lifts functions to toplevel, but it takes as a parameter which functions to lift and returns some information about the lifted functions, to be used by the subsequent CPS transform. It also handles mutually recursive functions. Lambda_lifting does none of this, as it is not useful for its purpose; however, its lifting threshold and baseline are configurable.

For this reason I am not convinced that there is an interest in merging the two modules.

From a quick look at the PR, the benefit of such a change is not clear; can you highlight examples where we see clear improvements?

I am convinced that most real-world effect-using programs will benefit from this PR, for the reasons given in my message above; but it’s hard to prove, because we don’t yet have examples of such typical programs that work in JavaScript. Programs using Domainslib don’t work well with js_of_ocaml (and are arguably not really relevant, as JS is not a multicore language). Concurrency libraries like Eio are a more natural fit. I am currently trying to cook up a benchmark using the experimental JS backend for Eio.

@hhugo
Member

hhugo commented May 22, 2023

Given the size impact of this change, it would be nice to be able to disable (or control) this optimization. There are programs that would not benefit from it, and it would be nice not to have to pay the size cost for no benefit.

The two lambda_lifting modules are confusing; we should either merge them (which would allow sharing the duplicated code) or at least find better names.

Compiling ocamlc is about 70 % slower

Do you know where this comes from? Does it mostly come from the new lambda-lifting pass, or are other passes heavily affected as well (generate, js_assign, ...)?

compiler/lib/subst.mli: outdated review thread (resolved)
@OlivierNicole
Contributor Author

Thank you for the review, and sorry for the late response; I have been prioritizing another objective over the past few weeks.

One update is that there is no performance gain on programs that use Eio, which is a shame as it is expected to be one of the central uses of effects. More generally, when the program spends almost all of its time under at least one level of effect handler, there is essentially no performance gain, since we run the CPS versions of every function. And I expect this programming pattern (installing a topmost effect handler at the beginning of the program) to be the most common with effects.
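
Concretely, the problematic pattern is a scheduler entry point like the following, where nearly all user code runs under the handler installed by the scheduler (Sched.run and main are hypothetical names standing in for, e.g., an Eio main loop):

```
(* A single effect handler is installed for the program's entire lifetime,
   so almost everything executes in CPS despite the double translation. *)
let () = Sched.run (fun () -> main ())
```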

It is not yet clear to me whether the implementation of double translation can be adapted to accommodate this.

@OlivierNicole
Contributor Author

I pushed a commit which adds a new primitive, caml_assume_no_effects, which makes it possible to guarantee that a function is called in its (faster) direct-style version, for optimization purposes. See the commit message for details.

hhugo pushed a commit to OlivierNicole/js_of_ocaml that referenced this pull request Jun 24, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Oct 4, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Oct 9, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Oct 9, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Oct 10, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Oct 10, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
@OlivierNicole OlivierNicole force-pushed the optim_effects branch 3 times, most recently from 9a1e0c6 to 954395d on October 10, 2024 at 13:10
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Oct 10, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
@OlivierNicole
Contributor Author

Coming back to this after spending time on the merge between js_of_ocaml and wasm_of_ocaml.

I’ve rebased this PR on master, re-aligned things where needed, and made the CI happy again. It’s ready for review.

cc @vouillon

vouillon pushed a commit to OlivierNicole/js_of_ocaml that referenced this pull request Nov 4, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
Passing a function [f] as argument of `caml_assume_no_effects`
guarantees that, when compiling with `--enable doubletranslate`, the
direct-style version of [f] is called, which is faster than the CPS
version. As a consequence, performing an effect in a transitive callee
of [f] will raise `Effect.Unhandled`, regardless of any effect handlers
installed before the call to `caml_assume_no_effects`, unless a new
effect handler was installed in the meantime.

Usage:

```
external assume_no_effects : (unit -> 'a) -> 'a = "caml_assume_no_effects"

... assume_no_effects (fun () -> (* Will be called in direct style... *)) ...
```

When double translation is disabled, `caml_assume_no_effects` simply
acts like `fun f -> f ()`.

This primitive is exposed via `Js_of_ocaml.Js.Effect.assume_no_perform`.
@vouillon
Member

In the message of the second commit, you write:

As a consequence, performing an effect in a transitive callee
of [f] will raise Effect.Unhandled, regardless of any effect handlers
installed before the call to caml_assume_no_effects, unless a new
effect handler was installed in the meantime.

Maybe I'm missing something, but I don't see why this should be true. Performing an effect results in calling the function caml_perform_effect, which calls uncaught_effect_handler only if the fiber stack is currently empty.

In any case, you should add a test for that.

Member

@vouillon vouillon left a comment

Here is a first batch of comments

Comment on lines +115 to +117
let p, trampolined_calls, in_cps = Effects.f ~flow_info:info ~live_vars p in
let p = if Config.Flag.double_translation () then p else Lambda_lifting.f p in
p, trampolined_calls, in_cps)

Suggested change
let p, trampolined_calls, in_cps = Effects.f ~flow_info:info ~live_vars p in
let p = if Config.Flag.double_translation () then p else Lambda_lifting.f p in
p, trampolined_calls, in_cps)
p
|> Effects.f ~flow_info:info ~live_vars
|> map_fst (if Config.Flag.double_translation () then Fun.id else Lambda_lifting.f))

else
( p
, (Code.Var.Set.empty : Effects.trampolined_calls)
, (Code.Var.Set.empty : Effects.in_cps) )
, (Code.Var.Set.empty : Code.Var.Set.t) )

Suggested change
, (Code.Var.Set.empty : Code.Var.Set.t) )
, (Code.Var.Set.empty : Effects.in_cps) )

Comment on lines +77 to +78
if (d === 0) return f(...args);
else if (d < 0) {

Suggested change
if (d === 0) return f(...args);
else if (d < 0) {
if (d === 0) {
return f.apply(null, args);
} else if (d < 0) {

@@ -85,7 +85,7 @@ function caml_call_gen(f, args) {
args[args.length - 1] = k;
return caml_call_gen(g, args);
};
return f.apply(null, args);
return f(...args);

Suggested change
return f(...args);
return f.apply(null, args);

}
function caml_call_gen_cps(f, args) {
var n = f.cps.l >= 0 ? f.cps.l : (f.cps.l = f.cps.length);
if (n === 0) return f.cps(...args);

Do we need this line? (And the corresponding line in stdlib.js.)

Comment on lines +71 to +72
let idx = Var.idx x in
if idx < Array.length var_depth then var_depth.(idx) <- depth)

The comparison should always be true since we are marking bound variables before modifying the body.

Suggested change
let idx = Var.idx x in
if idx < Array.length var_depth then var_depth.(idx) <- depth)
var_depth.(Var.idx x) <- depth)

, Var.Map.union (fun _ _ -> assert false) (snd lifters) (snd lifters') ) ))
else
(* We lift possibly mutually recursive closures (that are created by
contiguous statements) together. Isolated closures are lambda-lifted

When a function (or a set of recursive functions) has no free variable, we could just move it to toplevel.

Comment on lines +202 to +206
List.fold_left
current_contiguous
~f:(fun st (_, _, pc, _) ->
traverse ~to_lift var_depth st pc (depth + 1))
~init:st

We already called traverse recursively when pushing the function into current_contiguous.

Suggested change
List.fold_left
current_contiguous
~f:(fun st (_, _, pc, _) ->
traverse ~to_lift var_depth st pc (depth + 1))
~init:st
st

Comment on lines +183 to +187
match l with
| i :: rem ->
let rem', st = rewrite_body [] st rem in
i :: rem', st
| [] -> [], st)

Could you try to refactor the code so that we don't have three copies of this piece of code?
Maybe this function rewrite_body could just traverse the block's body, with separate helper functions to rewrite the closures?

i :: rem, st
| [] -> [], st
in
( List.map current_contiguous ~f:(fun (f, params, pc, args) ->

I suspect you are reversing the functions' order. This should not make any semantic difference, but it is probably better for debuggability to preserve the order.
