Replies: 13 comments 19 replies
-
In this task, I continue using my bril-py interpreter and build a tracing-based JIT on top of it. The main part of my JIT can be found here. It is also very easy to add speculative-execution support to my interpreter: it only needs to store the original data frame of the program and restore it when the guard condition fails. It takes less than 10 lines of code to implement.

Tracing-Based JIT
I simply start tracing from the beginning of the program and add instructions to the trace based on which operation the JIT meets:
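A simplified sketch of that per-instruction dispatch might look like the following (the helper and field names here are made up for illustration, not taken from my actual interpreter):

```python
# Simplified sketch of a tracing step (illustrative names only).
# Each executed instruction is appended to the trace; branches become
# guards that record the direction actually taken; jumps disappear.

def trace_step(instr, env, trace):
    op = instr.get("op")
    if op == "br":
        taken = env[instr["args"][0]]
        # Replace the branch with a guard on the condition's runtime value.
        trace.append({"op": "guard", "args": instr["args"][:1], "taken": taken})
    elif op == "jmp":
        pass  # jumps vanish: the trace is straight-line code
    elif op != "label":
        trace.append(instr)

trace = []
env = {"cond": True, "x": 3}
trace_step({"op": "const", "dest": "x", "value": 3}, env, trace)
trace_step({"op": "br", "args": ["cond"], "labels": ["then", "else"]}, env, trace)
trace_step({"op": "jmp", "labels": ["end"]}, env, trace)
```

Jumps vanish because a trace is straight-line; branches become guards that remember which way execution actually went.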
After we obtain the trace, we can call the LVN and DCE passes to optimize it. I then provide a transform.py script to insert the trace back into the original program and add speculative markers to it. Since I trace the whole program, the whole trace can be placed directly at the front of the program. Finally, we can take the program with the optimized trace and re-execute it.

Testing
I again took several test programs from previous lessons, JIT-executed them, and observed their performance. For demonstration, I only use two test cases here. The first case is the example from class, which involves function calls (interprocedural optimization).
After the trace was generated, I found it very tricky to optimize with LVN. Our previous implementation of LVN actually cannot handle this case, since not all the arguments in the instructions are constants; we cannot simply apply constant folding or constant propagation here. Some symbolic equivalence testing is needed. As a simple workaround, I extend the LVN pass to check the previously referenced instruction and see whether it is a complementary instruction (e.g.,
Finally, we obtain the program with optimized trace.
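The complementary-instruction check could be sketched roughly like this (a toy LVN-style table with an invented representation, not my actual pass): when a `not` is applied to a value that is itself a `not`, the double negation collapses back to the original value number.

```python
# Sketch: tracking complementary (negated) values in an LVN-style table.
# Value numbers are keyed by (op, argument value numbers); a "not" of a
# "not" collapses back to the original value number.

def value_number(table, op, args):
    if op == "not":
        prev = table.get(args[0])
        if prev is not None and prev[0] == "not":
            return prev[1][0]  # not (not x) == x
    key = (op, tuple(args))
    # Reuse an existing number for an identical expression.
    for num, expr in table.items():
        if expr == key:
            return num
    num = len(table)
    table[num] = key
    return num

table = {}
a = value_number(table, "lt", (0, 1))    # a = x < y
na = value_number(table, "not", (a,))    # na = !a
back = value_number(table, "not", (na,)) # !!a folds back to a
```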
The result is shown below. We can see that the two programs indeed generate the same result, and our traced program executes fewer instructions than the original one.

```
> python3 bvm.py -f test/demo.json 42
42
# of instructions: 11
> python3 bvm.py -f test/demo.opt.json 42
42
# of instructions: 7
```

But tracing the whole program incurs overhead. If we switch to another input value no smaller than 100, we need to speculatively execute the traced code first and then roll back to the beginning of the program and execute again. So in this case, the number of executed instructions increases.

```
> python3 bvm.py -f test/demo.json 42
42
# of instructions: 11
> python3 bvm.py -f test/demo.opt.json 100
102
# of instructions: 15
```

The second case involves a loop. The test program can be found here. The traced program is exactly the same as loop unrolling, so no loop overhead is introduced after tracing. The number of instructions is greatly reduced, as shown below, which demonstrates the effectiveness of my JIT compiler. The downside is that the size of the generated program can be large, but this can be tackled by changing the starting point of the trace to the beginning of the loop.

```
> python3 bvm.py -f test/loopcond.json
1984
# of instructions: 117
> python3 bvm.py -f test/loopcond.opt.json
1984
# of instructions: 79
```
-
Summary
Tracing-Stitching-Code and Tracing-Recording-Code. I implemented an ahead-of-time tracing optimizer. I go to the first non-main function in the program and trace until I hit the end of the function, e.g. a return, a call to another function, a call to print, or a memory operation. I then take this trace, optimize it using LVN, and stitch it back into the program. To stitch, I simply add a speculate command, replace all branches with guards as appropriate, remove jumps, and finally add a commit. After that, I add a jump to the correct location in the program, as required.

To test, I used the given benchmarks and checked whether the results were the same before and after running the traced program. I had to fix a few bugs, such as how my LVN failed to handle floating-point operations. This forced me to bail on optimizing floating-point operations. It would likely be a quick fix to handle some of these floating-point operations in LVN, modulo some algebraic simplifications that might cause accuracy problems. Another bug came up in how I used guard expressions. Since the guard only fails when the condition is false, I had to track the runtime value of branches and manually negate the condition (when the false branch is taken during tracing) so that the guard works properly when stitched into the original program.

Results
Results were not great. The trace optimizer performs worse on every single benchmark. This is likely because my traces are very short, based on the conditions listed above. The short length of the trace does not allow LVN to do much optimization. Other optimizations besides LVN, such as DCE, could likely be used as well, but I did not add this; perhaps trying them would lead to more fruitful results. I also created some contrived test cases, which show the same behavior: the trace optimizer performs worse, even on long traces.
I believe this may be an issue with how I am doing LVN, and the next thing I would fix to improve the trace optimizer is to identify opportunities to remove redundancies and figure out why LVN is not removing them. Furthermore, my tracing is not interprocedural. I made this decision because I was not sure how to make guards jump to the right label in the correct function if the guard failed. If tracing were interprocedural, the inlined code would allow for longer regions of code that could be optimized using LVN; it is likely that more code would be found redundant or removed entirely with longer traces. Finally, tracing only occurs in one function, for simplicity of implementation. It is likely that the first function I choose to optimise does not present many chances for optimisation to occur. It would be good to trace in as many functions as possible, then optimise all of these traces and stitch them into the original program. This would give as many chances as possible to optimize traces.
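The guard-negation fix described above can be sketched as follows (the names are invented for illustration; a guard bails out when its condition is false, so a trace that followed the false side of a branch must guard on the negated condition):

```python
# Sketch of the branch-to-guard rewrite (invented names). If tracing
# followed the false side of a branch, insert a negation first so the
# guard passes exactly when execution would follow the traced path.

def branch_to_guard(instr, taken, bail_label, fresh):
    """Rewrite a recorded `br` into its guard instruction(s)."""
    cond = instr["args"][0]
    out = []
    if not taken:  # false branch was traced: guard on !cond
        neg = fresh()
        out.append({"op": "not", "dest": neg, "type": "bool", "args": [cond]})
        cond = neg
    out.append({"op": "guard", "args": [cond], "labels": [bail_label]})
    return out

counter = [0]
def fresh():
    counter[0] += 1
    return f"_t{counter[0]}"

g = branch_to_guard({"op": "br", "args": ["c"], "labels": ["a", "b"]},
                    taken=False, bail_label="bail", fresh=fresh)
```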
-
Code here.

what happened
took it easy on this one as i had a lot of other stuff to do, but still was a good time :) step 1 was making the traces, which wasn't as trivial as i thought. i initially was tracing the whole execution, planning to have my compiler throw out unused stuff, but it turns out this was like 70 megs for the ackermann test, so that's a bad idea. eventually i settled on only tracing the main function, and stopping tracing at calls and back edges. i kept track of labels so that we know where to jump to when speculation finishes. reading in the trace, i stopped early at prints as well, and figured out which label represented the "end" of the trace (sometimes i needed to round down and lose some traced instructions, as i didn't fancy inserting new labels in the middle of blocks). then i got rid of jumps and labels internal to the trace, and fixed up branches. this required an extra temp for when the branch test was false, but it was relatively straightforward to turn them into guards. then i translated the main function to start with a speculate, have the repaired trace, then a commit followed by a jump to the label representing the end of the trace. after this, i have a label for all of the guards to fail to, and then the original code.

results
so it did work, it just made everything worse. this makes sense though, since i didn't optimize the trace and only traced main. the fact that none of the programs broke really showcases how little i've accomplished, as i did nothing to deal with memory.
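roughly, spotting a back edge can be sketched like this (a toy static version with made-up names; the real check happens while interpreting, but the idea is the same: a jump target we've already passed is a back edge):

```python
# toy sketch: a label we've already seen is "behind" us, so any jump or
# branch targeting it is a back edge -- the point where tracing stops.

def is_back_edge(target_label, seen_labels):
    return target_label in seen_labels

seen = []
back_edges = []
program = [
    {"label": "loop"},
    {"op": "add", "dest": "i", "args": ["i", "one"]},
    {"op": "br", "args": ["c"], "labels": ["loop", "done"]},
    {"label": "done"},
]
for instr in program:
    if "label" in instr:
        seen.append(instr["label"])
    for lbl in instr.get("labels", []):
        if is_back_edge(lbl, seen):
            back_edges.append(lbl)
```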
-
My implementation is here.

Implementation
I implemented the inter-procedural version of tracing and tested it on the benchmarks in the bril project.

Details
Results and Limitations

Limitations
I didn't handle recursive function calls and pointers. Simply copying the function arguments and return values doesn't work for recursive function calls, as values in that function will be overwritten; I imagine there should be some stack data structure to keep the values in each recursive call.

Results
I tested the tracing JIT on 23 benchmarks in the bril project with turnt. The results are all correct, and I got a similar dynamic instruction count compared with the baseline. This is as expected, since I did not perform optimizations on the trace.

Discussion
Turning
-
@5hubh4m, @anshumanmohan, and @ayakayorihiro worked together on this assignment.

Implementation
We implemented a simple trace-based speculative optimizer for Bril, following the recipe from the task documentation.

Modifying the Reference Interpreter
We modified the interpreter to trace an entire function. We run tracing only if the function was called a certain number of times. When we encounter the next invocation of the function after it becomes hot, we trace all instructions that were executed. For all
Optimizer
Our optimizer reads in the original program from
Testing
We verified the correctness of our work on all of the Bril benchmarks that do not use the memory extension.
The most interesting benchmark was
Challenges
As before, we continued to face challenges in dealing with some of the intricacies of the TypeScript language. One implementation challenge was dealing with recursive function calls within our tracing mechanism. Another was implementing the guards, which led to the realization that we had to record the result of the conditional for every branch instruction. Furthermore, we initially implemented our tracing mechanism at the interprocedural level, but found the interaction between multiple function calls a large challenge (namely, the return of function subroutines caused interference in determining when to start and stop tracing). After grappling with this for a while, we decided to perform global tracing rather than interprocedural tracing. Our implementation has a fundamental limitation, which is that we simply trace the path taken by the
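The hot-function trigger described above could be sketched like this (the threshold and names are invented for illustration; the real threshold is a tuning knob):

```python
# Sketch of a hot-function trigger: count invocations per function and
# switch tracing on for invocations after the function becomes hot.

HOT_THRESHOLD = 10  # made-up value for illustration

class HotCounter:
    def __init__(self):
        self.calls = {}

    def on_call(self, fname):
        """Return True if this invocation should be traced."""
        n = self.calls.get(fname, 0) + 1
        self.calls[fname] = n
        return n > HOT_THRESHOLD  # trace invocations after it turns hot

hc = HotCounter()
decisions = [hc.on_call("main") for _ in range(12)]
```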
-
My code is here. I implemented a very basic JIT compiler which works on all benchmarks except

Implementation and Challenges
At a high level, my modified bril interpreter starts tracing whenever it traverses a backedge, then continues until it loops back to itself or hits something that it needs to bail out for (

Running python from typescript
The most fun part of my implementation is the callPython function. Basically, I wanted to reuse as much of my pre-existing code as possible, and so I wanted to be able to call python code. While typescript/python interoperability in general is complicated, the UNIX philosophy of having programs communicate primarily through text streams helped out a ton here. Since most of the python written in class used

Getting this all to work did also force me to interact with Javascript's

I only wound up using this function to use my old code for computing the dominators of a given basic block, which I think was still easier than reimplementing all of that in typescript.

Backedges
I went with the suggestion to just start tracing every time you hit a backedge. I think that this is a super reasonable starting point for a tracing interpreter, since the main benefit of using JIT compilation is to improve code that gets used and reused a lot. The easiest way for that to happen is just for the code to be in a loop. Since the most basic loop is going to run through one path many times and then exit once, tracing through a backedge means that you're going to find that one path.

Tracking traces
I managed to keep track of all of my traces just in a javascript array. If I were doing longer traces then maybe writing them out would have been necessary, but as is, my traces were never that long anyway.
I was worried that I would have to do more work to avoid trying to trace through the code that I just generated, but it was super easy to avoid, because all it required was bailing out of tracing whenever I saw a

Inserting code
Another tricky part was inserting code into the program while the interpreter is still running it. At first this seemed suspiciously like not a major problem, because mostly I was just tracing loops, then finalizing the code right when I saw that I had returned to the beginning. This meant that I was only ever inserting code just after what the interpreter was trying to run. However, once I started running the benchmarks, I saw that while the "stop because we've looped now" case was working well, finalizing my trace at any other point was failing, because I would insert code which would change the

Measurement
I measured my code by running

It is correct on all benchmarks except for

On most programs it has no impact, I think because it just never finds a backedge. Whenever it does something, it makes it worse. This is just because it has to execute speculative instructions which it might not use, but its reinserted code isn't actually doing anything better in any way, so it can't make up for these unused speculations.
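The text-stream interop pattern described above might look roughly like this on the Python side (a simplified stand-in: the real tool computed dominators, and the names here are placeholders):

```python
# Sketch of the stdin/stdout contract on the Python side: the caller
# (TypeScript, in my case) spawns the script, pipes a Bril JSON program
# to stdin, and parses whatever JSON comes back on stdout.
import json
import sys

def function_names(prog):
    """Stand-in analysis; the real code computed dominators."""
    return [f["name"] for f in prog.get("functions", [])]

def run(stream_in=sys.stdin, stream_out=sys.stdout):
    prog = json.load(stream_in)
    json.dump(function_names(prog), stream_out)

# In a real pipeline: bril2json < prog.bril | python3 tool.py
```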
-
My implementation of the modified brili interpreter with tracing.

Tracing
Tracing is implemented starting from the main function to the first

Some specifications of my
Results
The results were not impressive; if anything, the opposite, as nothing really got any performance improvement. However, given how tracing is implemented here (only from the main function, with no optimizations performed on the trace), this makes sense. Here are the results from the benchmarks; thankfully nothing crashed and the same results were produced. So it's a very slight success in that I am adding code and getting the same results. I thought it was interesting that the pow benchmark gained 25% of its original instruction count.
-
Sorry for the late post, here is my code.

Implementation
My code is implemented in two parts: a modified version of brili (in TypeScript) that logs a trace for the entire program, and a Rust trace-stitching program. The modified version of brili logs a trace for the entire run of the program to a file in /tmp, including each instruction and its line number in the main function. The trace-stitching program then reads in the original program and the trace. The trace is cut off at the first use of print, store, call, alloc, or free, to prevent the trace from invoking unsupported functionality. Call is unsupported only because I ran out of time to implement the copying of function arguments and return values in the trace. I also remove any jumps from the trace and replace branch instructions with guards. To ensure the generated guards are correct, the modified version of brili adds an extra not instruction to the trace before a branch that is not taken. When traces are stitched into the original program, they are placed at the start of the program and, if they successfully complete, jump to the location of the last instruction in the trace (this is why it's important to know the line numbers of instructions in the original trace). If speculative execution fails, it jumps back to the start of the normal program.

Testing
I used turnt to test against a subset of the bril benchmarks. While the resulting programs work correctly, they are almost always worse than the original code because of the lack of optimization on traces. These optimizations should be implementable within the stitching framework with relative ease, given a bit more time.

Issues
I ran into some issues when trying to implement the copying of arguments and return values within a trace. I realized that, because of nested function calls, we need to be careful about matching function calls with returns in the trace.
Due to scheduling conflicts this week, I ran out of time to get this working and to support function calls.
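The call/return matching issue can be sketched with a simple stack (the trace format here is invented for illustration): each `ret` pairs with the most recent unmatched `call`, so nesting is handled naturally.

```python
# Sketch: pair each `call` in a trace with its own `ret` using a stack,
# so argument/return-value copying targets the right invocation even
# when calls are nested.

def match_calls(trace):
    stack, pairs = [], []
    for i, instr in enumerate(trace):
        if instr["op"] == "call":
            stack.append(i)
        elif instr["op"] == "ret":
            pairs.append((stack.pop(), i))
    return pairs

trace = [
    {"op": "call"},  # 0: outer call
    {"op": "call"},  # 1: nested call
    {"op": "ret"},   # 2: returns from the nested call
    {"op": "ret"},   # 3: returns from the outer call
]
```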
-
The modified bril interpreter is here. The code that transforms a program given its trace is here.

Implementation

Bril Interpreter
I modified the bril interpreter to produce traces. I added some new properties to the state type:

Transforming the Program
The program transformer takes in a JSON of a bril program and also reads a trace from standard input. It inserts all traced instructions at the start of the main function, except in the case of guard instructions; in that case, we replace the label argument with a fresh label that represents the start of the original program. This means that if a guard ever fails, we bail out and jump to the original start of the program. Here is an example of the
Transformed:
Evaluation
I checked the benchmarks using turnt. All of them passed except

Discussion
I think I could have made the program transformer produce more efficient code. I did what was easiest with the guard labels, so any time a guard fails, the program jumps all the way to the beginning. This is pretty bad, especially if the program has passed many guards and only fails on the last one. It would be better if the transformer picked a guard label that minimizes the number of instructions executed.
-
Hi, sorry for the late reply. My implementations are here:
How it works
To run programs faster with JIT, I first collected traces with the extended version of brili.ts (I just added a tracing class to it which records traces from the beginning of

(1) Includes:
(2) Includes:
(3) Includes:
Results
Example original program:
Example of the JIT-prepended program when recording the trace for the input
The above JIT code is not optimal, due to the limitations of the LVN/DCE optimizations that I got from the first assignments. However, it does show some improvements:
while the original execution:
Limitations
Currently there is no support for inter-procedural JIT; I just decided to prioritize optimizations over it.
-
Sorry for the late reply. Most of my work is in these two places: the modified brili.ts and the Python script bril-jit.py.

Implementation

Producing the trace
I first modified the TypeScript interpreter to make it produce the trace through standard output. I basically added a field

When I encounter branch instructions on some

Note that when I terminate tracing (this always happens in

Stitching
Then, I wrote a Python script

I adopted the categorical dual of @atucker's approach by flipping the arrows the other way (/s), i.e. I generated the trace string by calling the TypeScript interpreter as a subprocess from my Python script and collecting its
here

Testing
I tested using

Difficulties
TypeScript was not hard to learn. However, I initially had trouble understanding how it is possible to make the trace and the original program communicate "dynamically". But then I realized that we really are just doing mostly syntactic manipulations in this assignment.
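The subprocess plumbing is roughly the following (a sketch: the actual brili command line is replaced here by a tiny self-contained stand-in "interpreter" so the example runs on its own):

```python
# Sketch: run the interpreter as a subprocess, feed the program on
# stdin, and collect its stdout lines as the trace.
import subprocess
import sys

def collect_trace(cmd, program_json):
    """Feed the program text on stdin; return the stdout lines."""
    result = subprocess.run(cmd, input=program_json, text=True,
                            capture_output=True, check=True)
    return result.stdout.splitlines()

# Stand-in interpreter: echoes one fake trace line per input line.
# In reality this would be something like the modified brili command.
fake_interp = [sys.executable, "-c",
               "import sys; [print('>', l.strip()) for l in sys.stdin]"]
lines = collect_trace(fake_interp, "const\nbr\n")
```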
-
Sorry for the late post. My implementation is here.

Trace-based Speculative Optimizer for Bril
The task is to implement a trace-based speculative optimizer for Bril: the same concept as in a tracing JIT, but in a profile-guided AOT setting, where profiling, transformation, and execution are distinct phases. The idea is to implement the "heavy lifting" for a trace-based JIT without needing all the scaffolding that a complete JIT requires, such as on-stack replacement.

Test
Here, for simplicity, I only test on two examples:

The output is correct, and the dynamic instrs of the original program and the tracing program are nearly the same. That's because I haven't implemented optimizations here. The difference is that the tracing program adds some
The output answer remains the same. The original program has dynamic instrs = 26; the tracing program has dynamic instrs = 25.
The output answer remains the same. The original program has dynamic instrs = 5; the tracing program has dynamic instrs = 7.

Implementation

Modify the reference interpreter to produce traces
A ">" prefix is added to distinguish the instruction trace from the normal print output.

Transform tracing instructions to the required straight-line code
So a workaround is that we change the condition of the guard instruction based on the profiling result (take the cond or not): we want the whole trace to be executed. This is not what

Stitch the trace back into the program
As we trace the whole program, we add

After that, the original program block is added as the fallback of the
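Consuming the ">"-prefixed trace lines described earlier might look like this (a sketch with an invented output format): the stitcher splits the interpreter's mixed stdout into trace lines and ordinary program output.

```python
# Sketch: split mixed interpreter output into trace lines vs. program
# output, using the ">" prefix convention for trace lines.

def split_output(stdout_text):
    trace, printed = [], []
    for line in stdout_text.splitlines():
        if line.startswith(">"):
            trace.append(line[1:].strip())
        else:
            printed.append(line)
    return trace, printed

mixed = "> const v0 5\n5\n> br cond\n"
trace, printed = split_output(mixed)
```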
-
Super-late submission. My code can be found here. I implemented a tracer as part of

Implementation

Tracer
The tracer code can be found here. There is a

Injector
The injector is responsible for parsing the trace (

Simple Branch
Consider the following program
After parsing the trace, the injector changes the
So in case the guard fails, it will jump to the

Inlining
Something similar happens when a function call occurs inside a branch. Let's consider the following example:
Now, function
We can see that the whole

Experiments
Although I am not very confident about this implementation, in the two examples above it looks like it saves a couple of instructions. I ran one experiment for each of those two benchmarks:
The validity of the results looks good, as they produce the same output. However, it definitely requires more test cases to make sure that everything works fine.
-
Here's the thread for the dynamic compilers task, which involves doing some speculative transformations on Bril IR!