Replies: 13 comments
-
Summary
Collaborated with @collinzrj. We implemented the dead code elimination optimization on the trace. If a variable is used in one branch but not the other, then when we eliminate the branch we can also eliminate that variable, along with any other variables used only by it. We have also written an example Bril program that demonstrates the optimization.

Hardest Part
The most challenging part of this task was figuring out how a just-in-time compiler works. Bril provided good abstractions for us.

Testing
We created a test case in "test_jit.bril". It has some dead code in the function myloop that can be eliminated if the program executes the most commonly taken branch. Here is the pseudocode. We tried different inputs, such as 45 and 100, that cause the guard to be evaluated differently, and compared both the results and the dynamic instruction counts.
After the optimization, the Bril code looks like:
so it is clear that our algorithm successfully eliminates the dead code when we take the ".small" branch. If we run the program with argument 45, the dynamic instruction count of our program is 686, whereas that of the original bril.ts is 726. However, with argument 100, the instruction count is 1911 vs. 1606, since we execute extra instructions such as guard and speculate, and we also fail the guard, so we waste work on instructions that are eventually rolled back.
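The trace-level dead code elimination described here (once a branch is gone, delete a variable and anything used only by it) is essentially backward liveness over a straight-line trace. A rough Python sketch, assuming Bril's JSON instruction format; this is illustrative, not the authors' implementation:

```python
def dce_trace(trace, live_out):
    """Remove instructions whose results are never used later in the trace
    nor live after it. Works backwards over the straight-line trace,
    like classic local DCE."""
    # Instructions with side effects (or control relevance) are always kept.
    effects = {"print", "store", "call", "free", "speculate", "commit",
               "guard", "jmp", "br", "ret"}
    live = set(live_out)
    kept = []
    for instr in reversed(trace):
        if instr["op"] in effects or instr.get("dest") in live:
            # The destination is now defined here, so it stops being live...
            if instr.get("dest") in live:
                live.discard(instr["dest"])
            # ...and the instruction's arguments become live.
            live.update(instr.get("args", []))
            kept.append(instr)
    kept.reverse()
    return kept
```

Applying this after a branch is replaced by a guard removes definitions that only fed the eliminated arm.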
-
Will and I worked on lesson 12 together.

Summary
Implementation details
What was the hardest part of the task? How did you solve this problem?
-
Summarize what you did.
Explain how you know your implementation works—how did you test it? Which test inputs did you use? Do you have any quantitative results to report?
What was the hardest part of the task? How did you solve this problem?
-
Summarize what you did
I collaborated with @stephenverderame on this assignment. Our code and examples can be found here.

Implementation details
For this task, we decided to trace at basic block boundaries, simply because we did not want to deal with inserting new labels and jumps to stitch the trace into the program. We also stopped tracing at prints and function calls, because prints induce a side effect and we did not want to inline functions for this assignment. We first run the program to completion in the interpreter and trace the entire output. Our JIT allows the user to specify how many basic blocks they want to trace and will create a trace up to that many basic blocks (stopping, of course, at prints and calls). Because we start tracing at the beginning of main, we always start the program at the trace. We created a new label that points to the beginning of main, and all conditionals turn into guards with that label, so that if we abort we just run the program normally. Because we formulated the trace to always stop at basic block boundaries, the last instruction of the trace is always a jump to the correct basic block in the original program, allowing us to stitch the two together without modifying either the trace or the original program.

Testing
For testing, we ran armstrong, collatz, perfect, and loopfact with and without tracing. We also tested each program with different arguments. We summarized the resulting dynamic instruction counts in the following table.
What was challenging
When we started, we thought this task would be very difficult. There seemed to be a lot of uncertainty about how to do things like stitching the trace back into the program, especially in the middle of a basic block or in the middle of a function. In the end, we formulated the tracing to follow basic blocks, so the stitching was extremely straightforward and easy to reason about. We ran into a number of challenges along the way:
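The scheme of turning each conditional into a guard that aborts to a label at the start of main could be sketched like this in Python over Bril's JSON form. The `.recover` label name and the helper itself are hypothetical, not the authors' code:

```python
def branch_to_guard(instr, taken, recover_label=".recover"):
    """Convert a Bril `br` recorded during tracing into a `guard` that
    aborts to `recover_label` (where normal execution restarts).
    `taken` records which arm the trace actually followed."""
    cond = instr["args"][0]
    if taken:
        # The trace followed the true arm: guard directly on the condition.
        return [{"op": "guard", "args": [cond], "labels": [recover_label]}]
    # The trace followed the false arm: guard on the negated condition.
    return [
        {"op": "not", "dest": "__not_" + cond, "type": "bool", "args": [cond]},
        {"op": "guard", "args": ["__not_" + cond], "labels": [recover_label]},
    ]
```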
-
I worked with @SanjitBasker on this assignment.

Summary

Implementation Details
For this assignment, we implemented a JIT compiler for Bril that injects speculative traces into a Bril program while interpreting. To generate these traces, we modified the TypeScript Bril interpreter to start tracing at the start of the program and periodically stop tracing. We chose to stop tracing in the following cases: when print was called, when a function was called, when any memory instruction was executed (alloc, free, load, store; though in hindsight it might not have been strictly necessary to include load), and when a backedge was reached (we detected this by adding indices to each instruction and comparing the indices of consecutively executed instructions in the same function). We also stop tracing when a function returns.

While we keep track of these traces throughout the entire execution of the program, we only output one of them at the end. We chose to output the first nontrivial trace (which we defined as a trace that contains a branch instruction) that we come across, or, if none exists, the last trace we compute. For branch instructions within a trace, we also add an additional field that denotes whether or not the branch was taken.

To stitch the trace into the original program, we first manually added the indices of all the instructions to the original program. Then, to find the correct position at which to insert the trace, we simply compared the first index in the trace to the relevant index in the program. We start speculating, insert all the instructions from the trace while removing jumps and replacing branches with guards on the appropriate variable (injecting negated boolean variables where the branch was not taken), add a failure label to abort to if the speculation fails, and add a success label to jump to if the speculation succeeds.
If the speculation fails, the program falls back on its original code. We then remove the indices we added, to ensure proper parsing, and output the new program.

Testing
We tested our JIT compiler on a few different benchmarks with different inputs and measured the dynamic instruction counts of the baseline and traced versions of each benchmark on the given input. The benchmarks we chose were catalan, collatz, and primes-between. Our results are shown below:
From this, we can see that while the catalan and collatz benchmarks have the potential to abort tracing frequently, resulting in worse performance, the primes-between benchmark performs only slightly worse with tracing. In a real-world JIT scenario, the traced code would run much faster than the rest of the interpreted code, which means that primes-between would likely have a faster runtime with the trace our JIT compiler chose than the baseline.

Difficulties
Initially, we tried to construct all possible traces before realizing that this would be intractable. Besides that and the issue of dealing with trivial traces, though, we found this assignment to be one of the more straightforward ones.
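The stitching step described here (start speculating, inline the trace with jumps dropped and branches turned into guards, then commit) might be sketched as follows. The flat instruction-list representation and the fail/done label names are assumptions for illustration, not the authors' actual code:

```python
def stitch(prog_instrs, trace, at_index,
           fail_label=".trace_fail", done_label=".trace_done"):
    """Insert a speculated trace into a flat Bril instruction list at
    `at_index`. On guard failure, control aborts to `fail_label`, where
    the original code resumes."""
    body = [{"op": "speculate"}]
    for instr in trace:
        if instr["op"] == "jmp":
            continue  # the trace is straight-line, so jumps are redundant
        if instr["op"] == "br":
            # Replace the branch with a guard on the same condition
            # (negation for not-taken branches is omitted in this sketch).
            body.append({"op": "guard",
                         "args": [instr["args"][0]],
                         "labels": [fail_label]})
        else:
            body.append(instr)
    body += [{"op": "commit"},
             {"op": "jmp", "labels": [done_label]},
             {"label": fail_label}]
    # `done_label` must be placed after the region the trace covers;
    # that placement is omitted here.
    return prog_instrs[:at_index] + body + prog_instrs[at_index:]
```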
-
Summary

Details
I modified the

Testing
I tested the transformations on the

The following table contains selected performance measurements. Bold indicates the arguments used for tracing.
Difficulties
The tricky part of this task for me was deciding when to start/stop tracing so that the trace could be easily stitched back into the program. I ended up performing tracing at the granularity of entire functions to make stitching simpler.
-
Summary

Implementation Details
Testing
Due to the limitations of the optimizer, I only tested my implementation on several hand-picked programs. They can be found under
Challenges
The biggest challenge for me was finding a correct hot trace among the traces. A loop in the trace could appear as repetitions of "header - body" or "body - header", while only the first is correct. This required me to start scanning from the very beginning, leveraging the fact that the loop header dominates the other blocks in the loop. After ensuring that the hot trace always starts at the header, I can safely insert it before the label of the actual loop header.
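The scan-from-the-beginning idea can be sketched compactly: because tracing starts before the loop is entered, the first block label that repeats is the loop header. A hypothetical Python sketch over a list of executed block labels (not the author's code):

```python
def hot_loop(block_seq):
    """Return the header-first loop body from a sequence of executed
    block labels, or None if no block repeats. The first repeated label
    is the header, since it is reached before any block it dominates."""
    seen = {}
    for i, label in enumerate(block_seq):
        if label in seen:
            # Slice from the header's first occurrence up to (but not
            # including) its repetition: header followed by the body.
            return block_seq[seen[label]:i]
        seen[label] = i
    return None
```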
-
Summary
Details
Testing/Evaluation
I first manually verified that the result of stitching the trace is what I expect by inspecting the CFG. I also verified that the output of the stitched trace is the same as the output of the original program. Below is an example of the CFG of a program and the CFG after it is stitched with its trace. I then used the

Difficulties
The most difficult part was probably finding the right spots in the original program to insert the trace. I had to make sure that the dummy labels were inserted at positions where the speculation could break out to and retain the same behavior as the original program.
-
@xalbt, he-andy, and I worked together on lesson 12.

Summary

Implementation
Difficulties
Testing...
-
@rcplane and @zachary-kent worked together.

Summary
We successfully captured, optimized, and measured dynamic instruction counts for interpreted Bril programs.

Implementation
Tests and Results
Difficulties
Generative AI
-
@JohnDRubio, @20ashah and @AliceSzzze worked together on this task. Our implementation is here.

Implementation
We implemented a version of a tracing interpreter that uses the method JIT policy of beginning tracing at the start of each function. The interpreter traces each function until it reaches an instruction that it needs to bail out of, such as: call

To bail out, we try to backtrack the trace to the last fully executed block, commit, and then jump to the block we just bailed out of. This lets us keep the instructions traced up to the bailed-out block, instead of discarding the trace entirely whenever we encounter a problematic instruction. For example, suppose we just executed block A in our trace, and we are now executing block B, where we see a print instruction. We preserve all the traced instructions up to the end of block A, commit, and then add a jump instruction to the original block B. When we see a

We do not modify functions with no

Results
We tested the tracing JIT on all of the benchmarks in the benchmarks/core directory using brench. We list the dynamic instruction count changes for baseline, tracing, and tracing + LVN + trivial DCE below. Unsurprisingly, tracing performed poorly on recursion-heavy benchmarks, as we did not inline functions, and the path of each recursive call probably differs just enough to make the tracing inapplicable. The benchmarks that benefit from tracing usually benefit from the lack of

Summary statistics
[Plot: % dynamic instruction count change]
[Breakdown of % dynamic instruction count change by benchmark]
Testing on different inputs (using the given args to generate traced code)
A benchmark that tracing did well on:
A benchmark that tracing did badly on:

Challenges
This was a fun task, but there were definitely a few tricky parts. One challenge was correctly bailing out of traced code when a guard failed. Another tricky part was considering how to handle interprocedural tracing. We didn't end up implementing interprocedural traces, but after some discussion, we believe the interprocedural traces worth using would be traces with calls to functions that contain neither branches nor recursion. We believe this would prevent "overfitting": cases where the traced function execution is too specific and is not likely to occur exactly the same way again. Moreover, there does not seem to be a good way to bail out when you are a few recursive calls deep without copying the code from the inlined function altogether.

*updated after bug fix
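The backtrack-and-commit bail-out described in this reply could look roughly like the following in Python, assuming the trace interleaves label markers with instructions (an assumed representation, not the authors' code):

```python
def trim_to_block_boundary(trace, resume_label):
    """Backtrack a trace to the end of the last fully executed block,
    commit, and jump to the block that triggered the bail-out
    (`resume_label`)."""
    # Drop instructions belonging to the partially executed block,
    # back to that block's label marker.
    while trace and "label" not in trace[-1]:
        trace.pop()
    if trace:
        trace.pop()  # remove the aborted block's label itself
    # End the trace cleanly: commit the speculation, then resume
    # normal execution at the bailed-out block.
    trace.append({"op": "commit"})
    trace.append({"op": "jmp", "labels": [resume_label]})
    return trace
```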
-
I worked on this task with @emwangs.

Summary
Here is the link to the repository: brili-tracing

Implementation
Testing and Evaluation
-
Here's the thread for the dynamic compilers task, which involves doing some speculative transformations on Bril IR!