-
Getting Started with LLVM (Lesson 7)

To get started with LLVM I implemented a simple LLVM pass that goes over each instruction in a program and checks whether the instruction is commutative. If so, it swaps the order of the instruction's operands (a minimal sketch appears below). The pass is in the linked repository.

Testing

I implemented a simple C program to check the effect of the pass. First, build the pass.
Test the pass
Optionally, build the processed LLVM IR and compare the output.
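A minimal sketch of what the core of such a pass can look like under the new pass manager (an illustration, not the author's exact code; the struct name is made up):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct SwapCommutativePass : PassInfoMixin<SwapCommutativePass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    bool Changed = false;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (auto *BO = dyn_cast<BinaryOperator>(&I))
          if (BO->isCommutative()) { // add, mul, and, or, xor, fadd, fmul, ...
            BO->swapOperands();      // a + b  ==>  b + a (semantics preserved)
            Changed = true;
          }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```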
Experience

I found the LLVM infrastructure to be pretty good, other than installing it, of course. Once I figured out the …
-
Summary

My code is here: llvm. I implemented a simplified version of what inlining would look like. To make it simpler to implement, I limited the function being inlined into to be main, and the functions being inlined could only be 1 basic block long with exactly 1 return instruction. Further, the functions to inline cannot contain calls to any other functions. All these conditions trivialize the pass I have written; to make it more robust, one would want to handle functions that have multiple basic blocks, many exit points, and possibly calls to other functions, as long as those calls do not form a cycle in the call graph. I did this using a module pass, where I first read over all functions and check whether each is main or not. For functions that are not main, I check whether they satisfy the conditions mentioned above. For those that do, I inline the function into main by setting up a builder and copying instructions one by one (a sketch of this step appears at the end of this post). I make sure to remap the function's arguments to the corresponding values from main, and I likewise remap the return value so that its uses refer to the correct values in main.

Testing

I tried the pass on a simple test case, where the function to be inlined was a simple addition function that adds 2 integers and returns the sum. This addition function can be seen in …. I then tried a real-world program, specifically Conway's Game of Life. I use an implementation from https://rosettacode.org/wiki/Conway%27s_Game_of_Life#C, and all credit should go to this source.

Challenges

LLVM was challenging to work with. Reading the LLVM documentation was tricky, as I often had to search various LLVM doxygen pages until I found the information I needed. In terms of using LLVM, I ran into several troubles. First, I realized late on that I needed to clone instructions rather than reuse them. Next, I also realized I had to insert instructions after iterating, to avoid invalidating the iterator. I often forgot to check for terminator instructions, which initially caused some bugs. Removing instructions is also difficult, as one needs to remove all uses of an instruction before it can be deleted. Finally, figuring out how to set up a module pass was tricky: I searched online, found a Stack Overflow post, and that post finally led me back to Issue #7 on the llvm-for-grad-students repository.
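A condensed sketch of the per-call-site work described above, under the same restrictions (single-block callee with exactly one ret and no inner calls); all names are illustrative, not the author's exact code:

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Inline one call to a single-block, single-return callee.
static void inlineSingleBlockCall(CallInst *Call) {
  Function *Callee = Call->getCalledFunction();
  IRBuilder<> Builder(Call); // cloned instructions go just before the call
  DenseMap<Value *, Value *> VMap;

  // Map each formal parameter to the actual argument at this call site.
  unsigned Idx = 0;
  for (Argument &A : Callee->args())
    VMap[&A] = Call->getArgOperand(Idx++);

  Value *RetVal = nullptr;
  for (Instruction &I : Callee->getEntryBlock()) {
    if (auto *Ret = dyn_cast<ReturnInst>(&I)) {
      if (Value *RV = Ret->getReturnValue()) {
        Value *Mapped = VMap.lookup(RV);
        RetVal = Mapped ? Mapped : RV; // remap the returned value too
      }
      break;
    }
    Instruction *Clone = I.clone(); // clone; never reuse the callee's body
    for (unsigned Op = 0; Op < Clone->getNumOperands(); ++Op)
      if (Value *Mapped = VMap.lookup(Clone->getOperand(Op)))
        Clone->setOperand(Op, Mapped);
    Builder.Insert(Clone);
    VMap[&I] = Clone; // later instructions may refer to this one
  }

  if (RetVal)
    Call->replaceAllUsesWith(RetVal); // uses of the call see the return value
  Call->eraseFromParent();
}
```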
-
Code here.

Summary

I decided to use LLVM to instrument a program to generate a memory trace. I have previously done this using Intel PIN as a TA for CS 3410, as we needed a realistic multithreaded memory trace. I followed the tutorial pretty closely, and used the documentation to look up how to identify load and store instructions. For each of these, I inserted a call to a logging function which I implemented in C; this function prints the thread ID, read/write, and the address (a sketch of the instrumentation step appears at the end of this post).

Testing

I ran the multithreaded matrix multiplication I used in 3410 through it, and it seemed to work fine. The original execution still worked, and the trace looked reasonable.

Experience

The experience of using LLVM was much more enjoyable and painless than using PIN (not to say PIN is bad, but it was nice to operate at a slightly higher level). That said, the memory trace is only accurate for non-register-allocated code, since LLVM hasn't yet gotten things off the stack and into registers. One challenge I faced was that LLVM is a little more strict than C about pointer types: to produce the trace, I wanted to print the actual memory address, which could be of any pointer type (I used …).
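The per-instruction step might look roughly like this, assuming a C runtime function `void log_access(void *addr, int is_write)` (the real logger also records the thread ID; the function name is made up). The bitcast to i8* is one way around the pointer-type strictness mentioned above, in the LLVM 14-era typed-pointer API:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert a call to the C logger before every load and store.
static void instrumentMemOps(Function &F) {
  Module *M = F.getParent();
  LLVMContext &Ctx = M->getContext();
  FunctionCallee LogFn =
      M->getOrInsertFunction("log_access", Type::getVoidTy(Ctx),
                             Type::getInt8PtrTy(Ctx), Type::getInt32Ty(Ctx));

  for (Instruction &I : instructions(F)) {
    Value *Ptr = nullptr;
    int IsWrite = 0;
    if (auto *LD = dyn_cast<LoadInst>(&I)) {
      Ptr = LD->getPointerOperand();
    } else if (auto *ST = dyn_cast<StoreInst>(&I)) {
      Ptr = ST->getPointerOperand();
      IsWrite = 1;
    }
    if (!Ptr)
      continue;
    IRBuilder<> B(&I); // the logging call goes just before the access
    Value *Addr = B.CreateBitCast(Ptr, Type::getInt8PtrTy(Ctx));
    B.CreateCall(LogFn, {Addr, B.getInt32(IsWrite)});
  }
}
```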
-
Summary

My code can be found here. I basically implemented a pass that deoptimizes integer multiplication into repeated addition.

Walkthrough of implementation: traverse through all blocks; if a block contains an eligible multiply, … (a sketch of the core rewrite follows).
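A sketch of what the core rewrite can look like (the constant bound is illustrative, and erasure is deferred so the iterator over the block stays valid):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Rewrite x * n (n a small positive constant) into x + x + ... + x.
static void demoteMulsToAdds(BasicBlock &BB) {
  SmallVector<Instruction *, 8> ToErase;
  for (Instruction &I : BB) {
    auto *BO = dyn_cast<BinaryOperator>(&I);
    if (!BO || BO->getOpcode() != Instruction::Mul)
      continue;
    auto *C = dyn_cast<ConstantInt>(BO->getOperand(1)); // constant usually on RHS
    if (!C || C->getValue().isNonPositive() || C->getValue().ugt(8))
      continue; // small positive constants only, to keep the chain short
    uint64_t N = C->getZExtValue();
    IRBuilder<> B(BO);
    Value *X = BO->getOperand(0);
    Value *Sum = X;
    for (uint64_t i = 1; i < N; ++i)
      Sum = B.CreateAdd(Sum, X);
    BO->replaceAllUsesWith(Sum);
    ToErase.push_back(BO); // defer erasure; we are iterating over the block
  }
  for (Instruction *I : ToErase)
    I->eraseFromParent();
}
```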
Why deoptimize?

Not many good reasons. Maybe some target instruction set doesn't have integer multiplication? It can also be used for testing certain optimizations, especially ones which might turn repeated addition into multiplication...

Testing

I found a matrix multiplication program ….

Experience

LLVM and Clang were rather surprisingly easy to set up. I just downloaded them from my OS's package manager (pacman on Manjaro Linux), and voila, they basically worked out of the box, besides stderr not showing up at first. One thing that was actually quite helpful was the type system involved in LLVM: my editor would complain frequently about my code when it was wrong. One thing that was difficult was figuring out what a function does from its name alone. For example, …
-
My code is here. I went for the unambitious route: I print a message every time I encounter a load instruction! I modified Skeleton.cpp as needed, wrote a simple runtime library that is only compiled once, and generated a few example cases. You may notice an attempt to use Turnt; this was eventually scaled down, and Turnt is now just used to generate .o files for all the example cases in one go. Sadly, it is still necessary to link the .o files and run the executables by hand. You may also notice that my examples have a float/int division slant. Indeed, my initial goal was to print one thing for integer divisions and another for float divisions. That doesn't seem all that hard, and I'm sure I will be shown how easy it is, but I did get a bit lost in the LLVM jungle: I found it tricky to eke out what kind of operands a BinaryOperator was working with. In any case, I ended up going with …
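For reference, one route is to look at the opcode rather than the operands, since integer and float division are distinct opcodes in LLVM IR (a sketch, not necessarily the approach the pass ended up with):

```cpp
#include "llvm/IR/Instruction.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// No operand inspection needed: the opcode already distinguishes the two.
static void classifyDivision(const Instruction &I) {
  switch (I.getOpcode()) {
  case Instruction::SDiv: // signed integer division
  case Instruction::UDiv: // unsigned integer division
    errs() << "integer division\n";
    break;
  case Instruction::FDiv: // floating-point division
    errs() << "float division\n";
    break;
  default:
    break;
  }
}
```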
-
Code is here.

Implementation

My (unambitious) implementation involved inserting a print statement saying "I saw a plus!" after every integer add operation. I did this by attempting to cast each instruction to BinaryOperator, and if that was successful, I checked whether the instruction's opcode was equal to 13. If it was, I inserted a call to a function that prints "I saw a plus!" (in the same way we did with rtlib.c in lecture).

Testing

I tested my implementation on a merge sort example, which seemed to work correctly.
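A sketch of the same check using the named opcode constant (13 happens to be Add's numeric value in this LLVM version, but Instruction::Add is stable across versions); the runtime function name is illustrative:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Instrument every integer add with a call to a runtime print function.
static void instrumentAdds(Function &F) {
  Module *M = F.getParent();
  FunctionCallee SawPlus =
      M->getOrInsertFunction("saw_plus", Type::getVoidTy(M->getContext()));
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *BO = dyn_cast<BinaryOperator>(&I))
        if (BO->getOpcode() == Instruction::Add) { // named constant, not 13
          IRBuilder<> B(BO->getNextNode()); // insert the call after the add
          B.CreateCall(SawPlus);            // the call is not a BinaryOperator,
        }                                   // so the loop won't re-trigger on it
}
```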
-
My LLVM pass is here.

Summary

I implemented a simple pass that converts multiplies by a power-of-2 constant into a shift left. On some hardware a shift left is significantly faster than a multiply. Implementing this pass was pretty straightforward, but required using some LLVM utilities to determine whether the multiply included a constant power of 2.

Evaluation

I wrote a simple C program that multiplies the numbers from 0 to 1024 by 8 to confirm that my optimization works. Unfortunately, even when increasing the number of multiplies, I wasn't able to see a noticeable difference in performance with my optimization. I believe this comes down to the specific details of the Intel x86 implementation, as well as optimizations that occur during LLVM code generation. To help limit the optimization during LLVM code generation, I changed the multiply to a non-power of 2 and still could not notice any improvement in overall performance. One possibility is that the performance hit of multiplication was hidden by the loop instructions because of how out-of-order processors work.

Experience

The most challenging part of this experience was getting a working combination of LLVM and Clang. It appears that the default versions of LLVM and Clang on Arch Linux have slightly different build options that prevented the skeleton code from working. I ended up having to build and install LLVM and Clang from scratch, which put a pretty severe load on my laptop. I was able to limit that by building in release mode, which significantly reduces the amount of memory needed during linking.
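The core of such a rewrite can be expressed with LLVM's APInt helpers; a sketch (not the author's exact code, with erasure deferred to keep iteration safe):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Rewrite x * (1 << k) into x << k.
static void strengthReduceMuls(BasicBlock &BB) {
  SmallVector<Instruction *, 8> ToErase;
  for (Instruction &I : BB) {
    auto *BO = dyn_cast<BinaryOperator>(&I);
    if (!BO || BO->getOpcode() != Instruction::Mul)
      continue;
    auto *C = dyn_cast<ConstantInt>(BO->getOperand(1)); // constant usually on RHS
    if (!C || !C->getValue().isPowerOf2())
      continue;
    IRBuilder<> B(BO);
    Value *Shl = B.CreateShl(BO->getOperand(0), C->getValue().logBase2());
    BO->replaceAllUsesWith(Shl);
    ToErase.push_back(BO);
  }
  for (Instruction *I : ToErase)
    I->eraseFromParent();
}
```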
-
Flip Pass

Intro

This is my fork of llvm-pass-skeleton: https://github.com/gsvic/llvm-pass-skeleton/tree/vg292/lesson-7. My code can be found in the flippass directory.

Details

I implemented a very simple pass that flips the operator in a comparison instruction (…).

Test

Given a very simple program like …
In this case, without the flip-pass the output should be …. Run …
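A minimal sketch of one way to implement such a flip, by rewriting each integer comparison to its inverse predicate; the actual pass may flip differently (e.g. by swapping operands):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Rewrite every integer comparison to its inverse predicate,
// e.g. slt becomes sge, eq becomes ne.
static void flipComparisons(Function &F) {
  for (Instruction &I : instructions(F))
    if (auto *Cmp = dyn_cast<ICmpInst>(&I))
      Cmp->setPredicate(Cmp->getInversePredicate());
}
```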
-
In this task, I implemented a memory access analysis pass for nested loops. I compiled and ran my pass using the latest LLVM 14 branch with the new pass manager. My code can be found here.

Background

When C/C++ programs are lowered to LLVM IR, they lose the high-level structure of loops and memory accesses. Loops are translated into several basic blocks (preheader, body, latch, etc.), and a memory access like A[i + 1] is expanded into a sequence of instructions:

```
%12 = load i32, i32* %i, align 4
%add13 = add nsw i32 %12, 1
%idxprom14 = sext i32 %add13 to i64
%arrayidx15 = getelementptr inbounds [10 x i32], [10 x i32]* %A, i64 0, i64 %idxprom14
%13 = load i32, i32* %arrayidx15, align 4
```

Implementation

To find the loops in LLVM IR, I reuse LLVM's built-in loop analysis. I created an instruction visitor class that walks the load and store instructions inside each loop (a minimal sketch of this traversal appears at the end of this post).

Test

To run my pass on real programs, I first use clang to emit the LLVM IR, and then run my pass over it. The following test program shows several different memory access patterns. Currently, I only consider 1D arrays, so higher-dimensional arrays must first be flattened into 1D.

```c
int main() {
int A[10], B[10], C[10];
// simple
for (int i = 0; i < 10; ++i) {
C[i] = A[i] + B[i];
}
// 1D stencil
for (int j = 1; j < 9; ++j) {
B[j] = A[j - 1] + A[j] + A[j + 1];
}
// complex indices
for (int k = 1; k < 3; ++k) {
C[k * 2] = A[k * 3 + 1] + B[k / 2];
}
// 2D: matmul
for (int i = 0; i < 3; ++i)
for (int j = 0; j < 3; ++j)
for (int k = 0; k < 3; ++k)
C[i * 3 + j] = A[i * 3 + k] * B[k * 3 + j];
return 0;
}
```

The output of my pass is shown below.

```
Loop 0:
Load: %28 = load i32, i32* %arrayidx52, align 4
Original: A[i37*3+k45]
Load: %31 = load i32, i32* %arrayidx56, align 4
Original: B[k45*3+j41]
Store: store i32 %mul57, i32* %arrayidx61, align 4
Original: C[i37*3+j41]
Loop 1:
Load: %18 = load i32, i32* %arrayidx27, align 4
Original: A[k*3+1]
Load: %20 = load i32, i32* %arrayidx29, align 4
Original: B[k/2]
Store: store i32 %add30, i32* %arrayidx33, align 4
Original: C[k*2]
Loop 2:
Load: %9 = load i32, i32* %arrayidx9, align 4
Original: A[j-1]
Load: %11 = load i32, i32* %arrayidx11, align 4
Original: A[j]
Load: %13 = load i32, i32* %arrayidx15, align 4
Original: A[j+1]
Store: store i32 %add16, i32* %arrayidx18, align 4
Original: B[j]
Loop 3:
Load: %2 = load i32, i32* %arrayidx, align 4
Original: A[i]
Load: %4 = load i32, i32* %arrayidx2, align 4
Original: B[i]
Store: store i32 %add, i32* %arrayidx4, align 4
  Original: C[i]
```

We can see my pass exactly captures the four loops in the program (in reverse order), and recovers the original arrays and indices for the load/store instructions, even for complex indices and multi-dimensional nested loops. Notice that some of the variables are renamed by LLVM, so the output contains names like i37, j41, and k45.

Discussions

Though I finally made my pass work, I still have to say the documentation of LLVM is reeeeally terrible. In the very beginning, I thought the tutorial page would reflect the latest changes. However, it didn't, and was even somewhat misleading. After searching on Google, I could only find out how to get the new pass manager to work from a discussion thread! Moreover, even though the doxygen provides the class method signatures, it is still hard to understand what the methods are used for and how to use them. Actually, I found that pulling the whole llvm-project and directly searching for methods offline may be more helpful: looking at real code snippets, with context before and after the function calls, is much easier to understand than the documentation alone.

Since I've been working on an MLIR project for several months, I can see lots of similarities between MLIR and LLVM, including how they traverse the function body and what the facilities look like. But I think the biggest difference between them may be the abstraction level. While LLVM IR is low-level and close to real machines, many MLIR dialects are high-level and provide more facilities to operate on loops and memory accesses. At first, I wanted to do some real loop optimizations in LLVM, but I found it was hard to print out readable code for loops, and I could not easily convert the IR back to C/C++. With the affine/scf dialects in MLIR, however, we can dump the for loops in a human-readable, C-like way, and it is also more intuitive to view and access the loops when doing optimizations like interchange or fusion. After some programming practice in LLVM and MLIR, I now understand the rationale behind the several layers of IR abstraction in today's deep learning compilers: some machine-dependent optimizations are easier to express at a low level of abstraction, while loop and memory optimizations are easier to implement with a high-level IR, so progressive lowering helps realize each optimization at a suitable abstraction level while gradually adding hardware details to the program.
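As promised above, a minimal sketch of the loop walk using the new pass manager's LoopAnalysis (LLVM 14-era API); the real pass additionally recovers the arrays and index expressions from the getelementptr chains:

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

struct MemAccessPrinter : PassInfoMixin<MemAccessPrinter> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
    auto &LI = FAM.getResult<LoopAnalysis>(F);
    unsigned N = 0;
    for (Loop *L : LI) { // top-level loops, in LoopInfo's order
      errs() << "Loop " << N++ << ":\n";
      for (BasicBlock *BB : L->blocks())
        for (Instruction &I : *BB)
          if (isa<LoadInst>(I) || isa<StoreInst>(I))
            errs() << "  " << I << "\n";
    }
    return PreservedAnalyses::all();
  }
};
```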
-
LLVM Pass: Global Value Numbering (GVN)

I implemented a Common Subexpression Elimination (CSE) pass using GVN, following this course blog. To run the GVN pass: …
The IR after the pass is printed in the console: …
Notice that the redundant computation has been eliminated.

Testing

I developed my own test suite, and use LLVM's lit and FileCheck as testing tools. To run the test suite: lit tests

In the tests, I use FileCheck directives to check that the common subexpressions are indeed eliminated. For example,

```
// CHECK: add nsw i32 %{{[0-9]+}}, 2
// CHECK-NOT: add nsw i32
// CHECK: store i32 %{{[0-9]+}}, i32* %{{[0-9]+}}
```

The above checks state that there should not be another add instruction between the first matched add and the store.
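For reference, the core of a local CSE over binary operators can be keyed on (opcode, operands); this is only a sketch, not the GVN structure from the course blog:

```cpp
#include <map>
#include <tuple>
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Local CSE over binary operators, keyed on (opcode, lhs, rhs). Because earlier
// replacements rewrite the uses, later duplicates map to the same key.
static void localCSE(BasicBlock &BB) {
  std::map<std::tuple<unsigned, Value *, Value *>, Value *> Seen;
  SmallVector<Instruction *, 8> ToErase;
  for (Instruction &I : BB) {
    auto *BO = dyn_cast<BinaryOperator>(&I);
    if (!BO)
      continue;
    auto Key =
        std::make_tuple(BO->getOpcode(), BO->getOperand(0), BO->getOperand(1));
    auto It = Seen.find(Key);
    if (It != Seen.end()) {
      BO->replaceAllUsesWith(It->second); // reuse the earlier computation
      ToErase.push_back(BO);
    } else {
      Seen.emplace(Key, BO);
    }
  }
  for (Instruction *I : ToErase)
    I->eraseFromParent();
}
```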
-
My code is here. I implemented a pass that inserts yielding points into busy-waiting loops in C programs.

Background

Yielding points allow the OS to re-schedule threads instead of keeping them running forever, which is very useful when implementing busy-waiting loops. Consider the following code: …
In this form, the loop will run until it sees the value it is waiting for, occupying its CPU the whole time. Now compare it with the same loop after a yield() call is inserted into its body: …
In the second case, our busy-waiting loop will NOT take the whole CPU time and will allow some other threads to get scheduled on that CPU while waiting. ImplementationMy pass is very simple. It works as follows:
Since I'm not sure how to insert calls to arbitrary functions, I wrote a helper function that implement the yield() functionality and in my LLVM pass, I'm looking for that function to get the right reference. Not the best way, definitely there should be better ways to get there. EvaluationFor evaluation, I use this example program. It runs 4 busy-waiting threads and one benchmarking thread that measures its execution time. If we run this benchmark in a constrained environment (say by only giving 4 CPU cores to it with Example of the pass run (the code dump is the new busy-waiting loop with yielding() points:
Some measurements:
|
-
Pong Instrumentation

Sorry to be late! My code is here.

Implementation

Since the assignment said to try to run it on real C code, I found an implementation of pong to use, since it's a relatively simple program where it is easy-ish for me to tell that it is still working. I needed to install a graphics library called SDL for this pong implementation to work. After looking through the generated LLVM IR instructions, it seemed like there were a bunch of calls to SDL, so I decided that a relatively ecologically valid task would be to try to measure how long we spend in SDL calls. I followed the tutorial through the rtlib section so that I was able to instrument the code by adding a function that logs the start of a call to SDL, and one that logs the end (a rough sketch of this bracketing step appears at the end of this post). The start function just records when we started; the end function records when we finished, adds the difference to a global counter, and then every quarter second of CPU time prints out where we are. I didn't think much about how to exclude my instrumentation logic from the timing, so if it's adding a lot of overhead then the estimates are off. Also, I think I'm measuring CPU time instead of wall time, but I'm not entirely sure.

Testing

Pong seems to still run without crashing, and the code gives results like this: …
Challenges

Oof. There were challenges. The vast majority of my time on this project was spent trying to make an LLVM pass that modified the output at all, with pretty smooth sailing after that worked. The two big challenges were: …
My two big lessons were: …
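Roughly the bracketing described in the Implementation section, assuming C helpers `void sdl_begin(void)` and `void sdl_end(void)` (both names illustrative):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Wrap every call whose callee name starts with "SDL_" in begin/end hooks.
static void bracketSDLCalls(Function &F) {
  Module *M = F.getParent();
  LLVMContext &Ctx = M->getContext();
  FunctionCallee Begin = M->getOrInsertFunction("sdl_begin", Type::getVoidTy(Ctx));
  FunctionCallee End = M->getOrInsertFunction("sdl_end", Type::getVoidTy(Ctx));

  SmallVector<CallInst *, 16> SDLCalls;
  for (Instruction &I : instructions(F))
    if (auto *Call = dyn_cast<CallInst>(&I)) {
      Function *Callee = Call->getCalledFunction();
      if (Callee && Callee->getName().startswith("SDL_"))
        SDLCalls.push_back(Call); // collect first, mutate afterwards
    }

  for (CallInst *Call : SDLCalls) {
    IRBuilder<> B(Call);
    B.CreateCall(Begin);                   // timestamp just before the SDL call
    B.SetInsertPoint(Call->getNextNode()); // and just after it
    B.CreateCall(End);
  }
}
```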
-
My implementation is here. Honestly, there were a good number of failures with this one, but I'll get to that later. Installing clang was fairly straightforward. I actually use a Docker container running Debian for this class, so installation was even more straightforward, as I don't have much on the container besides Bril-specific utilities. I wanted to make a pass that adds together constants, so I started hacking this solution together before really checking what example programs would emit in LLVM IR. My solution was to check if an add operation had two constant operands, and add them together. What I failed to realize was that the IR actually looks like this for a simple sum of two constants: …
This is a problem: even though I'm adding two constants at the source level, the frontend emits stores of 0 and 82 to stack slots, followed by loads and an add, so the add's operands are never the constants I was looking for. I also installed perf to check wall-clock time, but since no optimizations are actually being made, there was not much difference. The end result is an unambitious "print something if two numbers are added" pass; in my case, I'm just printing the instruction. I'm wondering if anyone else ran into this problem?
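For illustration, the kind of fold being attempted looks like this sketch (not the author's exact code); on -O0 output it never fires, because the bail-out in the middle always triggers:

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Fold `add C1, C2` when both operands really are constants.
static void foldConstantAdd(BinaryOperator *BO) {
  if (BO->getOpcode() != Instruction::Add)
    return;
  auto *L = dyn_cast<ConstantInt>(BO->getOperand(0));
  auto *R = dyn_cast<ConstantInt>(BO->getOperand(1));
  if (!L || !R)
    return; // at -O0 the operands are loads of stack slots, so we bail here
  Constant *Sum =
      ConstantInt::get(BO->getContext(), L->getValue() + R->getValue());
  BO->replaceAllUsesWith(Sum);
  BO->eraseFromParent();
}
```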
-
My implementation is here. I took a similar approach to @orkosinha, and set out to implement constant folding on multiplication instructions. Okay, I actually set out to implement constant folding on add instructions as well, but discovered there would be slightly fewer lines of code involved if I went with multiplication only. I ran into similar problems: intuitively, I expected constants to be represented as their own objects (similar to bril), but the intermediate representation actually uses stores and loads when dealing with variables that are assigned constants. In the following example, we store 15 somewhere and then load it into a new "register" so that we can multiply it by 1: …
Additionally, I realized that LLVM already performs some constant folding for us! So an expression like 3 * 5
is represented in the IR exactly like the instructions above; LLVM (or clang?) computed 3 * 5 = 15 before we begin processing the code. However, it does not perform constant propagation, so this is what I set out to do. My local propagation pass keeps track of each store instruction that stores a constant. If a load instruction loads from a location where a constant was last stored, I replace every use of the load instruction with the constant. I keep track of which pointers contain constants using a map (a sketch of this appears at the end of this post). To actually perform constant folding on mul instructions, …. Manually inspecting the IR before and after my pass confirms that it is indeed replacing uses of load instructions with the appropriate constants, but my implementation leaves you with a lot of dead code. First, I ran into an issue where this for loop …
leaves me with an infinite loop. I have no idea why (most likely because the loop was mutating the instruction list while iterating over it, which invalidates the iterator); replacing it with … fixed it. Additionally, …
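A sketch of the local store-to-load propagation described above (one basic block at a time, and assuming loads and stores through a pointer use the same type, as in the -O0 code here):

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Forward the last constant stored through each pointer to later loads in the
// same block. Calls and non-constant stores invalidate the recorded facts.
static void propagateLocalConstants(BasicBlock &BB) {
  DenseMap<Value *, ConstantInt *> LastConst;
  for (Instruction &I : BB) {
    if (auto *ST = dyn_cast<StoreInst>(&I)) {
      if (auto *C = dyn_cast<ConstantInt>(ST->getValueOperand()))
        LastConst[ST->getPointerOperand()] = C;
      else
        LastConst.erase(ST->getPointerOperand());
    } else if (auto *LD = dyn_cast<LoadInst>(&I)) {
      auto It = LastConst.find(LD->getPointerOperand());
      if (It != LastConst.end())
        LD->replaceAllUsesWith(It->second); // the load is now dead code
    } else if (isa<CallInst>(&I)) {
      LastConst.clear(); // a call may write memory we are tracking
    }
  }
}
```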
-
My code is here.

Implement a Pass

I chose to implement an unambitious pass: it finds integer division instructions and transforms them. This was achieved by using …. After building the pass, use the following command to run it: …
When compiling, the pass would print …; when executing, we can see the result of ….
Output: …
Testing

I just wrote another simple program …. The running command is the same.
This will print the Integer Division Instruction and the transformed instruction.
Discussion

I was planning to transform floating-point division instructions into floating-point multiplication instructions (…) too. I tried, but didn't find a function like … to do it.
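For what it's worth, one possible route to the fdiv-to-fmul rewrite is to build the reciprocal with APFloat when the divisor is a constant. This is only a sketch: the rewrite is not exact for all values, which is why real compilers guard it behind fast-math flags:

```cpp
#include "llvm/ADT/APFloat.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Rewrite x / C into x * (1 / C) when the divisor is a float constant.
static void divToMul(BinaryOperator *BO) {
  if (BO->getOpcode() != Instruction::FDiv)
    return;
  auto *C = dyn_cast<ConstantFP>(BO->getOperand(1));
  if (!C)
    return;
  APFloat Recip(C->getValueAPF().getSemantics(), 1); // 1.0 in the same format
  Recip.divide(C->getValueAPF(), APFloat::rmNearestTiesToEven);
  IRBuilder<> B(BO);
  Value *Mul =
      B.CreateFMul(BO->getOperand(0), ConstantFP::get(B.getContext(), Recip));
  BO->replaceAllUsesWith(Mul);
  BO->eraseFromParent();
}
```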
-
Introduction

I decided to implement the path profiling algorithm presented in the Ball paper in LLVM. This ended up being a much larger project than I expected; I didn't realize how involved the algorithm to construct the … would be. Because this was an LLVM-focused project, I spent most of my time on the LLVM-related parts. As a result, I did not implement the ….

Implementation

The project broke down into these main parts: assigning values to the edges, getting the chords of the maximum spanning tree, assigning increments to the chords, initializing instruments, placing instruments, and dumping instruments.

Assigning Values to the CFG Edges

This part of the algorithm assigns values to CFG edges based on the number of paths through them.
This part of the algorithm required me to understand how LLVM CFGs work. At first I did not use the …. This part of the algorithm also required me to implement a reverse topological ordering of the basic blocks. In Bril, we could identify blocks by name; in LLVM, it seemed easiest to identify …
Testing Assigning Values to the CFG Edges

In order to test this, I constructed some simple C++ programs that had different CFG structures and wrote to ….

Get Chords

This part of the project required me to add dummy edges from the exits to the entry. In the paper, there was only one exit, but LLVM programs can return from multiple program points. I wondered whether I should add an edge from each exit to the entry, or have all exits go to a common exit and that common exit to the entry. After considering it for a while, I believe either method would work, but the second approach is redundant: since the dummy edges carry no edge weight when calculating the maximum spanning tree, they have a de facto weight of 0, so either approach has no impact on the max spanning tree. I opted for the first approach. Then I negated all edges, ran Kruskal's, got the minimum spanning tree back, and negated the edges back to normal. I implemented all parts of Kruskal's using a naive "check all edges for a cycle and add if no cycle" strategy. I did not complete my implementation of the cycle check, but I believe the Kruskal's algorithm is correct once that is implemented. Next, to find the chords, I iterated over all edges and added them to a chord list if they were not part of the max spanning tree.

Assigning Increments

The next part was to assign increments. Ball presents a worklist-style algorithm, which I implemented. However, the part of this algorithm that actually assigns the increments comes from another paper of his. I skimmed that paper but did not implement it in my code due to time; instead, I opted to make up fake increment values.

Initializing Increments

Now we get to the fun LLVM-related parts of this project; from here on is where I spent most of my time. This part requires us to initialize an array of counters:
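A sketch of that initialization as a zero-initialized module-level global (the element type, size, and name are illustrative):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Create a zero-initialized global array with one counter per path.
static GlobalVariable *createPathCounters(Module &M, uint64_t NumPaths) {
  LLVMContext &Ctx = M.getContext();
  ArrayType *CounterTy = ArrayType::get(Type::getInt64Ty(Ctx), NumPaths);
  return new GlobalVariable(M, CounterTy, /*isConstant=*/false,
                            GlobalValue::InternalLinkage,
                            ConstantAggregateZero::get(CounterTy),
                            "path.counters");
}
```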
Although this function is short and concise, I spent a lot of time constructing it. First I used this documentation to create an …. The next interesting part of inserting this was figuring out where to insert. I used the ….

Placing Instruments

The next step was to place the actual instrumentation. The Ball algorithm increments along an edge, but LLVM instructions occur within basic blocks. At first I couldn't figure out how to create a phi node -- the constructor did not take in all the information required to create one. Specifically, it did not take in the blocks and values to use. This seems like a poor design choice by the LLVM creators: creating a phi node with some but not all of the required information and then calling methods to fill in the rest felt wrong. Perhaps a …. Another thing that gave me trouble when creating phi nodes was where to get the INC values from. Did I need to create an instruction in the previous block that put the inc value into a register? I ended up being able to pass a … (a sketch of the two-step phi construction appears at the end of this post).

Dump Values

The last step was to dump the increment values. I did not get to implementing this, but I assume I would write some C++ code and have LLVM insert calls to it, similar to how the blog uses ….

Reflection

Overall, I was a little bit upset with how much I got this to "work". It really turned out to be a much bigger project than I expected, and I can't discuss or present any evaluation since I did not implement all the parts. However, as someone who will be an LLVM contributor next year, I found this project to be extremely valuable. I've seen some of the things that are easy in LLVM, some of the things that are hard, and have had an opportunity to think about what could be improved. I hope I can take this experience with me and do something with what I've learned from this project.
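For reference, the two-step phi construction discussed above (create first, then register each incoming value/block pair; all names and values illustrative):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Create the phi first, then register each incoming (value, predecessor) pair.
static PHINode *makeIncrementPhi(BasicBlock *Header, BasicBlock *PredA,
                                 uint64_t IncA, BasicBlock *PredB,
                                 uint64_t IncB) {
  Type *I64 = Type::getInt64Ty(Header->getContext());
  PHINode *Phi = PHINode::Create(I64, /*NumReservedValues=*/2, "path.inc",
                                 &*Header->begin()); // phis must lead the block
  Phi->addIncoming(ConstantInt::get(I64, IncA), PredA);
  Phi->addIncoming(ConstantInt::get(I64, IncB), PredB);
  return Phi;
}
```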
-
My code is here. Sorry for being super late. I had a more ambitious target of building a generalized loop unrolling algorithm by finding the greatest common predicate, but somehow I couldn't get the algorithm to work properly. So instead, for this assignment I wrote a very unambitious pass which tries to canonicalize binary computations by simply swapping the two arguments if the first argument is larger than the second. It's probably not useful, since in these cases we could simply do constant folding instead. I tested my code on the PI solver and also some dummy .c programs in my repo, by manually observing the LLVM IR after running the pass. I was hoping that my loop unrolling would work properly so that I could have done some more interesting testing. The biggest obstacle I had was not even LLVM itself: I am working on macOS with an M1 chip, and I had some issues when building and running LLVM. I thought the hardware was the issue, but it turned out that I had duplicated libraries, and when I upgraded the Xcode tools it renamed some of the library links, so I got some subtle errors which didn't seem related to the actual problem.
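Under one reading of "larger" (both operands constant, and the first compares signed-greater), the canonicalization might look like this sketch:

```cpp
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Swap the operands of a commutative op when the first constant is larger.
static void canonicalizeOperands(BinaryOperator *BO) {
  if (!BO->isCommutative())
    return; // swapping a non-commutative op would change the result
  auto *L = dyn_cast<ConstantInt>(BO->getOperand(0));
  auto *R = dyn_cast<ConstantInt>(BO->getOperand(1));
  if (L && R && L->getValue().sgt(R->getValue()))
    BO->swapOperands();
}
```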
-
I wrote a very unambitious "pass" where the gimmick was to replace the constant integer value in every store instruction with a random integer value. So this pass actively hinders the program, but I didn't want to go and implement an actual optimization because I was tired from the past few lessons, so I just wanted to get it over with. It was still valuable though - I spent a decent bit of time crawling through the docs and learning some parts of LLVM. Some things were unintuitive - for example, I thought that assignment operators would be binary expressions (the operator being "="), but instead they were under "memory ops", which makes sense when you think about it in terms of machine instructions and lower-level stuff, but from a high-level perspective was weird (since in the expression "x = y = z = 3", to make the AST you would treat "=" as a binary operator, at least in my previous implementations of it). And then I thought that the assignment instruction was load instead of store, which was another rabbit hole. I "tested" this pass by writing solutions to some coding problems, and sure enough the programs all broke - some just instantly terminated, some printed a bunch of random garbage, others segfaulted or infinite-looped. I also went through the IR and checked that random values were getting stored at every store instruction, which was the case. One observation I had when playing around with LLVM is that (at least without optimizations) LLVM rarely has phi instructions, does it? I wasn't able to find any in the short programs I generated (which had loops and were decently complicated programs with data structures and stuff). Instead, they had a lot of store instructions, which at first I thought was mutating state, but it seems like it's their way of avoiding phi nodes, by storing to memory right before the merge. Is that true?
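The gimmick in sketch form (the random source and types are illustrative):

```cpp
#include <cstdlib>
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Replace each constant integer being stored with a random constant.
static void scrambleStores(BasicBlock &BB) {
  for (Instruction &I : BB)
    if (auto *ST = dyn_cast<StoreInst>(&I))
      if (auto *C = dyn_cast<ConstantInt>(ST->getValueOperand()))
        ST->setOperand(0, ConstantInt::get(C->getType(), std::rand()));
}
```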
-
Share your experiences getting started with writing an LLVM pass here!