-
Getting Started with LLVM (Lesson 7)

To get started with LLVM I implemented a simple LLVM pass that goes over each instruction in a program and checks whether the instruction is commutative. If so, it swaps the order of the instruction's operands (a minimal sketch appears below). The pass is in the linked repository.

Testing

I implemented a simple C program to check the effect of the pass. First, build the pass.
Test the pass
Optionally, build the processed LLVM IR and compare the output.
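A minimal sketch of what the core of such a pass can look like under the new pass manager (an illustration, not the author's exact code; the struct name is made up):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

struct SwapCommutativePass : PassInfoMixin<SwapCommutativePass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    bool Changed = false;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (auto *BO = dyn_cast<BinaryOperator>(&I))
          if (BO->isCommutative()) { // add, mul, and, or, xor, fadd, fmul, ...
            BO->swapOperands();      // a + b  ==>  b + a (semantics preserved)
            Changed = true;
          }
    return Changed ? PreservedAnalyses::none() : PreservedAnalyses::all();
  }
};
```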
Experience

I found the LLVM infrastructure to be pretty good, other than installing it, of course. Once I figured out the …
-
Summary

My code is here: llvm. I implemented a simplified version of what inlining would look like. To make it simpler to implement, I limited the function being inlined into to be main, and the functions being inlined could only be 1 basic block long with exactly 1 return instruction. Further, the functions to inline cannot contain calls to any other functions. All these conditions trivialize the pass I have written; to make it more robust, one would want to handle functions that have multiple basic blocks, many exit points, and possibly calls to other functions, as long as those calls do not form a cycle in the call graph. I did this using a module pass, where I first read over all functions and check whether each is main or not. For functions that are not main, I check whether they satisfy the conditions mentioned above. For those that do, I inline the function into main by setting up a builder and copying instructions one by one (a sketch of this step appears at the end of this post). I make sure to remap the function's arguments to the corresponding values from main, and I likewise remap the return value so that its uses refer to the correct values in main.

Testing

I tried the pass on a simple test case, where the function to be inlined was a simple addition function that adds 2 integers and returns the sum. This addition function can be seen in …. I then tried a real-world program, specifically Conway's Game of Life. I use an implementation from https://rosettacode.org/wiki/Conway%27s_Game_of_Life#C, and all credit should go to this source.

Challenges

LLVM was challenging to work with. Reading the LLVM documentation was tricky, as I often had to search various LLVM doxygen pages until I found the information I needed. In terms of using LLVM, I ran into several troubles. First, I realized late on that I needed to clone instructions rather than reuse them. Next, I also realized I had to insert instructions after iterating, to avoid invalidating the iterator. I often forgot to check for terminator instructions, which initially caused some bugs. Removing instructions is also difficult, as one needs to remove all uses of an instruction before it can be deleted. Finally, figuring out how to set up a module pass was tricky: I searched online, found a Stack Overflow post, and that post finally led me back to Issue #7 on the llvm-for-grad-students repository.
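A condensed sketch of the per-call-site work described above, under the same restrictions (single-block callee with exactly one ret and no inner calls); all names are illustrative, not the author's exact code:

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Inline one call to a single-block, single-return callee.
static void inlineSingleBlockCall(CallInst *Call) {
  Function *Callee = Call->getCalledFunction();
  IRBuilder<> Builder(Call); // cloned instructions go just before the call
  DenseMap<Value *, Value *> VMap;

  // Map each formal parameter to the actual argument at this call site.
  unsigned Idx = 0;
  for (Argument &A : Callee->args())
    VMap[&A] = Call->getArgOperand(Idx++);

  Value *RetVal = nullptr;
  for (Instruction &I : Callee->getEntryBlock()) {
    if (auto *Ret = dyn_cast<ReturnInst>(&I)) {
      if (Value *RV = Ret->getReturnValue()) {
        Value *Mapped = VMap.lookup(RV);
        RetVal = Mapped ? Mapped : RV; // remap the returned value too
      }
      break;
    }
    Instruction *Clone = I.clone(); // clone; never reuse the callee's body
    for (unsigned Op = 0; Op < Clone->getNumOperands(); ++Op)
      if (Value *Mapped = VMap.lookup(Clone->getOperand(Op)))
        Clone->setOperand(Op, Mapped);
    Builder.Insert(Clone);
    VMap[&I] = Clone; // later instructions may refer to this one
  }

  if (RetVal)
    Call->replaceAllUsesWith(RetVal); // uses of the call see the return value
  Call->eraseFromParent();
}
```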
-
Code here.

Summary

I decided to use LLVM to instrument a program to generate a memory trace. I have previously done this using Intel PIN as a TA for CS 3410, as we needed a realistic multithreaded memory trace. I followed the tutorial pretty closely, and used the documentation to look up how to identify load and store instructions. For each of these, I inserted a call to a logging function which I implemented in C; this function prints the thread ID, read/write, and the address (a sketch of the instrumentation step appears at the end of this post).

Testing

I ran the multithreaded matrix multiplication I used in 3410 through it, and it seemed to work fine. The original execution still worked, and the trace looked reasonable.

Experience

The experience of using LLVM was much more enjoyable and painless than using PIN (not to say PIN is bad, but it was nice to operate at a slightly higher level). That said, the memory trace is only accurate for non-register-allocated code, since LLVM hasn't yet gotten things off the stack and into registers. One challenge I faced was that LLVM is a little more strict than C about pointer types: to produce the trace, I wanted to print the actual memory address, which could be of any pointer type (I used …).
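The per-instruction step might look roughly like this, assuming a C runtime function `void log_access(void *addr, int is_write)` (the real logger also records the thread ID; the function name is made up). The bitcast to i8* is one way around the pointer-type strictness mentioned above, in the LLVM 14-era typed-pointer API:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Insert a call to the C logger before every load and store.
static void instrumentMemOps(Function &F) {
  Module *M = F.getParent();
  LLVMContext &Ctx = M->getContext();
  FunctionCallee LogFn =
      M->getOrInsertFunction("log_access", Type::getVoidTy(Ctx),
                             Type::getInt8PtrTy(Ctx), Type::getInt32Ty(Ctx));

  for (Instruction &I : instructions(F)) {
    Value *Ptr = nullptr;
    int IsWrite = 0;
    if (auto *LD = dyn_cast<LoadInst>(&I)) {
      Ptr = LD->getPointerOperand();
    } else if (auto *ST = dyn_cast<StoreInst>(&I)) {
      Ptr = ST->getPointerOperand();
      IsWrite = 1;
    }
    if (!Ptr)
      continue;
    IRBuilder<> B(&I); // the logging call goes just before the access
    Value *Addr = B.CreateBitCast(Ptr, Type::getInt8PtrTy(Ctx));
    B.CreateCall(LogFn, {Addr, B.getInt32(IsWrite)});
  }
}
```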
-
Summary

My code can be found here. I basically implemented a pass that deoptimizes integer multiplication into repeated addition.

Walkthrough of implementation: traverse through all blocks; if a block contains an eligible multiply, … (a sketch of the core rewrite follows).
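A sketch of what the core rewrite can look like (the constant bound is illustrative, and erasure is deferred so the iterator over the block stays valid):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Rewrite x * n (n a small positive constant) into x + x + ... + x.
static void demoteMulsToAdds(BasicBlock &BB) {
  SmallVector<Instruction *, 8> ToErase;
  for (Instruction &I : BB) {
    auto *BO = dyn_cast<BinaryOperator>(&I);
    if (!BO || BO->getOpcode() != Instruction::Mul)
      continue;
    auto *C = dyn_cast<ConstantInt>(BO->getOperand(1)); // constant usually on RHS
    if (!C || C->getValue().isNonPositive() || C->getValue().ugt(8))
      continue; // small positive constants only, to keep the chain short
    uint64_t N = C->getZExtValue();
    IRBuilder<> B(BO);
    Value *X = BO->getOperand(0);
    Value *Sum = X;
    for (uint64_t i = 1; i < N; ++i)
      Sum = B.CreateAdd(Sum, X);
    BO->replaceAllUsesWith(Sum);
    ToErase.push_back(BO); // defer erasure; we are iterating over the block
  }
  for (Instruction *I : ToErase)
    I->eraseFromParent();
}
```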
Why deoptimize?

Not many good reasons. Maybe some target instruction set doesn't have integer multiplication? It can also be used for testing certain optimizations, especially ones which might turn repeated addition into multiplication...

Testing

I found a matrix multiplication program ….

Experience

LLVM and Clang were rather surprisingly easy to set up. I just downloaded them from my OS's package manager (pacman on Manjaro Linux), and voila, they basically worked out of the box, besides stderr not showing up at first. One thing that was actually quite helpful was the type system involved in LLVM: my editor would complain frequently about my code when it was wrong. One thing that was difficult was figuring out what a function does from its name alone. For example, …
-
My code is here. I went for the unambitious route: I print a message every time I encounter a load instruction! I modified Skeleton.cpp as needed, wrote a simple runtime library that is only compiled once, and generated a few example cases. You may notice an attempt to use Turnt; this was eventually scaled down, and Turnt is now just used to generate .o files for all the example cases in one go. Sadly, it is still necessary to link the .o files and run the executables by hand. You may also notice that my examples have a float/int division slant. Indeed, my initial goal was to print one thing for integer divisions and another for float divisions. That doesn't seem all that hard, and I'm sure I will be shown how easy it is, but I did get a bit lost in the LLVM jungle: I found it tricky to eke out what kind of operands a BinaryOperator was working with. In any case, I ended up going with …
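For reference, one route is to look at the opcode rather than the operands, since integer and float division are distinct opcodes in LLVM IR (a sketch, not necessarily the approach the pass ended up with):

```cpp
#include "llvm/IR/Instruction.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// No operand inspection needed: the opcode already distinguishes the two.
static void classifyDivision(const Instruction &I) {
  switch (I.getOpcode()) {
  case Instruction::SDiv: // signed integer division
  case Instruction::UDiv: // unsigned integer division
    errs() << "integer division\n";
    break;
  case Instruction::FDiv: // floating-point division
    errs() << "float division\n";
    break;
  default:
    break;
  }
}
```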
-
Code is here.

Implementation

My (unambitious) implementation involved inserting a print statement saying "I saw a plus!" after every integer add operation. I did this by attempting to cast each instruction to BinaryOperator, and if that was successful, I checked whether the instruction's opcode was equal to 13. If it was, I inserted a call to a function that prints "I saw a plus!" (in the same way we did with rtlib.c in lecture).

Testing

I tested my implementation on a merge sort example, which seemed to work correctly.
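A sketch of the same check using the named opcode constant (13 happens to be Add's numeric value in this LLVM version, but Instruction::Add is stable across versions); the runtime function name is illustrative:

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Instrument every integer add with a call to a runtime print function.
static void instrumentAdds(Function &F) {
  Module *M = F.getParent();
  FunctionCallee SawPlus =
      M->getOrInsertFunction("saw_plus", Type::getVoidTy(M->getContext()));
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *BO = dyn_cast<BinaryOperator>(&I))
        if (BO->getOpcode() == Instruction::Add) { // named constant, not 13
          IRBuilder<> B(BO->getNextNode()); // insert the call after the add
          B.CreateCall(SawPlus);            // the call is not a BinaryOperator,
        }                                   // so the loop won't re-trigger on it
}
```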
-
My LLVM pass is here.

Summary

I implemented a simple pass that converts multiplies by a power-of-2 constant into a shift left. On some hardware a shift left is significantly faster than a multiply. Implementing this pass was pretty straightforward, but required using some LLVM utilities to determine whether the multiply included a constant power of 2.

Evaluation

I wrote a simple C program that multiplies the numbers from 0 to 1024 by 8 to confirm that my optimization works. Unfortunately, even when increasing the number of multiplies, I wasn't able to see a noticeable difference in performance with my optimization. I believe this comes down to the specific details of the Intel x86 implementation, as well as optimizations that occur during LLVM code generation. To help limit the optimization during LLVM code generation, I changed the multiply to a non-power of 2 and still could not notice any improvement in overall performance. One possibility is that the performance hit of multiplication was hidden by the loop instructions because of how out-of-order processors work.

Experience

The most challenging part of this experience was getting a working combination of LLVM and Clang. It appears that the default versions of LLVM and Clang on Arch Linux have slightly different build options that prevented the skeleton code from working. I ended up having to build and install LLVM and Clang from scratch, which put a pretty severe load on my laptop. I was able to limit that by building in release mode, which significantly reduces the amount of memory needed during linking.
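The core of such a rewrite can be expressed with LLVM's APInt helpers; a sketch (not the author's exact code, with erasure deferred to keep iteration safe):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Rewrite x * (1 << k) into x << k.
static void strengthReduceMuls(BasicBlock &BB) {
  SmallVector<Instruction *, 8> ToErase;
  for (Instruction &I : BB) {
    auto *BO = dyn_cast<BinaryOperator>(&I);
    if (!BO || BO->getOpcode() != Instruction::Mul)
      continue;
    auto *C = dyn_cast<ConstantInt>(BO->getOperand(1)); // constant usually on RHS
    if (!C || !C->getValue().isPowerOf2())
      continue;
    IRBuilder<> B(BO);
    Value *Shl = B.CreateShl(BO->getOperand(0), C->getValue().logBase2());
    BO->replaceAllUsesWith(Shl);
    ToErase.push_back(BO);
  }
  for (Instruction *I : ToErase)
    I->eraseFromParent();
}
```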
-
Flip Pass

Intro

This is my fork of llvm-pass-skeleton: https://github.com/gsvic/llvm-pass-skeleton/tree/vg292/lesson-7. My code can be found in the flippass directory.

Details

I implemented a very simple pass that flips the operator in a comparison instruction (…).

Test

Given a very simple program like …
In this case, without the flip-pass the output should be …. Run …
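A minimal sketch of one way to implement such a flip, by rewriting each integer comparison to its inverse predicate; the actual pass may flip differently (e.g. by swapping operands):

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Rewrite every integer comparison to its inverse predicate,
// e.g. slt becomes sge, eq becomes ne.
static void flipComparisons(Function &F) {
  for (Instruction &I : instructions(F))
    if (auto *Cmp = dyn_cast<ICmpInst>(&I))
      Cmp->setPredicate(Cmp->getInversePredicate());
}
```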
-
In this task, I implemented a memory access analysis pass for nested loops. I compiled and ran my pass using the latest LLVM 14 branch with the new pass manager. My code can be found here.

Background

When C/C++ programs are lowered to LLVM IR, they lose the high-level structure of loops and memory accesses. Loops are translated into several basic blocks (preheader, body, latch, etc.), and a memory access like A[i + 1] is expanded into a sequence of instructions:

```
%12 = load i32, i32* %i, align 4
%add13 = add nsw i32 %12, 1
%idxprom14 = sext i32 %add13 to i64
%arrayidx15 = getelementptr inbounds [10 x i32], [10 x i32]* %A, i64 0, i64 %idxprom14
%13 = load i32, i32* %arrayidx15, align 4
```

Implementation

To find the loops in LLVM IR, I reuse LLVM's built-in loop analysis. I created an instruction visitor class that walks the load and store instructions inside each loop (a minimal sketch of this traversal appears at the end of this post).

Test

To run my pass on real programs, I first use clang to emit the LLVM IR, and then run my pass over it. The following test program shows several different memory access patterns. Currently, I only consider 1D arrays, so higher-dimensional arrays must first be flattened into 1D.

```c
int main() {
int A[10], B[10], C[10];
// simple
for (int i = 0; i < 10; ++i) {
C[i] = A[i] + B[i];
}
// 1D stencil
for (int j = 1; j < 9; ++j) {
B[j] = A[j - 1] + A[j] + A[j + 1];
}
// complex indices
for (int k = 1; k < 3; ++k) {
C[k * 2] = A[k * 3 + 1] + B[k / 2];
}
// 2D: matmul
for (int i = 0; i < 3; ++i)
for (int j = 0; j < 3; ++j)
for (int k = 0; k < 3; ++k)
C[i * 3 + j] = A[i * 3 + k] * B[k * 3 + j];
return 0;
}
```

The output of my pass is shown below.

```
Loop 0:
Load: %28 = load i32, i32* %arrayidx52, align 4
Original: A[i37*3+k45]
Load: %31 = load i32, i32* %arrayidx56, align 4
Original: B[k45*3+j41]
Store: store i32 %mul57, i32* %arrayidx61, align 4
Original: C[i37*3+j41]
Loop 1:
Load: %18 = load i32, i32* %arrayidx27, align 4
Original: A[k*3+1]
Load: %20 = load i32, i32* %arrayidx29, align 4
Original: B[k/2]
Store: store i32 %add30, i32* %arrayidx33, align 4
Original: C[k*2]
Loop 2:
Load: %9 = load i32, i32* %arrayidx9, align 4
Original: A[j-1]
Load: %11 = load i32, i32* %arrayidx11, align 4
Original: A[j]
Load: %13 = load i32, i32* %arrayidx15, align 4
Original: A[j+1]
Store: store i32 %add16, i32* %arrayidx18, align 4
Original: B[j]
Loop 3:
Load: %2 = load i32, i32* %arrayidx, align 4
Original: A[i]
Load: %4 = load i32, i32* %arrayidx2, align 4
Original: B[i]
Store: store i32 %add, i32* %arrayidx4, align 4
  Original: C[i]
```

We can see my pass exactly captures the four loops in the program (in reverse order), and recovers the original arrays and indices for the load/store instructions, even for complex indices and multi-dimensional nested loops. Notice that some of the variables are renamed by LLVM, so the output contains names like i37, j41, and k45.

Discussions

Though I finally made my pass work, I still have to say the documentation of LLVM is reeeeally terrible. In the very beginning, I thought the tutorial page would reflect the latest changes. However, it didn't, and was even somewhat misleading. After searching on Google, I could only find out how to get the new pass manager to work from a discussion thread! Moreover, even though the doxygen provides the class method signatures, it is still hard to understand what the methods are used for and how to use them. Actually, I found that pulling the whole llvm-project and directly searching for methods offline may be more helpful: looking at real code snippets, with context before and after the function calls, is much easier to understand than the documentation alone.

Since I've been working on an MLIR project for several months, I can see lots of similarities between MLIR and LLVM, including how they traverse the function body and what the facilities look like. But I think the biggest difference between them may be the abstraction level. While LLVM IR is low-level and close to real machines, many MLIR dialects are high-level and provide more facilities to operate on loops and memory accesses. At first, I wanted to do some real loop optimizations in LLVM, but I found it was hard to print out readable code for loops, and I could not easily convert the IR back to C/C++. With the affine/scf dialects in MLIR, however, we can dump the for loops in a human-readable, C-like way, and it is also more intuitive to view and access the loops when doing optimizations like interchange or fusion. After some programming practice in LLVM and MLIR, I now understand the rationale behind the several layers of IR abstraction in today's deep learning compilers: some machine-dependent optimizations are easier to express at a low level of abstraction, while loop and memory optimizations are easier to implement with a high-level IR, so progressive lowering helps realize each optimization at a suitable abstraction level while gradually adding hardware details to the program.
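As promised above, a minimal sketch of the loop walk using the new pass manager's LoopAnalysis (LLVM 14-era API); the real pass additionally recovers the arrays and index expressions from the getelementptr chains:

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

struct MemAccessPrinter : PassInfoMixin<MemAccessPrinter> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &FAM) {
    auto &LI = FAM.getResult<LoopAnalysis>(F);
    unsigned N = 0;
    for (Loop *L : LI) { // top-level loops, in LoopInfo's order
      errs() << "Loop " << N++ << ":\n";
      for (BasicBlock *BB : L->blocks())
        for (Instruction &I : *BB)
          if (isa<LoadInst>(I) || isa<StoreInst>(I))
            errs() << "  " << I << "\n";
    }
    return PreservedAnalyses::all();
  }
};
```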
-
LLVM Pass: Global Value Numbering (GVN)

I implemented a Common Subexpression Elimination (CSE) pass using GVN, following this course blog. To run the GVN pass: …
The IR after the pass is printed in the console: …
Notice that the redundant computation has been eliminated.

Testing

I developed my own test suite, and use LLVM's lit and FileCheck as testing tools. To run the test suite: lit tests

In the tests, I use FileCheck directives to check that the common subexpressions are indeed eliminated. For example,

```
// CHECK: add nsw i32 %{{[0-9]+}}, 2
// CHECK-NOT: add nsw i32
// CHECK: store i32 %{{[0-9]+}}, i32* %{{[0-9]+}}
```

The above checks state that there should not be another add instruction between the first matched add and the store.
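For reference, the core of a local CSE over binary operators can be keyed on (opcode, operands); this is only a sketch, not the GVN structure from the course blog:

```cpp
#include <map>
#include <tuple>
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Local CSE over binary operators, keyed on (opcode, lhs, rhs). Because earlier
// replacements rewrite the uses, later duplicates map to the same key.
static void localCSE(BasicBlock &BB) {
  std::map<std::tuple<unsigned, Value *, Value *>, Value *> Seen;
  SmallVector<Instruction *, 8> ToErase;
  for (Instruction &I : BB) {
    auto *BO = dyn_cast<BinaryOperator>(&I);
    if (!BO)
      continue;
    auto Key =
        std::make_tuple(BO->getOpcode(), BO->getOperand(0), BO->getOperand(1));
    auto It = Seen.find(Key);
    if (It != Seen.end()) {
      BO->replaceAllUsesWith(It->second); // reuse the earlier computation
      ToErase.push_back(BO);
    } else {
      Seen.emplace(Key, BO);
    }
  }
  for (Instruction *I : ToErase)
    I->eraseFromParent();
}
```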
-
My code is here. I implemented a pass that inserts yielding points into busy-waiting loops in C programs.

Background

Yielding points allow the OS to re-schedule threads instead of keeping them running forever, which is very useful when implementing busy-waiting loops. Consider the following code: …
In this form, the loop will run until it sees the value it is waiting for, occupying its CPU the whole time. Now compare it with the same loop after a yield() call is inserted into its body: …
In the second case, our busy-waiting loop will NOT take the whole CPU time and will allow some other threads to get scheduled on that CPU while waiting. ImplementationMy pass is very simple. It works as follows:
Since I'm not sure how to insert calls to arbitrary functions, I wrote a helper function that implement the yield() functionality and in my LLVM pass, I'm looking for that function to get the right reference. Not the best way, definitely there should be better ways to get there. EvaluationFor evaluation, I use this example program. It runs 4 busy-waiting threads and one benchmarking thread that measures its execution time. If we run this benchmark in a constrained environment (say by only giving 4 CPU cores to it with Example of the pass run (the code dump is the new busy-waiting loop with yielding() points:
Some measurements:
|
-
Pong Instrumentation

Sorry to be late! My code is here.

Implementation

Since the assignment said to try to run it on real C code, I found an implementation of pong to use, since it's a relatively simple program where it is easy-ish for me to tell that it is still working. I needed to install a graphics library called SDL for this pong implementation to work. After looking through the generated LLVM IR instructions, it seemed like there were a bunch of calls to SDL, so I decided that a relatively ecologically valid task would be to try to measure how long we spend in SDL calls. I followed the tutorial through the rtlib section so that I was able to instrument the code by adding a function that logs the start of a call to SDL, and one that logs the end (a rough sketch of this bracketing step appears at the end of this post). The start function just records when we started; the end function records when we finished, adds the difference to a global counter, and then every quarter second of CPU time prints out where we are. I didn't think much about how to exclude my instrumentation logic from the timing, so if it's adding a lot of overhead then the estimates are off. Also, I think I'm measuring CPU time instead of wall time, but I'm not entirely sure.

Testing

Pong seems to still run without crashing, and the code gives results like this: …
Challenges

Oof. There were challenges. The vast majority of my time on this project was spent trying to make an LLVM pass that modified the output at all, with pretty smooth sailing after that worked. The two big challenges were: …
My two big lessons were: …
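Roughly the bracketing described in the Implementation section, assuming C helpers `void sdl_begin(void)` and `void sdl_end(void)` (both names illustrative):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Wrap every call whose callee name starts with "SDL_" in begin/end hooks.
static void bracketSDLCalls(Function &F) {
  Module *M = F.getParent();
  LLVMContext &Ctx = M->getContext();
  FunctionCallee Begin = M->getOrInsertFunction("sdl_begin", Type::getVoidTy(Ctx));
  FunctionCallee End = M->getOrInsertFunction("sdl_end", Type::getVoidTy(Ctx));

  SmallVector<CallInst *, 16> SDLCalls;
  for (Instruction &I : instructions(F))
    if (auto *Call = dyn_cast<CallInst>(&I)) {
      Function *Callee = Call->getCalledFunction();
      if (Callee && Callee->getName().startswith("SDL_"))
        SDLCalls.push_back(Call); // collect first, mutate afterwards
    }

  for (CallInst *Call : SDLCalls) {
    IRBuilder<> B(Call);
    B.CreateCall(Begin);                   // timestamp just before the SDL call
    B.SetInsertPoint(Call->getNextNode()); // and just after it
    B.CreateCall(End);
  }
}
```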
-
My implementation is here. Honestly, there were a good number of failures with this one, but I'll get to that later. Installing clang was fairly straightforward. I actually use a Docker container running Debian for this class, so installation was even more straightforward, as I don't have much on the container besides Bril-specific utilities. I wanted to make a pass that adds together constants, so I started hacking this solution together before really checking what example programs would emit in LLVM IR. My solution was to check if an add operation had two constant operands, and add them together. What I failed to realize was that the IR actually looks like this for a simple sum of two constants: …
This is a problem: even though I'm adding two constants at the source level, the frontend emits stores of 0 and 82 to stack slots, followed by loads and an add, so the add's operands are never the constants I was looking for. I also installed perf to check wall-clock time, but since no optimizations are actually being made, there was not much difference. The end result is an unambitious "print something if two numbers are added" pass; in my case, I'm just printing the instruction. I'm wondering if anyone else ran into this problem?
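For illustration, the kind of fold being attempted looks like this sketch (not the author's exact code); on -O0 output it never fires, because the bail-out in the middle always triggers:

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Fold `add C1, C2` when both operands really are constants.
static void foldConstantAdd(BinaryOperator *BO) {
  if (BO->getOpcode() != Instruction::Add)
    return;
  auto *L = dyn_cast<ConstantInt>(BO->getOperand(0));
  auto *R = dyn_cast<ConstantInt>(BO->getOperand(1));
  if (!L || !R)
    return; // at -O0 the operands are loads of stack slots, so we bail here
  Constant *Sum =
      ConstantInt::get(BO->getContext(), L->getValue() + R->getValue());
  BO->replaceAllUsesWith(Sum);
  BO->eraseFromParent();
}
```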
-
My implementation is here. I took a similar approach to @orkosinha, and set out to implement constant folding on multiplication instructions. Okay, I actually set out to implement constant folding on add instructions as well, but discovered there would be slightly fewer lines of code involved if I went with multiplication only. I ran into similar problems: intuitively, I expected constants to be represented as their own objects (similar to bril), but the intermediate representation actually uses stores and loads when dealing with variables that are assigned constants. In the following example, we store 15 somewhere and then load it into a new "register" so that we can multiply it by 1: …
Additionally, I realized that LLVM already performs some constant folding for us! So an expression like 3 * 5
is represented in the IR exactly like the instructions above; LLVM (or clang?) computed 3 * 5 = 15 before we begin processing the code. However, it does not perform constant propagation, so this is what I set out to do. My local propagation pass keeps track of each store instruction that stores a constant. If a load instruction loads from a location where a constant was last stored, I replace every use of the load instruction with the constant. I keep track of which pointers contain constants using a map (a sketch of this appears at the end of this post). To actually perform constant folding on mul instructions, …. Manually inspecting the IR before and after my pass confirms that it is indeed replacing uses of load instructions with the appropriate constants, but my implementation leaves you with a lot of dead code. First, I ran into an issue where this for loop …
leaves me with an infinite loop. I have no idea why (most likely because the loop was mutating the instruction list while iterating over it, which invalidates the iterator); replacing it with … fixed it. Additionally, …
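A sketch of the local store-to-load propagation described above (one basic block at a time, and assuming loads and stores through a pointer use the same type, as in the -O0 code here):

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Forward the last constant stored through each pointer to later loads in the
// same block. Calls and non-constant stores invalidate the recorded facts.
static void propagateLocalConstants(BasicBlock &BB) {
  DenseMap<Value *, ConstantInt *> LastConst;
  for (Instruction &I : BB) {
    if (auto *ST = dyn_cast<StoreInst>(&I)) {
      if (auto *C = dyn_cast<ConstantInt>(ST->getValueOperand()))
        LastConst[ST->getPointerOperand()] = C;
      else
        LastConst.erase(ST->getPointerOperand());
    } else if (auto *LD = dyn_cast<LoadInst>(&I)) {
      auto It = LastConst.find(LD->getPointerOperand());
      if (It != LastConst.end())
        LD->replaceAllUsesWith(It->second); // the load is now dead code
    } else if (isa<CallInst>(&I)) {
      LastConst.clear(); // a call may write memory we are tracking
    }
  }
}
```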
-
My code is here.

Implement a Pass

I chose to implement an unambitious pass: it finds integer division instructions and transforms them. This was achieved by using …. After building the pass, use the following command to run it: …
When compiling, the pass would print …; when executing, we can see the result of ….
Output: …
Testing

I just wrote another simple program …. The running command is the same.
This will print the Integer Division Instruction and the transformed instruction.
Discussion

I was planning to transform floating-point division instructions into floating-point multiplication instructions (…) too. I tried, but didn't find a function like … to do it.
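For what it's worth, one possible route to the fdiv-to-fmul rewrite is to build the reciprocal with APFloat when the divisor is a constant. This is only a sketch: the rewrite is not exact for all values, which is why real compilers guard it behind fast-math flags:

```cpp
#include "llvm/ADT/APFloat.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Rewrite x / C into x * (1 / C) when the divisor is a float constant.
static void divToMul(BinaryOperator *BO) {
  if (BO->getOpcode() != Instruction::FDiv)
    return;
  auto *C = dyn_cast<ConstantFP>(BO->getOperand(1));
  if (!C)
    return;
  APFloat Recip(C->getValueAPF().getSemantics(), 1); // 1.0 in the same format
  Recip.divide(C->getValueAPF(), APFloat::rmNearestTiesToEven);
  IRBuilder<> B(BO);
  Value *Mul =
      B.CreateFMul(BO->getOperand(0), ConstantFP::get(B.getContext(), Recip));
  BO->replaceAllUsesWith(Mul);
  BO->eraseFromParent();
}
```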
-
Introduction

I decided to implement the path profiling algorithm presented in the Ball paper in LLVM. This ended up being a much larger project than I expected; I didn't realize how involved the algorithm to construct the … would be. Because this was an LLVM-focused project, I spent most of my time on the LLVM-related parts. As a result, I did not implement the ….

Implementation

The project broke down into these main parts: assigning values to the edges, getting the chords of the maximum spanning tree, assigning increments to the chords, initializing instruments, placing instruments, and dumping instruments.

Assigning Values to the CFG Edges

This part of the algorithm assigns values to CFG edges based on the number of paths through them.
This part of the algorithm required me to understand how LLVM CFGs work. At first I did not use the …. This part of the algorithm also required me to implement a reverse topological ordering of the basic blocks. In Bril, we could identify blocks by name; in LLVM, it seemed easiest to identify …
Testing Assigning Values to the CFG Edges

In order to test this, I constructed some simple C++ programs that had different CFG structures and wrote to ….

Get Chords

This part of the project required me to add dummy edges from the exits to the entry. In the paper, there was only one exit, but LLVM programs can return from multiple program points. I wondered whether I should add an edge from each exit to the entry, or have all exits go to a common exit and that common exit to the entry. After considering it for a while, I believe either method would work, but the second approach is redundant: since the dummy edges carry no edge weight when calculating the maximum spanning tree, they have a de facto weight of 0, so either approach has no impact on the max spanning tree. I opted for the first approach. Then I negated all edges, ran Kruskal's, got the minimum spanning tree back, and negated the edges back to normal. I implemented all parts of Kruskal's using a naive "check all edges for a cycle and add if no cycle" strategy. I did not complete my implementation of the cycle check, but I believe the Kruskal's algorithm is correct once that is implemented. Next, to find the chords, I iterated over all edges and added them to a chord list if they were not part of the max spanning tree.

Assigning Increments

The next part was to assign increments. Ball presents a worklist-style algorithm, which I implemented. However, the part of this algorithm that actually assigns the increments comes from another paper of his. I skimmed that paper but did not implement it in my code due to time; instead, I opted to make up fake increment values.

Initializing Increments

Now we get to the fun LLVM-related parts of this project; from here on is where I spent most of my time. This part requires us to initialize an array of counters:
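A sketch of that initialization as a zero-initialized module-level global (the element type, size, and name are illustrative):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Create a zero-initialized global array with one counter per path.
static GlobalVariable *createPathCounters(Module &M, uint64_t NumPaths) {
  LLVMContext &Ctx = M.getContext();
  ArrayType *CounterTy = ArrayType::get(Type::getInt64Ty(Ctx), NumPaths);
  return new GlobalVariable(M, CounterTy, /*isConstant=*/false,
                            GlobalValue::InternalLinkage,
                            ConstantAggregateZero::get(CounterTy),
                            "path.counters");
}
```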
Although this function is short and concise, I spent a lot of time constructing it. First I used this documentation to create an …. The next interesting part of inserting this was figuring out where to insert. I used the ….

Placing Instruments

The next step was to place the actual instrumentation. The Ball algorithm increments along an edge, but LLVM instructions occur within basic blocks. At first I couldn't figure out how to create a phi node -- the constructor did not take in all the information required to create one. Specifically, it did not take in the blocks and values to use. This seems like a poor design choice by the LLVM creators: creating a phi node with some but not all of the required information and then calling methods to fill in the rest felt wrong. Perhaps a …. Another thing that gave me trouble when creating phi nodes was where to get the INC values from. Did I need to create an instruction in the previous block that put the inc value into a register? I ended up being able to pass a … (a sketch of the two-step phi construction appears at the end of this post).

Dump Values

The last step was to dump the increment values. I did not get to implementing this, but I assume I would write some C++ code and have LLVM insert calls to it, similar to how the blog uses ….

Reflection

Overall, I was a little bit upset with how much I got this to "work". It really turned out to be a much bigger project than I expected, and I can't discuss or present any evaluation since I did not implement all the parts. However, as someone who will be an LLVM contributor next year, I found this project to be extremely valuable. I've seen some of the things that are easy in LLVM, some of the things that are hard, and have had an opportunity to think about what could be improved. I hope I can take this experience with me and do something with what I've learned from this project.
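For reference, the two-step phi construction discussed above (create first, then register each incoming value/block pair; all names and values illustrative):

```cpp
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Create the phi first, then register each incoming (value, predecessor) pair.
static PHINode *makeIncrementPhi(BasicBlock *Header, BasicBlock *PredA,
                                 uint64_t IncA, BasicBlock *PredB,
                                 uint64_t IncB) {
  Type *I64 = Type::getInt64Ty(Header->getContext());
  PHINode *Phi = PHINode::Create(I64, /*NumReservedValues=*/2, "path.inc",
                                 &*Header->begin()); // phis must lead the block
  Phi->addIncoming(ConstantInt::get(I64, IncA), PredA);
  Phi->addIncoming(ConstantInt::get(I64, IncB), PredB);
  return Phi;
}
```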
-
My code is here. Sorry for being super late. I had a more ambitious target of building a generalized loop unrolling algorithm by finding the greatest common predicate, but somehow I couldn't get the algorithm to work properly. So instead, for this assignment I wrote a very unambitious pass which tries to canonicalize binary computations by simply swapping the two arguments if the first argument is larger than the second. It's probably not useful, since in these cases we could simply do constant folding instead. I tested my code on the PI solver and also some dummy .c programs in my repo, by manually observing the LLVM IR after running the pass. I was hoping that my loop unrolling would work properly so that I could have done some more interesting testing. The biggest obstacle I had was not even LLVM itself: I am working on macOS with an M1 chip, and I had some issues when building and running LLVM. I thought the hardware was the issue, but it turned out that I had duplicated libraries, and when I upgraded the Xcode tools it renamed some of the library links, so I got some subtle errors which didn't seem related to the actual problem.
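Under one reading of "larger" (both operands constant, and the first compares signed-greater), the canonicalization might look like this sketch:

```cpp
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Swap the operands of a commutative op when the first constant is larger.
static void canonicalizeOperands(BinaryOperator *BO) {
  if (!BO->isCommutative())
    return; // swapping a non-commutative op would change the result
  auto *L = dyn_cast<ConstantInt>(BO->getOperand(0));
  auto *R = dyn_cast<ConstantInt>(BO->getOperand(1));
  if (L && R && L->getValue().sgt(R->getValue()))
    BO->swapOperands();
}
```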
-
I wrote a very unambitious "pass" where the gimmick was to replace the constant integer value in every store instruction with a random integer value. So this pass actively hinders the program, but I didn't want to go and implement an actual optimization because I was tired from the past few lessons, so I just wanted to get it over with. It was still valuable though - I spent a decent bit of time crawling through the docs and learning some parts of LLVM. Some things were unintuitive - for example, I thought that assignment operators would be binary expressions (the operator being "="), but instead they were under "memory ops", which makes sense when you think about it in terms of machine instructions and lower-level stuff, but from a high-level perspective was weird (since in the expression "x = y = z = 3", to make the AST you would treat "=" as a binary operator, at least in my previous implementations of it). And then I thought that the assignment instruction was load instead of store, which was another rabbit hole. I "tested" this pass by writing solutions to some coding problems, and sure enough the programs all broke - some just instantly terminated, some printed a bunch of random garbage, others segfaulted or infinite-looped. I also went through the IR and checked that random values were getting stored at every store instruction, which was the case. One observation I had when playing around with LLVM is that (at least without optimizations) LLVM rarely has phi instructions, does it? I wasn't able to find any in the short programs I generated (which had loops and were decently complicated programs with data structures and stuff). Instead, they had a lot of store instructions, which at first I thought was mutating state, but it seems like it's their way of avoiding phi nodes, by storing to memory right before the merge. Is that true?
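The gimmick in sketch form (the random source and types are illustrative):

```cpp
#include <cstdlib>
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Replace each constant integer being stored with a random constant.
static void scrambleStores(BasicBlock &BB) {
  for (Instruction &I : BB)
    if (auto *ST = dyn_cast<StoreInst>(&I))
      if (auto *C = dyn_cast<ConstantInt>(ST->getValueOperand()))
        ST->setOperand(0, ConstantInt::get(C->getType(), std::rand()));
}
```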
-
Share your experiences getting started with writing an LLVM pass here!