You already know how to use LLVM, but how is LLVM utilized to compile usable code? First, a parser is ran to create an Abstract Syntax Tree (AST) from input code. With this representation, we can modify it as we please. In this assignment, you will extend the given AST to allow for a toy C implementation. Feel free to use this setup as inspiration for future projects.
- Open Powershell as administrator (you can do this by right-clicking PowerShell in your start menu, and selecting Run as administrator
- Run
wsl --install
in the command line interface. This will install the Ubuntu distribution of Linux. - Restart your computer.
- Open the app Ubuntu on Windows. You should be tasked with setting your username and password.
You will have to enter your password periodically when you run a
sudo
(abbreviation of "superuser do") command on linux. When you enter your password, nothing will appear on screen (this is called "blind typing" and is normal), but the machine is still reading your keystrokes. - You should regularly update and upgrade your packages by running the commands
sudo apt update
andsudo apt upgrade
. Do this now.
- Run
sudo apt install clang
. This installs the CLANG compilers for C and C++, which are the compilers that LLVM uses. - Run
sudo apt install make
. This installs the Make utility, which is used for directing compilation of projects using "makefiles". - Run
sudo apt install cmake
. This installs CMake utility, which is used to generate the build files.Build files are files that describe targets that own the source files. Source files are files that contain source code, like .cpp or .java files.
- Run
sudo apt install llvm-dev
. This installs the LLVM (Low-Level Virtual Machine) package as well as its development package which includes libraries and headers for llvm source code. - Run
sudo apt install flex
. This installs Flex (Fast Lexical Analyzer Generator), which is a program that generates lexical analyzers out of flex code. - Run
sudo apt install bison
. This installs Bison which is a program that generates parsers using the - Run
sudo apt install graphviz
. This installs the graph visualization software that we will use to visualize control flow and data flow of compiled programs.
- Outside the Ubuntu on Windows app (using a browser of your choice, on your Windows machine) download the zip file at
https://github.com/Gota7/LLVM-Lab/archive/refs/heads/main.zip
.The folder inside the zip file be named LLVM-Lab-main and should contain a several files, including a .gitignore, a README.md, a CMakeLists.txt and a src folder.
- Copy the folder inside the zip file onto your desktop.
- Open the Ubuntu on Windows app.
- We need to change the working directory of Ubuntu to the LLVM-Lab-main folder on your desktop. To do this, use the
cd
command ("Change Directory") to traverse your system's file tree.The working directory you start in when Ubuntu on Windows is launched is
~
, which is semantically the same as/home/yourwslusername
(this is the starting directory of Ubuntu on Windows, and it can be configured)...
refers to the current working directory's "parent" directory (which is the directory that contains the working directory), so Runningcd ..
(note the space in between thecd
and the..
) while in~
will set the working directory to/home
, executing an upwards traversal of the file tree. Do this once more, and you should be in the directory/
, which is the root directory of your file system. - The "C:" drive on Windows is organized in WSL as the directory
/mnt/c
, so runningcd /mnt/c
will bring you to your "C:" drive. - From there, use successive
cd
commands to navigate your file tree (which from here on out, can be seen in your Windows File Explorer).
- Run
mkdir build
. This creates a new folder in LLVM-Lab-main named build. - Run
cd build
. This sets the working directory to../LLVM-Lab-main/build
. - Run
cmake ..
. This uses the cmake to write the build files into the current directory, build. - Run
make -j8
. This uses the files in the build folder to build the c++ files in the src folder, and places the resulting files in a new, automatically created folder named bin. - Run
cd ../bin
. This sets the directory to the new bin folder. - Now, when you run
ls
you should see a single file named LLVM-Lab. - Run
./LLVM-Lab
to execute LLVM-Lab. This should create 2 more files in bin, which are testMod.bc and testMod.ll. Both of them are LLVM source code files (one is in assembly, and one is in bitcode), and each can be compiled into an executable program. - To execute one of these files, run
lli testMod.bc
. Congratulations, you have executed your first LLVM-interpreted program!
Every time you want to re-build and run the project, you must this section. This can be tedious, and for this reason we would like to present the opportunity to dip your toe in some shell scripting, to automate the process.
- Create a file named run.sh in LLVM-Lab-main. You can do this by creating an empty text (.txt) file in, and renaming it to run.sh.
- Edit the file using a text editor. Type the command you would like to run, in order.
- To run the bash script, run
./run.sh
. This will execute each command in order, so you don't have to do it manually.
- In the bin directory, run
opt -dot-cfg-only testMod.bc
. This will create a .main.dot file, which contains a text representation of testMod's CFG. - Run
dot .main.dot -Tsvg -o testMod.svg
. This allows graphviz to interpret .main.dot into an image file named testMod.svg, which can be opened via the browser.
Your task is to make some modifications to the AST code. If you need help, you can always look at the guide section below for an overview on the design of the AST system. Also note that you are free to create test cases by making C files in the tests
folder and running the ./executeTests.sh
file. Their output compiling log, execution log, and LLVM assembly will be generated.
The addition expression is added, but not subtraction. Shouldn't be too different right? Remember that there is a difference between integer and floating point operations. This hint will apply to the other operators.
We have code for converting booleans to integers, but the header file is missing. Could you make one for it? Make sure to make a static Create
method for it like the other expressions have.
It would only work on LLVM@17 or ealier. As the LLVM@18 removed IRBuilderBase::getInt8PtrTy here.
In C, when we have an integer expression we can use it as a boolean value (Ex: if (num)
). How do we know whether or not the number results in true or false?
Our group will be modifying the compiler we have been working with to support a subset of the python language. To accomplish this we will be modifying the existing language, as well as implementing changes to the AST to support some more niche features of Python that are not supported in C.
We need code to generate constant boolean expressions! There is code for int
and float
. Remember from the last assignment how booleans are represented in LLVM?
The boolean or operator is given, though and has yet to be implemented. Make sure your header is and.h
and has the class called ASTExpressionAnd
. Make sure to include the static Create
function as always. Also remember that when executing an and operation, the 2nd operand is not evaluated if the first is false. Also note that making stack variables when not necessary is bad practice. There is no expression that requires declaring a stack variable. You are not allowed to use alloca
here. No heap memory either.
Here are some hints:
- How and statements work is a bit tricky to understand, so an example helps.
Consider the following:
if (x && y) doThing();
Note that one property of the && operator is that if x is false, we don't calculate what y is to save resources. But this involves jumping as well, so we must have a layout such as this:
So far, we have arrived at the Python general grammar. After reading through it, we have created a subsection of the grammar that only relates to the functionalities that we plan on implementing in our compiler. Our next step is to start implementing our grammar. We referenced the Official Python site to design our new grammar. We have written the lexer and parser.
- Finish up function decerations
- Fix Else statement token issue.
- Refer to Readme within
src
folder.
So if we are ignoring the function call we do, tmp
ends up representing the result of x && y
. We know that if we don't need to check the right side, we have a false value. We also know that the final result depends on the right side. If we can't use stack memory, how do we get the result? There are 2 ways to do this, and one is much easier than the other :>
So far, you have been working with expressions. For loops are statements rather than expressions. Create for.h
in the src/statements
folder and have the class be named ASTStatementFor
. Its constructor should take an std::unique_ptr<ASTStatement>
for the body, init, condition (actually this one is std::unique_ptr<ASTExpression>
), and increment in that order. Note that anything but the body is allowed to be a nullptr and should be by default.
Here are some hints:
- It is completely possible for a for's condition to never be true, so even if does return something it's not confirmed.
- An exception to the above is if the init function returns something.
Suppose we have the following for loop example:
for (init(), condition, increment())
{
doThing();
}
We can desugar it to the following:
// Redacted.
This is quite literally just the while loop with extra steps. But what if any of these components are null?
- init - Just don't include the statement in the entry block.
- condition - Just always jump to the for loop body.
- increment - Just exclude the increment part from the for loop body.
Another thing to note is that if any of these return at any time we are just done building the loop.
Welcome to the AST! Feel free to extend this for your own future projects. But how does it work? Like LLVM, the AST itself has multiple parts that work together.
- VarType - Represents a high level type. A type must be able to be converted to an LLVM type and check for equality with other types.
- Expression - Anything that results in a value without the need to declare stack memory or basic blocks (though you still may use basic blocks if it makes things easier). An expression must determine whether or not it results in an L-value or R-value. There are also static casting functions given to help if needed.
- Statement - Requires creating basic blocks but no stack allocations.
- ScopeTable - Maps variable names to types and values. Currently, ASTs have a scope table for "globals", and each function has a scope table for locals. If a local variable is not in scope, it will search for globals. This scoping system does not support arbitrary layers for this simple language, though you are welcome to add it if you want.
- Function - Declared with a function type and name, have stack variables declared in advance and can be given an optional definition. Functions are allowed to be variadic (have infinite parameters).
- AST - The abstract syntax tree. It contains functions, a scope table, and a function pass manager for optimizations. The
demo.cpp
files shows how everything joins together.
- L-Values are anything you can load from or store to (like a variable).
- Raw addresses are not L-Values, but they can be converted to one by using the dereference
*
operator. - You can only assign to L-Values (when's the last time you ever assigned to
5
or the result ofa + b
)? - Any L-Value can be converted to an R-Value by loading from the L-Value.