Guide: Debugging an issue across the RV stack #4542
Comments
@JuanCoRo asks:
The point is to get away from Nix builds of K if you think you might want to tinker with K internals, or if the bug is likely to be in K itself. If the bug is actually more likely to be in your semantics / ecosystem project, then using the Nix build is fine.
Fundamentally, where I find the local checkout works a bit better is when I'm changing multiple components and want to iterate quickly. The other case is using Poetry to point your semantics at an ad-hoc local checkout of Pyk, to add logging or make changes to the underlying Pyk code while you're working:
If you're in a deeply nested context like Kontrol or Kasmer, this is annoying to do via Nix but trivial using a local build.
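As a concrete sketch of the Poetry approach mentioned above (the paths and dependency layout are assumptions; recent K versions ship pyk from the `pyk/` subdirectory of the K repo):

```sh
# From the semantics project: point its pyk dependency at a local checkout so
# that edits (extra logging, experiments) are picked up immediately.
poetry add --editable ../k/pyk
poetry install
```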
If you had to pass -tailcallopt to the code generator to get a tailcc function to emit a tail call when calling another musttail tailcc function, that is a bug in LLVM and should be minimally reproduced in a .ll file and reported upstream to LLVM. The fix you describe is probably fine as a workaround solution if it indeed solves the problem, though.
Alternatively, if the call site isn't musttail, maybe that's the problem?
Yeah, I was aware when writing up that this fix isn't totally in the spirit of your TCO PR, but in the interests of getting my fix merged upstream I opted for the concise band-aid fix.
Almost certainly so: this case would probably be pretty easy to track down the call site for and fix in the style of your PR. You'd probably want to also run through the codebase to see if there are any other places that morally should be musttail.
Introduction
This issue is an informal walkthrough of how I solved a tricky issue in K recently. I'm writing this from the position of the issue in question being fixed, so while I'm confident that everything I've written here is correct, I may be slightly paraphrasing or abbreviating my thought process from the time that I actually fixed the bug.
The issue in question touches pretty much the entire stack of tools we use at RV, from an ecosystem-specific semantics project all the way down to the internal details of the LLVM backend. It should therefore (hopefully!) be a good resource to consult when something's broken and it's not immediately clear how to proceed.
Preliminaries
I intend for this guide to be useful to anyone at RV. Even if you've never worked directly on K, almost all of our ecosystem projects touch it in some way, and having some more context on what might be breaking when you see an error should at least help you to do better initial triage and referral of your issues. If you do work directly on K, then hopefully you'll be able to take away some useful tips for working and debugging across the entire project in practice, rather than just on small examples.
Important
Callouts like this will highlight the most important general principles that become apparent from this walkthrough.
Where possible, I will include details of how you can reproduce the individual steps, or at least a pointer to the code I wrote to try and do so. Please read this walkthrough in detail if you have any interest at all, and if there's anything that's not clear, or if you have follow-up questions for me, please ask!
Thanks to @goodlyrottenapple and @jberthold for their help pinning this issue down originally.
Walkthrough
Initial Report
The first evidence of this issue is a somewhat innocuous message from @jberthold when we were debugging a separate issue in the Kasmer project:
Originally, we'd been debugging an issue that manifested during concrete execution, where the LLVM backend would fail to rewrite a program correctly. After diagnosing and fixing that issue, the Kasmer CI was able to get further through its test suite, but still failed on one configuration during a smoke test of symbolic execution and proving (example CI failure is here):
From the CI error message, it's not clear precisely what's going wrong, or even what part of the stack is responsible: the Pyk code responsible for orchestrating a proof in Kasmer is receiving an empty response from the Haskell Backend, but exactly what's causing that isn't obvious.
Important
It's really useful to have a general picture of our active projects, and the dependency graph between them in your head, for a couple of reasons:
In this case, the project in question is Kasmer-MultiversX. This is a semantics project that has a deeply-nested dependency chain (Kasmer depends on the core MultiversX semantics, which themselves depend on the WebAssembly semantics, and from there on K).
Local Reproduction
Nix CI
The first step to take when investigating any kind of issue is to reproduce it locally on your own machine by recreating as similar an environment as possible to the failing CI run.
Important
Being able to interpret the source code of our GitHub Actions setup is important; knowing what commands are actually being run in CI means that it's much easier to pin down a reproducible example. Every project we maintain has a `.github/workflows` directory that contains YAML files defining what runs in CI; if you see something failing in CI then searching the YAML workflow files for a string you see in the GitHub UI is usually a good place to get started.

In the case of this issue, the failing CI interface looks something like this:
which lets us pin down the relevant section of workflow YAML:
and therefore to a command that we can run locally to see if the failure happens:
$ GC_DONT_GC=1 nix develop .#kmxwasm-test-shell --extra-experimental-features 'nix-command flakes' --print-build-logs --command './package/smoke-test.sh'
In this case, the failure reproduces easily on my machine with this exact command and I can begin narrowing down exactly what causes it.
Note
We got a bit lucky with this issue in that the failure is exhibited in a Nix CI job. Generally, these examples are really easy to reproduce on a local machine because of how Nix isolates build dependencies and environments. A problem would need to be extremely esoteric before it fails to reproduce in a Nix environment on the same OS.[^1]
If your issue isn't in a Nix build on CI, you might need to do some additional setup steps to get a local environment that reproduces the issue. With luck, you can just run the same commands that CI runs on your local machine, but in some cases you might need to set up a Docker image that reproduces the image being run in CI. I won't go into that here as it's not on the critical path, but if you're interested in knowing more about this please let me know.
Using a Local Build
Nix builds have their advantages (i.e. that dependencies are completely pinned, and the build environment is hermetic and reproducible). However, they can be tricky to use for local development because of how the build system works.[^2] I therefore suggest trying to extract a build that uses non-Nix dependencies that also reproduces the original issue; this will generally make it a lot easier to tinker with K / Pyk code as you narrow down on what the problem is.
To do so, what I usually do is check out the appropriate version of K locally and set the PATH variable in your project directory to point at it:[^3]
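A minimal sketch of that setup, assuming a source checkout and Maven build of K in `~/src/k` (the path is hypothetical; with direnv this can live in the project's `.envrc`):

```sh
# Build K from source once, then put its bin directory at the front of PATH
# whenever you're working in the downstream project.
export PATH="$HOME/src/k/k-distribution/target/release/k/bin:$PATH"
which kompile && kompile --version   # should now report the local build
```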
We can see from the project's `flake.nix` that Kasmer is built using `kdist`, and so armed with that local version of K we should be able to build it with:[^4]
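Roughly speaking (the target pattern here is a placeholder; `kdist list` in the project's Poetry environment shows the real target names):

```sh
# Build the semantics targets with kdist, using the local K build on PATH.
poetry run kdist list                            # see which targets the project defines
poetry run kdist --verbose build -j4 'mxwasm*'   # placeholder target pattern
```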
Once this finishes, we can then run the same test command that we were previously running inside the Nix development shell:

On my machine, this happily also reproduced the issue, and so we can leave the Nix configuration behind for the time being.
Bug Reporting
At this point, it's worth taking a step back and looking at what the error messages from the Kasmer smoke test are telling us is happening. We can see that there's a Python stack trace:
which suggests that something is wrong during symbolic execution; specifically that Pyk is getting an empty response back from the KORE RPC server when it makes an execute request for some configuration.
The issue here looks like something crashing in the Haskell backend, so I assembled a bug report from the failing execution of Kasmer.[^5] A bug report here is a set of files zipped together that can be used to replay the exact request-response cycle that triggers the bug. It takes a bit of setup to get a bug report replaying in the backend, so I asked @goodlyrottenapple to take a look at the example I'd collected while I looked at some other work.
Important
If you're working with K it's important that (even if you don't understand in depth how each component works) you understand the broad architecture of how the different pieces fit together. For modern K projects, this usually entails an ecosystem-specific layer of Pyk code that orchestrates calls to one of the K backends (the Haskell backend via an RPC layer, and the LLVM backend via a set of native-code Python bindings). There are a lot of different projects involved when a semantics runs, so it's useful to get a sense of where the data is going.
LLVM Backend
Sam came back to me with the assessment that the requests in my bug report were not triggering a problem in the Haskell backend per se, but were in fact causing the LLVM backend shared library module to crash when the problematic execute request was sent. I verified this by running a slightly different version of the Kasmer command in the smoke test script:
$ kasmer verify test_call_add
This non-booster version runs slowly but successfully, which suggests that the LLVM backend is indeed the issue.[^6]
Note
The Haskell backend uses the LLVM backend heavily to perform concrete simplification and function evaluation. To do so, we compile a definition into a shared library that the Haskell backend can load and make calls to via `dlopen` / `dlsym`.

The Python stack trace doesn't give us any helpful information about why the backend crashed, so I began to wonder at this point what the easiest way to get an LLDB shell for the problematic booster process was. I weighed up a couple of different ideas, each of which would have required a reasonable amount of investigation and inconvenient setup code to get working properly:

- Getting Kasmer to run `lldb kore-rpc-booster` for me, with interactivity working properly.
- Running `lldb kore-rpc-booster` myself in a shell, and having Kasmer connect to it via a known port rather than spawning a server itself.

Fortunately, I found a much easier way to get a debugger shell that involved minimal setup code. This Stack Overflow answer suggests that it's possible to get LLDB to wait for a process with a particular name to launch, then attach to it:
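The technique boils down to LLDB's wait-for-attach mode; a minimal sketch, run before starting Kasmer:

```sh
# Start LLDB and have it wait for the next process named kore-rpc-booster to
# launch, attaching to it as soon as it appears.
lldb --attach-name kore-rpc-booster --wait-for
```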
That way, when `kasmer` eventually starts a booster server after running its preprocessing steps, we'll get dropped into an LLDB shell for the booster process, from where we can try to debug the issue.[^7] To summarise: in one shell, start LLDB and wait for a process as above, then in another shell start `kasmer`:

$ kasmer -C tests/contracts/test_adder_passing verify test_call_add --booster
When we run to the point where the server crashes, we finally have an informative stack trace that shows us what's happening:
Suspicious Symbols
Let's walk through this stack trace to try and get a better picture of what's actually going on in the LLVM backend. At the very bottom (#14) is the Haskell function that's calling into the LLVM backend; the API entrypoint that it reaches first to do so is the `kore_simplify` function, which takes a KORE term as input and performs concrete simplifications on it to produce a result term (essentially, this is a question of evaluating all the function symbols in the input term).

Eventually, the backend needs to evaluate a rewrite rule (the one with ordinal 162908[^8]), and in doing so calls the hook `BYTES.int2bytes`. We can look up the C++ source code for this hook in the LLVM backend repo and confirm that it does in fact call `mpz_clear`.[^9]
.9Warning
At this point, we've reached one of the critical moments in the debugging process that relied on my accumulated time spent working on this code. I'm afraid there isn't a better way to explain my train of thought other than "I've seen something like this before", but hopefully the post-hoc explanation is good enough to justify the leap.
This immediately looks suspicious to me; I know from previous experience that we replace the default memory management functions in the GMP library with our own GC-enabled ones. If for some reason the default `free` is called on a GC-allocated integer, we'll get a segfault much like this one. The previous way I'd seen this happen was if we didn't call `kllvm_init` after loading a shared bindings library. Using LLDB I quickly verified that we were calling this, and so the underlying reason must be different here.

After scratching my head for a while and not getting anywhere, I noticed something odd about the stack trace:
In these lines, the `__gmp_*` symbols are coming from the `kore-rpc-booster` executable, rather than from the GMP shared library. I confirmed that these symbols are in fact in the binary:
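A sketch of the kind of check involved (assuming `kore-rpc-booster` is on PATH; `nm` lists the symbols defined in the binary, and `otool -L` shows which libraries it links dynamically):

```sh
# GMP symbols defined inside the executable itself point at a statically
# linked copy of the library rather than the shared libgmp.
nm "$(which kore-rpc-booster)" | grep -i 'gmpz_clear'
otool -L "$(which kore-rpc-booster)" | grep -i gmp
```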
Looking further, I also found a dependency that looks plausibly like it might statically link GMP.[^10] Another old PR came to mind at this point: previously, we'd fixed an issue where a function called `step` in the LLVM backend would get mixed up with an identically-named function in libc. Perhaps the same is happening here.

Minimal Example
At this point, I've got a reasonable hypothesis about what might be happening:
We can now make a minimal example using the LLVM backend's testing framework for the C shared library bindings. This K code:
should crash if compiled into a bindings library and executed on macOS. Happily, this is exactly what does happen, and we can be reasonably sure that the example is a faithful distillation of the original issue.
Important
There are lots of different setups for running tests across K and the various subprojects, and it can be tricky to figure out the exact idioms for each one. After years working on K I still occasionally find myself asking people how to run tests in a particular place, but it's worth having a general idea of how each project's test suite is invoked.
This section also highlights a really important principle: it's substantially easier to debug an issue if you've minimised it and extracted the key parts into a smaller reproduction. Here, instead of a 20-minute semantics recompile and remote LLDB session, I've turned the original issue into a single C+K combination that runs (and fails, importantly!) in the backend's own test harness. This makes my iteration time seconds rather than half an hour when experimenting.
Research: macOS Dynamic Linking
Having validated this hypothesis, we now have to put the legwork in and do some research. I don't fully understand why the issue happens, but need to try and figure it out in order to fix things. I begin by looking at the `llvm-kompile-clang` script, which is an internal part of the backend's compiler that deals with invoking Clang properly to link the backend's runtime with a module generated from a semantics. A quick `grep` for `darwin` identifies a few places where we specialise behaviour for macOS, one of which looks relevant to the linker and symbol resolution issues we're investigating:
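A sketch of that search, with the script path taken from a source checkout of the LLVM backend and the flags paraphrased from memory rather than quoted exactly:

```sh
# Look for macOS-specific branches in the compiler driver script.
grep -n -i -E 'darwin|apple' bin/llvm-kompile-clang
# The relevant branch adds linker options along these lines when targeting macOS:
#   clang++ ... -Wl,-flat_namespace -Wl,-undefined,dynamic_lookup ...
```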
After some time googling, it becomes clear that perhaps the `-flat_namespace` option is problematic; this issue suggests that enabling it causes symbols to be mixed up between different dynamically-linked libraries. This sounds very plausibly like what we're dealing with. I removed the option, rebuilt the backend, and ran the minimal test case from before: no crash! It appears that this option is indeed the problem.[^11]

Fixing the Issue
With the problematic flag removed, I re-ran the LLVM backend test suite and found a couple of breakages:
Some debugging reveals that all three of these are instances of the same abstract problem: with flat namespaces enabled, we had accidentally been linking two incompatible copies of the same library and relying on the dynamic loader coincidentally resolving the symbol we wanted; once each reference resolves to the library it was actually linked against, the incompatibility is exposed. The fix in each case is to link only one canonical copy of each library.[^12]
After doing so, we can re-run the test suite:
The LLVM backend test suite is now working, so I push my branch up to GitHub and watch the CI runs to get a sense of how it's doing on other platforms.
Stack Overflow
Everything seems fine from the perspective of the LLVM backend's CI, so the next step to take is to retry the original (slow) reproduction from the Kasmer project.[^13] Because we've modified the LLVM backend, we need to tell Nix to point at that version rather than the version locked by Kasmer's default flake inputs:
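Concretely, this is done with Nix's flake input overrides; the nested input name (`k-framework/llvm-backend`) and the branch below are assumptions, and `nix flake metadata` in the Kasmer repo will show the real input paths:

```sh
# Re-enter the test shell with the LLVM backend input overridden to the branch
# carrying the fix, then re-run the smoke test as before.
GC_DONT_GC=1 nix develop .#kmxwasm-test-shell \
  --extra-experimental-features 'nix-command flakes' \
  --override-input k-framework/llvm-backend github:runtimeverification/llvm-backend/my-fix-branch \
  --print-build-logs \
  --command './package/smoke-test.sh'
```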
Important
K is made up of a lot of moving parts, and when debugging an issue it's important to be able to construct a working K / downstream project installation that contains the components you're interested in at fixed versions. In the line above, I'm using Nix flake input overrides to point the LLVM backend at the version with our bug fix applied. There are other ways to do this: for example, you could check out a "local monorepo" of K and apply changes directly in submodules, and for Python dependencies Poetry will let you point at local checkouts. If you can't figure out how to produce "project X with dependency Y overridden", then ask in #k!
We're getting the same error again despite our fix being applied and the entire K toolchain having been rebuilt.[^14] It seems like we're back at square one on our debugging journey despite having fixed the issue in the minimal reproduction of the original problem. My first thought at this point is that there must be some property of the original Kasmer issue that I failed to capture properly in the minimal reproduction. However, using the same remote LLDB shell trick from earlier reveals a stack trace that looks a bit different to before:

This trace looks suspiciously like a stack overflow (that is, we've made too many recursive calls, and the process has run over the end of the stack space that the operating system has given it; doing so produces `EXC_BAD_ACCESS` on macOS). However, it does suggest that the execution is getting further than it was when we were crashing because of `__gmpz_clear`.

@jberthold helps me identify (on Slack) that this isn't anything to do with GHC's worker threads or Haskell compiler settings, and so I poke a bit more at the call stack and see that each frame in the backtrace is trying to print a symbol like `Lbl'UndsUndsUnds'WASM-COMMON-SYNTAX'Unds'Instrs'Unds'Instr'Unds'Instrs{}`. This suggests that it's a legitimate stack overflow encountered when printing a large K cons list out.

Research: Backend TCO Changes
I suspect at this point that there may be something else that's changed in the LLVM backend, so I look at the range of K versions that the original problematic PR is changing:
Then, you can use GitHub's compare tool to see what's different between these K versions: v7.1.30...v7.1.70. We're interested in the LLVM backend, and can see that it's been bumped by a few versions:
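If you prefer the command line to the compare view, a sketch of the same check (the submodule path is from a recent K checkout and may differ between versions):

```sh
# See how the LLVM backend submodule pointer moved between the two K releases.
git clone https://github.com/runtimeverification/k && cd k
git diff v7.1.30 v7.1.70 -- llvm-backend/src/main/native/llvm-backend
```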
Looking at the commits in that range, there's one smoking gun that jumps out as an obvious candidate: runtimeverification/llvm-backend#1097. This PR makes some changes to how tail-call optimisation in the LLVM backend is applied, which seems very likely to be involved in a new stack overflow appearing when there wasn't one before.
Fixing the Issue (Part 2)
The fix for this issue turned out to be reasonably straightforward. Per Dwight's PR description:
The PR diff identifies two such cases where we add the `musttail` attribute to function calls, neither of which corresponds to the `visit_children` call from the stack trace. The PR suggests that it removes the need to pass `-tailcallopt` to LLVM, and so I do a quick grep through the LLVM source code to identify what that option actually does internally. I realise that, as well as having IR-level effects on the generated LLVM, it sets a flag for the low-level code generator that we currently aren't setting in the backend:
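A sketch of that grep (paths refer to a checkout of the llvm-project monorepo and move around between releases; `GuaranteedTailCallOpt` is the flag I believe `-tailcallopt` feeds):

```sh
# Where the -tailcallopt command-line option is declared...
grep -n 'tailcallopt' llvm/lib/CodeGen/CommandFlags.cpp
# ...and the TargetOptions flag it sets, which the backend wasn't setting.
grep -n 'GuaranteedTailCallOpt' llvm/include/llvm/Target/TargetOptions.h
```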
I enable this flag, and set the default optimisation level for Kasmer to `-O1`,[^15] then push my changes and retry the original issue reproduction again:

The test passes!
Success and Final Steps
At this point, the issue in the original reproduction is fixed, and I can compile my changes into a coherent PR, ready for review: runtimeverification/llvm-backend#1109.
Important
There are a couple of important things to take care over even once a tricky problem is resolved. The first is to give your PR a detailed writeup, so that in the future anyone looking at this code can understand your reasoning and motivation for making a change. It's much harder to reverse engineer from an empty PR message!
The second is to babysit your changes if they need to reach a downstream project. Here, I was waiting for my PR to reach Kasmer, and so I made sure to keep an eye on and approve each of the dependency update PRs (to K, the WASM semantics, the MX semantics, and eventually Kasmer) in turn as they arrived to make sure they would pass CI. If you don't know how to do this or where your code will end up after it's released, please ask someone!
Summary
This is a long read, but I'd be really grateful for any feedback, comments or thoughts. If there's anything at all where I've glossed over a detail, or you'd like some more insight into how I reached a conclusion or solved a problem, please comment. I want to leave this document as a useful resource for RVers, and so if there's anything I can possibly add to it I'd be really pleased.
Footnotes
[^1]: This is tempting fate, obviously.
[^2]: We can largely leave Nix behind soon in this guide, but if you'd like to know more about how our Nix infrastructure works then @goodlyrottenapple is the right person to talk to.
[^3]: Using direnv is a nice way to automate this.
[^4]: This will take a very long time (20+ minutes on some machines); it's a known issue with this project.
[^5]: I did have to patch the Kasmer project to get this to work properly; I would say that for any project using the symbolic backend in earnest, it's critical that the bug reporting feature is wired up to the command-line interface properly.
[^6]: The non-booster version doesn't load an LLVM backend shared library; it executes using pure Haskell.
[^7]: Note that if you end up using this technique in practice, you'll need to disable bug reporting. When Pyk populates a bug report, it runs `kore-rpc-booster --version` before spawning the actual server, so you'll end up attached to the version process. Getting LLDB and a bug report is left as an exercise to the reader and will require some modification to Pyk.
[^8]: The LLVM backend assigns every rewrite rule (axiom, really) in a KORE definition a unique integer identifier (ordinal). I didn't ever actually need to do this to debug the issue, but you can use `llvm-kompile-compute-loc` to tell you what rule has a given axiom.
[^9]: The details aren't critical, but I believe `mpz_clear` is a macro wrapper around the internal function `__gmpz_clear`. I didn't need to go further into the weeds than this, but they're 1:1 in correspondence as far as I can tell.
[^10]: @goodlyrottenapple made a heroic effort at this point to try and excise the static dependency from the booster, but it was too tightly baked into GHC to be viable.
[^11]: Feel free to follow up with me in more detail if you're interested in the internals of the macOS dynamic linker; this description is really under-selling how much research it took to convince myself of what the morally correct solution was here! It's just not directly relevant to the K discussion, so I'm omitting the goriest parts.
[^12]: The debugging here is standard C++ debugging; I'm again glossing over some details to stick to the "K stack" story. I'm happy to explain more!
[^13]: Remember that? Feels like a long time ago, right?
[^14]: At this point, by reading, you get to save approximately an entire day of developer time.
[^15]: Per Dwight's PR: "Convert TailCallElimination and Mem2Reg from required to optional passes on `-O0`."; at `-O0` we'll still get the huge stack trace because TCO won't be enabled everywhere.