Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in crun 1.16.1 causing occasional error "corrupted size vs. prev_size in fastbins" #1537

Closed
bduffany opened this issue Sep 1, 2024 · 3 comments · Fixed by #1538

Comments

@bduffany
Copy link
Contributor

bduffany commented Sep 1, 2024

We are running some workloads using crun which started occasionally failing after upgrading to crun version 1.16.1. In a significant fraction of these workloads, we started seeing the error "corrupted size vs. prev_size in fastbins." I'm not very familiar with this error, but some basic searching indicates that this might be happening due to an invalid memory access.

Notably, the issue doesn't reproduce when running crun via podman. Our usage of crun is a bit unusual - we are not invoking it via podman and we also don't run conmon, because we are trying to reduce overhead as much as possible. Instead, we generate an OCI container spec and then directly invoke crun. We tried to make the generated container spec match podman's as closely as possible.

To try and find the exact commit where this issue was introduced, I did a git bisect between 1.16 (a known good version) and 1.16.1 (the known bad version), building statically linked crun with the nix method outlined in the README. The bisect revealed that this behavior started happening in commit 72b4eea.

Unfortunately, I don't have a minimal repro yet... it's difficult to find a minimal repro in this case because these are customer workloads where we don't have access to the source code. What I do know is that the executable for this workload appears to be nodejs 22. Unfortunately, strace proved unhelpful, because the issue doesn't reproduce under strace. Even worse, if I wrap the workload with any wrapper process whatsoever (even just a simple sh -c 'exec <executable> <args...>), the issue doesn't reproduce.

I am continuing to investigate to see if I can find a minimal repro, and I am also planning to dig into that commit to see if I can find exactly what it changed that might be causing this bug, but I thought I would file a report anyway just to raise an early warning, and just so that other people can more easily find this issue if they start seeing this behavior too (it took me a long time to track down this commit as being the culprit).

@bduffany bduffany changed the title Regression in crun 1.16.1 causing memory issue "corrupted size vs. prev_size in fastbins" Regression in crun 1.16.1 causing occasional error "corrupted size vs. prev_size in fastbins" Sep 1, 2024
@bduffany
Copy link
Contributor Author

bduffany commented Sep 1, 2024

Looking at the change in 72b4eea

      if (process->user == NULL && container->container_def->process->user)
        process->user = container->container_def->process->user;

I wonder if this should be a deep-copy of container->container_def->process->user, rather than just copying the pointer? Similar to how the preceding lines in this file are using xstrdup to create a clone of the apparmor profile / selinux_label?

If container is recursively freed and then process->user is dereferenced afterwards, then there is a use-after-free bug here - though I haven't followed the control flow yet to see whether this is what's happening in practice.

@bduffany
Copy link
Contributor Author

bduffany commented Sep 2, 2024

After tracing the execution a bit more (using print statements 😅) I think there is actually a double-free bug, not use-after-free like I originally thought.

  • In crun_command_exec here, there is a deferred recursive free on the process object
  • In libcrun_container_exec_with_options here there is a deferred recursive free on the container object
  • Both objects are referencing the same process->user pointer at the time they are cleaned up, so the cleanup in crun_command_exec does a double-free on the process pointer.

If I add a print statement before and after the call to free_runtime_spec_schema_config_schema_process in exec.c, only the first print statement is executed, which I think basically confirms that it's doing a double-free?

@bduffany
Copy link
Contributor Author

bduffany commented Sep 3, 2024

Just commenting to link #1543 to this issue since it addresses an issue with the original fix in #1538

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant