-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assertion `nwritten == buf_size' failed to hold (Again) #3779
Comments
So, I did a bit more digging - the problem seems to be that two threads have their rr scratch mappings right next to each other. From my output gist:
the mapping is an RR scratch mapping (it's rwxp) but it's gown from I actually managed to fix this I think, with this patch, which whacks an extra guard page at the end of the scratch mapping. This was actually discussed in #3325 -
I'm hesitant to open up a PR with this because I don't really understand why the signal handler frame is running 80 bytes past the end of the syscallbuffer, but this patch does seem to fix my problem so hopefully that's instructive to people who do get how this all works? |
This should hopefully resolve rr-debugger#3779
Hmm, so the issue here is that we read/write off the end of the buffer (because our signal handler stack frame size is a conservative estimate), and then the mapping beyond exists during recording but doesn't during replay? |
Yes, that’s what is implied by the logs printed during recording by the |
In the original trace at the top of the issue the scratch buffer of tid 20517 (cloned from 20235) is placed at 0x7ff8cbc00000 at event 66034. Earlier at event 65800 tid 20235 vforks to tid 20516 and it's scratch buffer is placed at 0x7ff8cbe00000. 20516 then execves at 65952. The vfork returns at event 66032 (i.e. immediately before tid 20517 is created). I suspect the vfork/execve is relevant here. |
I don't see why the scratch buffer at 0x7ff8cbe00000 would be hanging around though. I can't reproduce this in a simple example, and I would expect that this code Lines 1062 to 1070 in 77f88f4
|
I don't think we actually want a PROT_NONE guard page here. It would mean consuming an additional VMA per thread increasing our overhead for large processes. |
I changed Ruby (the program under test) to call
I agree. What I actually wanted to do first was name the VMA with
I don't think I'm seeing the warnings that prints, but I'll add some extra logging around there and see if anything interesting turns up. |
Oh no, I spoke to soon. When recording with
So this is happening. |
That task seems to be explicitly excluded because of EDIT: Sorry, you mean the task that called |
Right. |
Does this fix your issue? diff --git a/src/AddressSpace.h b/src/AddressSpace.h
index 2eae98a3..cb095623 100644
--- a/src/AddressSpace.h
+++ b/src/AddressSpace.h
@@ -315,6 +315,8 @@ public:
IS_RR_PAGE = 0x8,
// This mapping is the rr vdso page
IS_RR_VDSO_PAGE = 0x10,
+ // This mapping is the per-thread scratch space.
+ IS_SCRATCH = 0x20,
};
uint32_t flags;
};
diff --git a/src/record_syscall.cc b/src/record_syscall.cc
index 4817818e..a74b24fd 100644
--- a/src/record_syscall.cc
+++ b/src/record_syscall.cc
@@ -3375,6 +3375,7 @@ static void init_scratch_memory(RecordTask* t,
KernelMapping km =
t->vm()->map(t, t->scratch_ptr, sz, prot, flags, 0, string());
+ t->vm()->mapping_flags_of(t->scratch_ptr) |= AddressSpace::Mapping::IS_SCRATCH;
struct stat stat;
memset(&stat, 0, sizeof(stat));
auto record_in_trace = t->trace_writer().write_mapped_region(t, |
I have no idea how, but yes that does fix my issue! |
Fixes rr-debugger#3779 (somehow)
Setting the IS_SCRATCH flag causes
Cool. I still don't understand why the scratch buffer is leaked in this particular case but given that there are certainly general cases where they can leak I think we need to take something like this. |
Ah - I see, this check becomes true - Line 1760 in 77f88f4
|
One way to fix this would be for the callers of (Also, in the absence of I'd really prefer to avoid having to track |
One thing that might work well enough is to add a flag to |
It sounds like you're proposing:
However, if the only place we think this can happen is in the Line 1218 in 03f6446
I'm happy to have a go at implementing either of these. I'm definitely not very knowledgeable about all this stuff though, so apologies if I've said something dumb! |
Yes, that should work. There is another use of |
I opened a PR for this #3800. Let me know what you think - I don't write much C++ so some of my taste is pretty awful!. |
This appears to be a regression of #3325 in just about every way, but I figured I'd open a new issue since that was 2 years old.
I've got a recording that crashes during replay 100% of the time with this exception. I can also semi-reproducibly re-record it and obtain a crashing replay again.
Tail of trace dump:
The program is the Ruby test suite. If you want to download 500MB of build product, I tar'd up the directory here: https://1drv.ms/u/s!AjbrX5Fo8lg8jZ4N9tW3fEUTFPd-JA?e=zFApbp
There's a packed trace exhibiting this behaviour in
ruby/build/rr_trace
. You can record another one withcd build; rr record --output-trace-dir rr_trace -- make TESTOPTS="-v --tty=no $(realpath ../bootstraptest/test_ractor.rb)" btest
Like in #3325, the captured write seems to be past the end of a scratch area. I ran
info proc mappings
from the emergency gdbserver and saw that the mapping it was writing into wasrwx
which I guess probably means it's a scratch mapping from RR itself?info proc mappings
Also, based on the steps in #3325 (comment), I also applied a similar patch to my rr:
This is the output I get with it (both record and replay): https://gist.github.com/KJTsanaktsidis/24cf028f25e7c511b5deee3e62643dd4 (n.b. ignore the crashing Ruby process/stack trace it prints out; that's the issue I wanted to debug, but I got distracted by this instead!).
Also, my machine:
/proc/cpuinfo
Thanks for taking a look at this :) If I can do anything to debug this further please let me know!
The text was updated successfully, but these errors were encountered: