Enable new exception handling on win-x86 #115957
Conversation
/azp run runtime-coreclr outerloop
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
1 similar comment
Azure Pipelines successfully started running 1 pipeline(s).
I'm not so sure it's the same root cause. I'll look at the dumps later on.
The JIT.opt stress test is failing when reporting GC references. More than one value in the dump seems off, but at least it explains why I didn't see this in the earlier gcstress pass. Disassembly:
Locals in EnumGCRefsX86:
GC info:
Presumably it's trying to report
Obligatory JIT dump for the above: https://gist.githubusercontent.com/filipnavara/87489fc9b6224706111055ce1ee79583/raw/80122358359dbe95e53fd7708c59611f7024aa87/gistfile1.txt
I'm starting to suspect that the liveness of the variable may be misreported on other platforms too. We just get away with it because the funclet prolog is marked as a no-GC region. x86 JIT dump: https://gist.githubusercontent.com/filipnavara/87489fc9b6224706111055ce1ee79583/raw/80122358359dbe95e53fd7708c59611f7024aa87/gistfile1.txt
It is an expected invariant that prolog/epilog are no-GC regions. Why did this issue not manifest for filter funclets in the old scheme (because filter-live variables are always pinned?)?
Yeah, and that's fine. x86 GC info always encoded the (non-funclet) prologs/epilogs and treated them as no-GC on the VM side. Until #115630 the funclet prolog was empty on win-x86, so we didn't run into this problem because there was no code in the no-GC region. Likewise, the funclet epilog is a single "ret" instruction, so we don't really run into the problem there either. (These assumptions are not true for linux-x86, which has an additional stack alignment instruction and would likely have run into this problem if someone ever got far enough to run GC stress tests.) I did, however, expect that calling
Not sure I know the precise answer. I remember stumbling upon some specific
Looks like we have a comment that this is indeed intentional:
void CodeGen::genReserveFuncletProlog(BasicBlock* block)
{
assert(compiler->UsesFunclets());
assert(block != nullptr);
/* Currently, no registers are live on entry to the prolog, except maybe
the exception object. There might be some live stack vars, but they
cannot be accessed until after the frame pointer is re-established.
In order to potentially prevent emitting a death before the prolog
and a birth right after it, we just report it as live during the
prolog, and rely on the prolog being non-interruptible. Trust
genCodeForBBlist to correctly initialize all the sets.
We might need to relax these asserts if the VM ever starts
restoring any registers, then we could have live-in reg vars...
*/
noway_assert((gcInfo.gcRegGCrefSetCur & RBM_EXCEPTION_OBJECT) == gcInfo.gcRegGCrefSetCur);
noway_assert(gcInfo.gcRegByrefSetCur == 0);
JITDUMP("Reserving funclet prolog IG for block " FMT_BB "\n", block->bbNum);
GetEmitter()->emitCreatePlaceholderIG(IGPT_FUNCLET_PROLOG, block, gcInfo.gcVarPtrSetCur, gcInfo.gcRegGCrefSetCur,
gcInfo.gcRegByrefSetCur, false);
}
Thanks. I adjusted emitting the no-GC regions for funclet prologs/epilogs, but I will look into this at some later point to see if we can improve the codegen. We no longer establish the frame pointer in funclet prologs on any platform, so the reasoning makes little sense.
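As a side note for readers of this thread, here is a minimal conceptual sketch of the convention being relied on above; the struct and helper are illustrative stand-ins, not the actual x86 GC info encoding or the gc_unwind_x86.inl decoder.

// Conceptual sketch only: the x86 GC info records prolog/epilog sizes, and the VM
// treats those byte ranges as no-GC, so no offset inside them is ever treated as a
// GC-safe point. Any references live across them must be covered by the
// surrounding call-site/interruptible data.
struct FuncletCodeRange
{
    unsigned prologSize; // bytes of prolog at the start of the funclet
    unsigned epilogSize; // bytes of epilog at the end of the funclet
    unsigned codeSize;   // total size of the funclet code
};

static bool IsOffsetInNoGCRegion(const FuncletCodeRange& range, unsigned relOffset)
{
    // Offsets inside the prolog or epilog are never interruptible.
    return (relOffset < range.prologSize) ||
           (relOffset >= range.codeSize - range.epilogSize);
}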
/azp run runtime-coreclr gcstress0x3-gcstress0xc
/azp run runtime-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
1 similar comment
Azure Pipelines successfully started running 1 pipeline(s).
runtime/src/coreclr/vm/syncblk.cpp Line 2019 in d896e85
failed:
Yeah, I downloaded the artifacts for BestFitMapping to investigate. I didn't expect a clean run just yet but it's already looking better than the first attempt.
The BestFitMapping failure is not like the other failures I have seen. The object is valid, but it has a header pointing to a sync block. The sync block table has 2 free entries. That's not really my area of expertise, so any advice is welcome. The other failure, Interop/COM/NETClients/IDispatch/NETClientIDispatch/NETClientIDispatch, was not reproducible on my machine the last time it occurred. However, it does happen consistently on the CI machines. I'll have another look.
Previous discussion at #74741.
Is it failing on x86? https://helix.dot.net/api/2019-06-17/jobs/05d45426-3dee-421e-a098-6215d618b5ce/workitems/Interop.0.1/console suggests it has passed.
The failure log is here: https://helixr1107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-115957-merge-e222de0ef7c24526a4/Interop.0.1/1/console.0d3011b6.log It is failing at:
We are doing a GC in the middle of this filter:
My guess is that the following condition has something to do with it:
runtime/src/coreclr/vm/gc_unwind_x86.inl Lines 3675 to 3680 in d896e85
That sounds extremely plausible. It fails to skip over the table data for untracked variables.
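To make the failure mode concrete, a rough sketch of the idea follows; the variable-length encoding below is a stand-in, not the actual x86 GC info format, and the function name is hypothetical.

#include <cstdint>

// Hypothetical sketch: skip 'untrackedCount' variable-length entries (7 bits per
// byte, high bit set means "more bytes follow"). The untracked-variable table
// precedes the tracked pointer-lifetime data, so if an early-out condition causes
// this loop to be skipped, the cursor stays on the untracked entries and every
// field decoded afterwards is garbage.
static const uint8_t* SkipUntrackedTable(const uint8_t* cursor, unsigned untrackedCount)
{
    for (unsigned i = 0; i < untrackedCount; i++)
    {
        while (*cursor++ & 0x80)
        {
            // keep consuming continuation bytes of this entry
        }
    }
    return cursor; // now positioned at the tracked pointer-lifetime section
}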
Interesting. It does sound awfully similar. The difference is that we are dealing with WKS GC here and that we are just about to mark the object, so the initial conditions are quite different.
Failing at:
We just returned from the fault funclet:
runtime/src/tests/JIT/Methodical/Boxing/seh/fault.il Lines 447 to 487 in d896e85
runtime/src/coreclr/nativeaot/Runtime.Base/src/System/Runtime/ExceptionHandling.cs Lines 1218 to 1237 in d896e85
The problem is that we are trying to report GC references based on the IP of the call inside the try region (the exception was thrown inside this call). These GC references are no longer valid:
runtime/src/coreclr/vm/gc_unwind_x86.inl Line 3412 in d896e85
Fault funclets should work exactly the same as finally funclets. Fault funclets are rare since they are not generated by C#. Are we missing a check for the fault funclets somewhere?
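For illustration, the kind of check being asked about might look like the sketch below. The flag values match CorExceptionFlag in corhdr.h, but the helper itself is hypothetical and not the actual VM code path.

#include <cstdint>

// Flag values as defined by CorExceptionFlag in corhdr.h.
const uint32_t COR_ILEXCEPTION_CLAUSE_FINALLY = 0x0002;
const uint32_t COR_ILEXCEPTION_CLAUSE_FAULT   = 0x0004;

// Hypothetical helper: a fault clause is just a finally that runs only on
// exception, so any funclet special-casing that tests only the FINALLY flag
// silently mishandles .fault handlers (C# never emits them, but ILASM tests
// such as fault.il do).
static bool IsFinallyOrFaultClause(uint32_t clauseFlags)
{
    return (clauseFlags & (COR_ILEXCEPTION_CLAUSE_FINALLY | COR_ILEXCEPTION_CLAUSE_FAULT)) != 0;
}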
Nvm, I see it now.
I suspect the Methodical_r1 test may be failing because this code path is not implemented for
runtime/src/coreclr/vm/gcenv.ee.common.cpp Lines 320 to 350 in d896e85
It would have been taken otherwise based on the state of the variables.
Failure:
This assert is one of the typical GC hole symptoms. It is likely to be hit anytime we miss reporting an object with a sync block. The offending object in this case is a marshalled delegate. Marshalled delegates have syncblocks.
It likely means that the object was collected, but the space was not reused by the GC for a new object yet. We likely missed reporting the slot in some earlier GC, but reported it again now. The callstack looks like this: BestFitMapping!ILStubClass.IL_STUB_PInvoke -> native code -> BestFitMapping!ILStubClass.IL_STUB_ReversePInvoke
Let's take a look at what
This GC was triggered in the finally funclet inside
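A rough model of why a missed report surfaces as this particular assert follows; the structures and names are simplified stand-ins, not the real syncblk.cpp code, and the validation shape is an assumption about what GC stress checks for reported objects.

#include <cstddef>

// Simplified stand-in for a sync block table entry.
struct SyncEntry
{
    const void* owningObject; // object this sync block belongs to
    bool        inUse;        // false once the entry has been freed
};

// An object that was collected but whose memory has not been reused yet still has
// a plausible MethodTable, so the first validation to fail is often the sync block
// check: the entry its header points at was freed (or recycled for another object)
// when the object itself was collected.
static bool ValidateSyncBlockIndex(const SyncEntry* table, size_t tableSize,
                                   size_t index, const void* obj)
{
    if (index == 0 || index >= tableSize)
        return false; // header points outside the table: clearly stale
    const SyncEntry& entry = table[index];
    return entry.inUse && entry.owningObject == obj;
}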
BestFitMapping was fixed by b9cd5ac. Could you rerun gcstress?
This was a fix for Interop/COM/NETClients/IDispatch/NETClientIDispatch/NETClientIDispatch. I do not think BestFitMapping is fixed.
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
I have a fix locally for Methodical_r1.0.1. I'll push it after going through more tests and once the current GCStress run finishes. UPD: Hmm, still needs some refining. UPD2: As usual, the refining consisted of actually saving a file before starting the compilation... (ac4373f)
Switches the JIT to the funclet model used by other platforms, and switches the VM to the new exception handling introduced in .NET 9.
Fixes #113985