Crash during MTLDispatchListApply #225
This looks GC related. An MWE that somewhat reliably triggers the issue (after doing the
using Test, Metal, BenchmarkTools
function vadd(a, b, c)
    i = thread_position_in_grid_1d()
    @inbounds c[i] = a[i] + b[i]
    return
end
dims = (3,4)
a = round.(rand(Float32, dims) * 100)
b = round.(rand(Float32, dims) * 100)
c = similar(a)
d_a = MtlArray(a)
d_b = MtlArray(b)
d_c = MtlArray(c)
len = prod(dims)
using BenchmarkTools
f() = @metal threads=len vadd(d_a, d_b, d_c)
@benchmark f()
With |
Doesn't crash on 1.8, so I wonder if it's also related to thread adoption. FWIW, the back trace:
What I think is the function that contains the faulting instruction:
Sadly, we don't have a way to map this back to the Julia function (@vchuravy why again doesn't this work on macOS?)
I tried manually rooting all arguments that involve the onCompletion callback (i.e., the ObjectiveC block, arguments, closures, etc), to no avail. |
The prologue in there:
... or
... seems to match (44560 + 42573<<16 + 1<<32 == 7085075984) that of a Julia function:
%2 = call {}*** inttoptr (i64 7085075984 to {}*** (i64)*)(i64 261) #6
... where
So this definitely seems like a Julia function we're crashing in. Annoyingly, I can't look up which one; that would help tremendously. |
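The arithmetic in that comparison can be checked directly. This is only a sketch: it assumes the three numbers are 16-bit immediates that the prologue places in successive 16-bit slots, lowest first (the disassembly itself is not shown here), and the name `chunks` is mine.

```julia
# 16-bit immediates quoted in the comment above, lowest slot first.
# Assumption: each one occupies the next 16 bits of the address.
chunks = (44560, 42573, 1)

# Shift every chunk to its slot and add them up.
addr = sum(c << (16 * (i - 1)) for (i, c) in enumerate(chunks))

println(addr)                     # 7085075984
println(string(addr, base = 16))  # 1a64dae10
```

The result equals the constant in the `inttoptr (i64 7085075984 ...)` call, which is what makes the match plausible. The unparenthesized expression in the comment also evaluates as intended in Julia, where `<<` binds tighter than `+`; in C it would need parentheses.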
Lo and behold, I got a backtrace (not sure how, can't reproduce):
That
EDIT: the reason I got a backtrace here is probably that most of the time this crashes during the safepoint load in the cfunction trampoline, but occasionally it crashes during a later safepoint load, while executing regular Julia code that does have more debug info. |
Also encountered this:
That's on |
OK yeah this is just the safepoint segfault not being caught by our handlers:
Adding some instrumentation to
It also looks like the GC is running at the time of the fault:
All the other threads have reached a safepoint, but we haven't:
So I guess there's still something wrong with adopted threads and safepoints. While JuliaLang/julia#49934 fixed the crash when
FWIW, the disassembly of this function:
|
Can't seem to reproduce this in isolation though. I tried:
using pthreads
mutable struct ListNode
    key::Int64
    next::ListNode
    ListNode() = new()
    ListNode(x) = new(x)
    ListNode(x, y) = new(x, y)
end
function list(n=32)
    start::ListNode = ListNode(1)
    current::ListNode = start
    for i = 2:(n*1024^2)
        current = ListNode(i, current)
    end
    return current.key
end
function thread()
    println("Thread starts")
    ccall(:usleep, Cint, (Cint,), 100000)
    println("Alloc")
    arr = Array{Int}(undef, 1024*1024)
    GC.@preserve arr begin
        println("Thread done")
    end
    return 42
end
function main()
    # create some garbage
    x = list()
    GC.@preserve x begin
        println("Got garbage")
        @time GC.gc(true)
        thread = pthread() do
            println("Thread starts")
            ccall(:usleep, Cint, (Cint,), 100000)
            println("Alloc")
            arr = Array{Int}(undef, 1024)
            GC.@preserve arr begin
                println("Thread done")
            end
            return 42
        end
        sleep(0.1)
        println("Start GC")
        GC.gc(true)
        println("End GC")
        wait(thread)
    end
end
main()
... which triggers an allocation from an adopted thread while the GC is active:
... but doesn't crash. |
The signal handler also still seems to be set up:
... so it doesn't look like Metal messed with our signal handlers. |
Another breadcrumb; it doesn't seem like Metal is resetting our signal handlers, as per above but also because I couldn't reproduce that in isolation:
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <pthread.h>
#import <signal.h>
#import <setjmp.h>
static jmp_buf return_to_this_point;
void segfault_sigaction(int signal, siginfo_t *si, void *arg)
{
    NSLog(@"Caught segfault at address %p", si->si_addr);
    longjmp(return_to_this_point, 1);
}
void install_segfault_handler() {
    struct sigaction sa;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = segfault_sigaction;
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        NSLog(@"sigaction");
    if (sigaction(SIGBUS, &sa, NULL) == -1)
        NSLog(@"sigaction");
}
@interface MetalHandler : NSObject
@property (nonatomic, strong) id<MTLDevice> device;
@property (nonatomic, strong) id<MTLCommandQueue> commandQueue;
@property (nonatomic, strong) NSMutableDictionary<NSNumber *, NSNull *> *threadsSeen;
@end
@implementation MetalHandler
- (instancetype)init
{
    self = [super init];
    if (self) {
        _device = MTLCreateSystemDefaultDevice();
        if (!_device) {
            NSLog(@"Metal is not supported on this device");
            return nil;
        }
        _commandQueue = [_device newCommandQueue];
        _threadsSeen = [[NSMutableDictionary alloc] init];
    }
    return self;
}
- (void)launchCommandBuffer {
    pthread_t current_thread = pthread_self();
    uint64_t current_tid;
    pthread_threadid_np(NULL, &current_tid);
    NSLog(@"Launching command buffer from thread: %llu", current_tid);
    id<MTLCommandBuffer> commandBuffer = [self.commandQueue commandBuffer];
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> _Nonnull cmdBuffer) {
        pthread_t thread = pthread_self();
        uint64_t tid;
        pthread_threadid_np(NULL, &tid);
        NSLog(@"Completed command buffer in thread: %llu", tid);
        if (!self.threadsSeen[@(tid)]) {
            NSLog(@"New thread detected, installing segfault handler.");
            install_segfault_handler();
            self.threadsSeen[@(tid)] = [NSNull null];
        }
        // Trigger segfault
        NSLog(@"Triggering segfault");
        if (!setjmp(return_to_this_point)) {
            int *foo = (int*)-1; // make a bad pointer
            NSLog(@"%d", *foo); // causes segfault
        } else {
            NSLog(@"Jumped back to safety");
        }
    }];
    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}
@end
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        MetalHandler *handler = [[MetalHandler alloc] init];
        for (int i = 0; i < 300; ++i) {
            [handler launchCommandBuffer];
        }
    }
    return 0;
} |
Oh wait, I'm debugging this all wrong; I'm probably catching the signal here before it was even handled by our handlers, and the bug is actually with the handler (reporting an exception instead of treating it as a safepoint). EDIT: apparently on macOS the signal handler happens first (or at least |
We're triggering this: https://github.com/JuliaLang/julia/blob/ec8df3da3597d0acd503ff85ac84a5f8f73f625b/src/signals-mach.c#L287-L293 The adopted thread is actually found in |
Alright, I think I'm getting to the root of this. It looks like our task list has multiple entries with the same system ID, so that messes up the matching during signal handling:
|
On macOS, we use the system thread ID to match against the list of known thread local states during signal handling. To prevent picking up the wrong entry, i.e. from when a thread was previously executing a different task, make sure to wipe the system ID when a thread exits. This manifested as the signal handler actually reporting a bus error when a thread touched safepoint memory during GC, because the matched thread local state had no current task attached to it. Fixes JuliaGPU/Metal.jl#225
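A toy model of the mismatch described in that commit message, in plain Julia. It is not the runtime's code: `ToyTLS`, `find_tls`, and `exit_thread!` are hypothetical names, and the real matching happens in the C signal-handling code linked earlier. It only illustrates why a first-match lookup by system thread ID can land on an exited entry with no current task, and why wiping the ID on exit avoids that.

```julia
# Toy stand-in for a thread-local state entry (hypothetical, not the runtime's struct).
mutable struct ToyTLS
    system_id::UInt64                     # OS thread ID; 0 means "no thread attached"
    current_task::Union{Symbol,Nothing}   # nothing once the thread has exited
end

# First-match lookup by OS thread ID, roughly what a signal handler would do.
function find_tls(states, tid)
    idx = findfirst(s -> s.system_id == tid, states)
    return idx === nothing ? nothing : states[idx]
end

# Thread exit: the task is always dropped; the system ID is cleared only if `wipe` is set.
function exit_thread!(s::ToyTLS; wipe::Bool)
    s.current_task = nothing
    wipe && (s.system_id = UInt64(0))
    return s
end

tid = UInt64(0x1234)                  # made-up system ID that appears in two entries
live = ToyTLS(tid, :callback_task)    # the entry the handler should find

# Without the wipe, the exited entry is matched first and has no task.
stale = exit_thread!(ToyTLS(tid, :old_task); wipe = false)
buggy = find_tls([stale, live], tid)

# With the wipe, the exited entry no longer matches and the live one is found.
stale_fixed = exit_thread!(ToyTLS(tid, :old_task); wipe = true)
fixed = find_tls([stale_fixed, live], tid)

println(buggy.current_task)   # nothing
println(fixed.current_task)   # callback_task
```

In this model the `nothing` result corresponds to the case where the handler reports a bus error instead of treating the fault as a safepoint.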
Great sleuthing. |
For example, https://buildkite.com/julialang/metal-dot-jl/builds/630#0189615c-46ea-4cb7-8b2c-32d5c4ecea1f. This happens on 1.9.2. Maybe related to #138?