
Crash during MTLDispatchListApply #225

Closed
maleadt opened this issue Jul 17, 2023 · 14 comments · Fixed by JuliaLang/julia#50871

maleadt commented Jul 17, 2023

For example, https://buildkite.com/julialang/metal-dot-jl/builds/630#0189615c-46ea-4cb7-8b2c-32d5c4ecea1f. This happens on 1.9.2. Maybe related to #138?

maleadt added the bug label Jul 17, 2023

maleadt commented Jul 31, 2023

This looks GC-related. An MWE that somewhat reliably triggers the issue (after running the @benchmark multiple times):

using Test, Metal, BenchmarkTools

function vadd(a, b, c)
    i = thread_position_in_grid_1d()
    @inbounds c[i] = a[i] + b[i]
    return
end

dims = (3,4)
a = round.(rand(Float32, dims) * 100)
b = round.(rand(Float32, dims) * 100)
c = similar(a)

d_a = MtlArray(a)
d_b = MtlArray(b)
d_c = MtlArray(c)

len = prod(dims)

f() = @metal threads=len vadd(d_a, d_b, d_c)
@benchmark f()

With GC.enable(false), it doesn't seem to reproduce.


maleadt commented Jul 31, 2023

Doesn't crash on 1.8, so I wonder if it's also related to thread adoption.
(EDIT: well, @objcblock is only used on 1.9+, so it may also not be)
(EDIT 2: it also crashes on 1.8)

FWIW, the backtrace:

(lldb) bt
* thread #13, queue = 'com.Metal.CompletionQueueDispatch', stop reason = EXC_BAD_ACCESS (code=2, address=0x1000e8008)
  * frame #0: 0x000000015d05c208
    frame #1: 0x00000001af949730 Metal`-[_MTLCommandBuffer didCompleteWithStartTime:endTime:error:] + 524
    frame #2: 0x00000001c31fecc4 IOGPU`-[IOGPUMetalCommandBuffer didCompleteWithStartTime:endTime:error:] + 220
    frame #3: 0x00000001af9493c8 Metal`-[_MTLCommandQueue commandBufferDidComplete:startTime:completionTime:error:] + 108
    frame #4: 0x00000001c3208190 IOGPU`IOGPUNotificationQueueDispatchAvailableCompletionNotifications + 128
    frame #5: 0x00000001c320829c IOGPU`__IOGPUNotificationQueueSetDispatchQueue_block_invoke + 64
    frame #6: 0x00000001a63304c0 libdispatch.dylib`_dispatch_client_callout4 + 20
    frame #7: 0x00000001a634ced8 libdispatch.dylib`_dispatch_mach_msg_invoke + 468
    frame #8: 0x00000001a6337960 libdispatch.dylib`_dispatch_lane_serial_drain + 372
    frame #9: 0x00000001a634dc24 libdispatch.dylib`_dispatch_mach_invoke + 448
    frame #10: 0x00000001a6337960 libdispatch.dylib`_dispatch_lane_serial_drain + 372
    frame #11: 0x00000001a633862c libdispatch.dylib`_dispatch_lane_invoke + 436
    frame #12: 0x00000001a6337960 libdispatch.dylib`_dispatch_lane_serial_drain + 372
    frame #13: 0x00000001a63385f8 libdispatch.dylib`_dispatch_lane_invoke + 384
    frame #14: 0x00000001a6343244 libdispatch.dylib`_dispatch_workloop_worker_thread + 648
    frame #15: 0x00000001a64dc074 libsystem_pthread.dylib`_pthread_wqthread + 288

What I think is the function containing the faulting instruction:

    0x15d05c1b8: stp    x24, x23, [sp, #-0x40]!
    0x15d05c1bc: stp    x22, x21, [sp, #0x10]
    0x15d05c1c0: stp    x20, x19, [sp, #0x20]
    0x15d05c1c4: stp    x29, x30, [sp, #0x30]
    0x15d05c1c8: mov    x19, x2
    0x15d05c1cc: mov    x20, x1
    0x15d05c1d0: mov    x22, x0
    0x15d05c1d4: mov    x8, #0xae10
    0x15d05c1d8: movk   x8, #0xa64d, lsl #16
    0x15d05c1dc: movk   x8, #0x1, lsl #32
    0x15d05c1e0: mov    w0, #0x105
    0x15d05c1e4: blr    x8
    0x15d05c1e8: cbz    x0, 0x15d05c290
    0x15d05c1ec: mov    x21, x0
    0x15d05c1f0: ldr    x8, [x0, #0x10]
    0x15d05c1f4: add    x9, x8, #0x19
    0x15d05c1f8: ldrb   w23, [x8, #0x19]
    0x15d05c1fc: stlrb  wzr, [x9]
    0x15d05c200: cbz    w23, 0x15d05c20c
    0x15d05c204: ldr    x8, [x8, #0x10]
->  0x15d05c208: ldr    xzr, [x8]
    0x15d05c20c: ldr    x24, [x21, #0x8]
    0x15d05c210: adrp   x8, 0
    0x15d05c214: ldr    x8, [x8, #0x820]
    0x15d05c218: ldar   x8, [x8]
    0x15d05c21c: mov    x9, #0x5ad8
    0x15d05c220: movk   x9, #0xdaa, lsl #16
    0x15d05c224: movk   x9, #0x1, lsl #32
    0x15d05c228: ldr    x9, [x9]
    0x15d05c22c: str    x8, [x21, #0x8]
    0x15d05c230: adrp   x10, 0
    0x15d05c234: ldr    x10, [x10, #0x7c8]
    0x15d05c238: adrp   x11, 0
    0x15d05c23c: ldr    x11, [x11, #0x7b0]
    0x15d05c240: cmp    x9, x8
    0x15d05c244: csel   x8, x11, x10, lo
    0x15d05c248: mov    x0, x22
    0x15d05c24c: mov    x1, x20
    0x15d05c250: mov    x2, x19
    0x15d05c254: blr    x8
    0x15d05c258: ldur   x8, [x0, #-0x8]
    0x15d05c25c: mov    w9, #0xc3dc
    0x15d05c260: movk   w9, #0x129a, lsl #16
    0x15d05c264: cmp    x9, x8, lsr #4
    0x15d05c268: b.ne   0x15d05c2a8
    0x15d05c26c: str    x24, [x21, #0x8]
    0x15d05c270: ldr    x8, [x21, #0x10]
    0x15d05c274: add    x8, x8, #0x19
    0x15d05c278: stlrb  w23, [x8]
    0x15d05c27c: ldp    x29, x30, [sp, #0x30]
    0x15d05c280: ldp    x20, x19, [sp, #0x20]
    0x15d05c284: ldp    x22, x21, [sp, #0x10]
    0x15d05c288: ldp    x24, x23, [sp], #0x40
    0x15d05c28c: ret

Sadly, we don't have a way to map this back to the Julia function (@vchuravy why again doesn't this work on macOS?)

(lldb) call (void)jl_gdblookup(0x15d05c1b8)
unknown function (ip: 0x15d05c1b8)
(lldb) call (void)jl_gdblookup(0x000000015d05c208)
unknown function (ip: 0x15d05c208)

I tried manually rooting all arguments that involve the onCompletion callback (i.e., the ObjectiveC block, arguments, closures, etc), to no avail.


maleadt commented Jul 31, 2023

The prologue in there:

 mov    x8, #0xae10
 movk   x8, #0xa64d, lsl #16
 movk   x8, #0x1, lsl #32
 mov    w0, #0x105

... or

mov     x8, #44560
movk    x8, #42573, lsl #16
movk    x8, #1, lsl #32
mov     w0, #261

... seems to match (44560 + 42573<<16 + 1<<32 == 7085075984) that of a Julia function:

%2 = call {}*** inttoptr (i64 7085075984 to {}*** (i64)*)(i64 261) #6
julia> ccall(:jl_gdblookup, Nothing, (Ptr{Cvoid},), Ptr{Cvoid}(7085075984))
pthread_getspecific at /usr/lib/system/libsystem_pthread.dylib (unknown line)

... where pthread_getspecific is called on macOS for get_pgcstack.

So this definitely seems like a Julia function we're crashing in. It's annoying that I can't look up which one; that would help tremendously.
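For context, pthread_getspecific is the standard pthreads TLS accessor, which is how Julia reaches its per-thread state on macOS. A minimal sketch of the pattern, using a portable, dynamically created key rather than the fixed macOS slot (0x105) seen in the disassembly above:

```c
#include <pthread.h>
#include <stddef.h>

/* Sketch, not Julia's actual implementation: a thread stores a pointer to
 * its per-thread state under a pthread TLS key, and compiled code fetches
 * it back via pthread_getspecific (the call at the function prologue). */

static pthread_key_t pgcstack_key;

static void *get_pgcstack(void) {
    return pthread_getspecific(pgcstack_key);
}

int demo(void) {
    int state = 42;                       /* stand-in for real per-thread state */
    pthread_key_create(&pgcstack_key, NULL);
    pthread_setspecific(pgcstack_key, &state);
    return *(int *)get_pgcstack();        /* reads back 42 */
}
```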


maleadt commented Aug 1, 2023

Lo and behold, I got a backtrace (not sure how; can't reproduce):

[72576] signal (10.1): Bus error: 10
in expression starting at REPL[14]:1
getproperty at /Users/tim/Julia/pkg/ObjectiveC/src/syntax.jl:474
unsafe_convert at /Users/tim/Julia/pkg/ObjectiveC/src/syntax.jl:272 [inlined]
macro expansion at /Users/tim/Julia/pkg/ObjectiveC/src/syntax.jl:163 [inlined]
release at /Users/tim/Julia/pkg/ObjectiveC/src/foundation.jl:42 [inlined]
macro expansion at /Users/tim/Julia/pkg/Metal/src/pool.jl:81 [inlined]
macro expansion at ./timing.jl:395 [inlined]
free at /Users/tim/Julia/pkg/Metal/src/pool.jl:80
unknown function (ip: 0x1311c0203)
_jl_invoke at /Users/tim/Julia/src/julia/src/gf.c:2870
ijl_apply_generic at /Users/tim/Julia/src/julia/src/gf.c:3071
foreach at ./abstractarray.jl:3085
#57 at /Users/tim/Julia/pkg/Metal/src/compiler/execution.jl:247 [inlined]
g at /Users/tim/Julia/pkg/Metal/lib/mtl/command_buf.jl:188
unknown function (ip: 0x16faa05d3)
_jl_invoke at /Users/tim/Julia/src/julia/src/gf.c:2870
ijl_apply_generic at /Users/tim/Julia/src/julia/src/gf.c:3071
julia_block_trampoline at /Users/tim/Julia/pkg/ObjectiveC/src/blocks.jl:103
unknown function (ip: 0x16f954267)
MTLDispatchListApply at /System/Library/Frameworks/Metal.framework/Versions/A/Metal (unknown line)
Allocations: 41513105 (Pool: 41333891; Big: 179214); GC: 49
ERROR: Exception handler triggered on unmanaged thread.
zsh: bus error  ./julia --pkgimages=no

That getproperty function is pretty complicated, and performs allocations. So maybe it hits a safepoint again?

EDIT: the reason I got a backtrace here is probably that most of the time this crashes during the safepoint load in the cfunction trampoline, but in some cases it happens during a later safepoint load, when executing regular Julia code that has more debug info.
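For readers following along, the safepoint mechanism being discussed can be sketched roughly as below. Assumptions: POSIX mmap/mprotect, and a simplified handler that longjmps out instead of parking the thread and resuming it after GC, which is not what Julia's real handler does:

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

static sigjmp_buf at_safepoint;

static void on_fault(int sig) {
    (void)sig;
    siglongjmp(at_safepoint, 1);    /* simplified: bail out of the load */
}

int demo(void) {
    /* The safepoint is just a dedicated page that every GC-safe point
     * loads from. */
    volatile size_t *safepoint = mmap(NULL, 4096, PROT_READ,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (safepoint == MAP_FAILED)
        return -1;

    struct sigaction sa = {0};
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);   /* macOS reports this fault as SIGBUS */

    (void)*safepoint;               /* page readable: the load is nearly free */

    /* To stop the world, the GC revokes access, so the next safepoint
     * load faults and the fault handler takes over. */
    mprotect((void *)safepoint, 4096, PROT_NONE);
    if (sigsetjmp(at_safepoint, 1) == 0) {
        (void)*safepoint;           /* faults now ... */
        return 0;                   /* ... so this is never reached */
    }
    return 1;                       /* "caught" at the safepoint */
}
```

The crash in this issue is exactly this fault being delivered on a thread the handler fails to recognize, so it gets reported as a real bus error instead of being treated as a safepoint.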


maleadt commented Aug 1, 2023

Also encountered this:

ERROR: Exception handler triggered on unmanaged thread.

[3482] signal (10.1): Bus error: 10
in expression starting at REPL[14]:1
jl_gc_state_set at /Users/tim/Julia/src/julia/src/julia_threads.h:351 [inlined]
jl_gc_state_save_and_set at /Users/tim/Julia/src/julia/src/julia_threads.h:357 [inlined]
jl_delete_thread at /Users/tim/Julia/src/julia/src/threading.c:447
_pthread_tsd_cleanup at /usr/lib/system/libsystem_pthread.dylib (unknown line)
Allocations: 73438743 (Pool: 73420590; Big: 18153); GC: 36

That's on release-1.10, where we try to jl_gc_unsafe_enter during jl_delete_thread. I'm confused why that crashes; in every scenario, that thread should have signal handlers set up to deal with the safepoint exception?


maleadt commented Aug 1, 2023

OK yeah this is just the safepoint segfault not being caught by our handlers:

(lldb) bt
* thread #15, queue = 'com.Metal.CompletionQueueDispatch', stop reason = EXC_BAD_ACCESS (code=2, address=0x100208008)

(lldb) call ((jl_ptls_t)jl_get_ptls_states())->safepoint
(volatile size_t *) $12 = 0x0000000100208008

Adding some instrumentation to jl_adopt_thread and jl_delete_thread, it looks like we're segfaulting on an adopted thread:

adopting thread 9363502
ERROR: Exception handler triggered on unmanaged thread.

(lldb) expr unsigned long long tid; (void)pthread_threadid_np(NULL, &tid); tid
(unsigned long long) $1 = 9363502

It also looks like the GC is running at the time of the fault:

(lldb) p jl_gc_disable_counter
(uint32_t) $5 = 0
(lldb) p jl_gc_running
(uint32_t) $6 = 1
(lldb) call ((jl_ptls_t)jl_get_ptls_states())->gc_state
(int8_t) $9 = '\0'

All the other threads have reached a safepoint, but we haven't:

(lldb) p gc_n_threads
(int) $20 = 8

(lldb) p gc_all_tls_states[0]->gc_state
(int8_t) $13 = '\x01'
(lldb) p gc_all_tls_states[1]->gc_state
(int8_t) $14 = '\x02'
(lldb) p gc_all_tls_states[2]->gc_state
(int8_t) $15 = '\x02'
(lldb) p gc_all_tls_states[3]->gc_state
(int8_t) $16 = '\x02'
(lldb) p gc_all_tls_states[4]->gc_state
(int8_t) $17 = '\x02'
(lldb) p gc_all_tls_states[5]->gc_state
(int8_t) $18 = '\x02'
(lldb) p gc_all_tls_states[6]->gc_state
(int8_t) $19 = '\0'
(lldb) p gc_all_tls_states[7]->gc_state
(int8_t) $21 = '\x02'

(lldb) call ((jl_ptls_t)jl_get_ptls_states())->tid
(int16_t) $10 = 6

So I guess there's still something wrong with adopted threads and the safepoint. While JuliaLang/julia#49934 fixed the crash when jl_adopt_thread itself triggered a safepoint (now waiting for the GC to finish instead), subsequent code executing on the adopted thread is still susceptible to a visible crash when accessing the safepoint.

FWIW, the disassembly of this function:

(lldb) disassemble --start-address 0x168e341b8 --end-address 0x168e343d4
    0x168e341b8: stp    x24, x23, [sp, #-0x40]!
    0x168e341bc: stp    x22, x21, [sp, #0x10]
    0x168e341c0: stp    x20, x19, [sp, #0x20]
    0x168e341c4: stp    x29, x30, [sp, #0x30]
    0x168e341c8: mov    x19, x2
    0x168e341cc: mov    x21, x1
    0x168e341d0: mov    x22, x0
    0x168e341d4: mov    x8, #0xae10
    0x168e341d8: movk   x8, #0xa64d, lsl #16
    0x168e341dc: movk   x8, #0x1, lsl #32
    0x168e341e0: mov    w0, #0x105
    0x168e341e4: blr    x8
    0x168e341e8: cbz    x0, 0x168e34290
    0x168e341ec: mov    x20, x0
    0x168e341f0: ldr    x8, [x0, #0x10]
    0x168e341f4: add    x9, x8, #0x19
    0x168e341f8: ldrb   w23, [x8, #0x19]
    0x168e341fc: stlrb  wzr, [x9]
    0x168e34200: cbz    w23, 0x168e3420c
    0x168e34204: ldr    x8, [x8, #0x10]
->  0x168e34208: ldr    xzr, [x8]
    0x168e3420c: ldr    x24, [x20, #0x8]
    0x168e34210: adrp   x8, 0
    0x168e34214: ldr    x8, [x8, #0x770]
    0x168e34218: ldar   x8, [x8]
    0x168e3421c: mov    x9, #0x5138
    0x168e34220: movk   x9, #0x47f, lsl #16
    0x168e34224: movk   x9, #0x1, lsl #32
    0x168e34228: ldr    x9, [x9]
    0x168e3422c: str    x8, [x20, #0x8]
    0x168e34230: adrp   x10, 0
    0x168e34234: ldr    x10, [x10, #0x758]
    0x168e34238: adrp   x11, 0
    0x168e3423c: ldr    x11, [x11, #0x760]
    0x168e34240: cmp    x9, x8
    0x168e34244: csel   x8, x11, x10, lo
    0x168e34248: mov    x0, x22
    0x168e3424c: mov    x1, x21
    0x168e34250: mov    x2, x19
    0x168e34254: blr    x8
    0x168e34258: ldur   x8, [x0, #-0x8]
    0x168e3425c: mov    w9, #0xfe5e
    0x168e34260: movk   w9, #0x12ed, lsl #16
    0x168e34264: cmp    x9, x8, lsr #4
    0x168e34268: b.ne   0x168e342a8
    0x168e3426c: str    x24, [x20, #0x8]
    0x168e34270: ldr    x8, [x20, #0x10]
    0x168e34274: add    x8, x8, #0x19
    0x168e34278: stlrb  w23, [x8]
    0x168e3427c: ldp    x29, x30, [sp, #0x30]
    0x168e34280: ldp    x20, x19, [sp, #0x20]
    0x168e34284: ldp    x22, x21, [sp, #0x10]
    0x168e34288: ldp    x24, x23, [sp], #0x40
    0x168e3428c: ret
    0x168e34290: adrp   x8, 0
    0x168e34294: ldr    x8, [x8, #0x740]
    0x168e34298: blr    x8
    0x168e3429c: mov    x20, x0
    0x168e342a0: mov    w23, #0x2
    0x168e342a4: b      0x168e3420c
    0x168e342a8: mov    x2, x0
    0x168e342ac: adrp   x0, 0
    0x168e342b0: ldr    x0, [x0, #0x7a0]
    0x168e342b4: adrp   x8, 0
    0x168e342b8: ldr    x8, [x8, #0x788]
    0x168e342bc: mov    x1, #0xe5e0
    0x168e342c0: movk   x1, #0x2edf, lsl #16
    0x168e342c4: movk   x1, #0x1, lsl #32
    0x168e342c8: blr    x8
    0x168e342cc: sub    sp, sp, #0x80
    0x168e342d0: stp    x24, x23, [sp, #0x40]
    0x168e342d4: stp    x22, x21, [sp, #0x50]
    0x168e342d8: stp    x20, x19, [sp, #0x60]
    0x168e342dc: stp    x29, x30, [sp, #0x70]
    0x168e342e0: mov    x19, x2
    0x168e342e4: mov    x21, x1
    0x168e342e8: mov    x22, x0
    0x168e342ec: movi.2d v0, #0000000000000000
    0x168e342f0: str    q0, [sp]
    0x168e342f4: str    xzr, [sp, #0x20]
    0x168e342f8: str    q0, [sp, #0x10]
    0x168e342fc: mov    w8, #0xc
    0x168e34300: ldr    x9, [x20]
    0x168e34304: stp    x8, x9, [sp]
    0x168e34308: mov    x8, sp
    0x168e3430c: str    x8, [x20]
    0x168e34310: ldr    x0, [x20, #0x10]
    0x168e34314: adrp   x24, 0
    0x168e34318: ldr    x24, [x24, #0x7d0]
    0x168e3431c: mov    w1, #0x470
    0x168e34320: mov    w2, #0x10
    0x168e34324: blr    x24
    0x168e34328: mov    x23, x0
    0x168e3432c: mov    x8, #0x3e10
    0x168e34330: movk   x8, #0x5fd9, lsl #16
    0x168e34334: movk   x8, #0x1, lsl #32
    0x168e34338: stp    x8, x22, [x0, #-0x8]
    0x168e3433c: str    x0, [sp, #0x20]
    0x168e34340: ldr    x0, [x20, #0x10]
    0x168e34344: mov    w1, #0x470
    0x168e34348: mov    w2, #0x10
    0x168e3434c: blr    x24
    0x168e34350: mov    x22, x0
    0x168e34354: mov    x8, #0xd890
    0x168e34358: movk   x8, #0x3816, lsl #16
    0x168e3435c: movk   x8, #0x1, lsl #32
    0x168e34360: stp    x8, x21, [x0, #-0x8]
    0x168e34364: str    x0, [sp, #0x18]
    0x168e34368: ldr    x0, [x20, #0x10]
    0x168e3436c: mov    w1, #0x470
    0x168e34370: mov    w2, #0x10
    0x168e34374: blr    x24
    0x168e34378: mov    x8, #0x88a0
    0x168e3437c: movk   x8, #0x385b, lsl #16
    0x168e34380: movk   x8, #0x1, lsl #32
    0x168e34384: stp    x8, x19, [x0, #-0x8]
    0x168e34388: str    x0, [sp, #0x10]
    0x168e3438c: stp    x23, x22, [sp, #0x28]
    0x168e34390: str    x0, [sp, #0x38]
    0x168e34394: adrp   x8, 0
    0x168e34398: ldr    x8, [x8, #0x7c0]
    0x168e3439c: add    x1, sp, #0x28
    0x168e343a0: mov    x0, #0xbf30
    0x168e343a4: movk   x0, #0x3817, lsl #16
    0x168e343a8: movk   x0, #0x1, lsl #32
    0x168e343ac: mov    w2, #0x3
    0x168e343b0: blr    x8
    0x168e343b4: ldr    x8, [sp, #0x8]
    0x168e343b8: str    x8, [x20]
    0x168e343bc: ldp    x29, x30, [sp, #0x70]
    0x168e343c0: ldp    x20, x19, [sp, #0x60]
    0x168e343c4: ldp    x22, x21, [sp, #0x50]
    0x168e343c8: ldp    x24, x23, [sp, #0x40]
    0x168e343cc: add    sp, sp, #0x80
    0x168e343d0: ret

@vchuravy @vtjnash Any thoughts?


maleadt commented Aug 1, 2023

Can't seem to reproduce this in isolation though. I tried:

using pthreads

mutable struct ListNode
  key::Int64
  next::ListNode
  ListNode() = new()
  ListNode(x) = new(x)
  ListNode(x, y) = new(x, y)
end

function list(n=32)
    start::ListNode = ListNode(1)
    current::ListNode = start
    for i = 2:(n*1024^2)
        current = ListNode(i,current)
    end
    return current.key
end

function thread()
    println("Thread starts")
    ccall(:usleep, Cint, (Cint,), 100000)
    println("Alloc")
    arr = Array{Int}(undef, 1024*1024)
    GC.@preserve arr begin
        println("Thread done")
    end
    return 42
end

function main()
    # create some garbage
    x = list()
    GC.@preserve x begin
        println("Got garbage")

        @time GC.gc(true)

        thread = pthread() do
            println("Thread starts")
            ccall(:usleep, Cint, (Cint,), 100000)
            println("Alloc")
            arr = Array{Int}(undef, 1024)
            GC.@preserve arr begin
                println("Thread done")
            end
            return 42
        end
        sleep(0.1)

        println("Start GC")
        GC.gc(true)
        println("End GC")

        wait(thread)
    end
end

main()

... which triggers an allocation from an adopted thread while the GC is active:

Got garbage
  0.836207 seconds (100.00% gc time)
adopting thread 9435226
Thread starts
Start GC
Alloc
Thread done
End GC
deleting thread 9435226

... but doesn't crash.


maleadt commented Aug 1, 2023

The signal handler also still seems to be set up:

(lldb) expr struct sigaction act; (int)sigaction(10, NULL, &act); act
(struct sigaction) $14 = {
  __sigaction_u = {
    __sa_handler = 0x000000010187ca88 (libjulia-internal.1.10.0.dylib`segv_handler at signals-mach.c:237)
    __sa_sigaction = 0x000000010187ca88 (libjulia-internal.1.10.0.dylib`segv_handler at signals-mach.c:237)
  }
  sa_mask = 0
  sa_flags = 65  ( == SA_ONSTACK | SA_SIGINFO)
}

... so it doesn't look like Metal messed with our signal handlers.


maleadt commented Aug 1, 2023

Another breadcrumb: it doesn't seem like Metal is resetting our signal handlers, as per the above, but also because I couldn't reproduce that in isolation:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <pthread.h>
#import <signal.h>
#import <setjmp.h>

static jmp_buf return_to_this_point;

void segfault_sigaction(int signal, siginfo_t *si, void *arg)
{
    NSLog(@"Caught segfault at address %p", si->si_addr);
    longjmp(return_to_this_point, 1);
}

void install_segfault_handler() {
    struct sigaction sa;

    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = segfault_sigaction;

    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        NSLog(@"sigaction");
    if (sigaction(SIGBUS, &sa, NULL) == -1)
        NSLog(@"sigaction");
}

@interface MetalHandler : NSObject
@property (nonatomic, strong) id<MTLDevice> device;
@property (nonatomic, strong) id<MTLCommandQueue> commandQueue;
@property (nonatomic, strong) NSMutableDictionary<NSNumber *, NSNull *> *threadsSeen;
@end

@implementation MetalHandler
- (instancetype)init
{
    self = [super init];
    if (self) {
        _device = MTLCreateSystemDefaultDevice();
        if (!_device) {
            NSLog(@"Metal is not supported on this device");
            return nil;
        }
        _commandQueue = [_device newCommandQueue];
        _threadsSeen = [[NSMutableDictionary alloc] init];
    }
    return self;
}

- (void)launchCommandBuffer {
    pthread_t current_thread = pthread_self();
    uint64_t current_tid;
    pthread_threadid_np(NULL, &current_tid);

    NSLog(@"Launching command buffer from thread: %llu", current_tid);

    id<MTLCommandBuffer> commandBuffer = [self.commandQueue commandBuffer];
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> _Nonnull cmdBuffer) {
        pthread_t thread = pthread_self();
        uint64_t tid;
        pthread_threadid_np(NULL, &tid);
        NSLog(@"Completed command buffer in thread: %llu", tid);

        if (!self.threadsSeen[@(tid)]) {
            NSLog(@"New thread detected, installing segfault handler.");
            install_segfault_handler();
            self.threadsSeen[@(tid)] = [NSNull null];
        }

        // Trigger segfault
        NSLog(@"Triggering segfault");
        if (!setjmp(return_to_this_point)) {
            int *foo = (int*)-1; // make a bad pointer
            NSLog(@"%d", *foo); // causes segfault
        } else {
            NSLog(@"Jumped back to safety");
        }
    }];

    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}

@end

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        MetalHandler *handler = [[MetalHandler alloc] init];
        for (int i = 0; i < 300; ++i) {
            [handler launchCommandBuffer];
        }
    }
    return 0;
}


maleadt commented Aug 1, 2023

Oh wait, I'm debugging this all wrong; I'm probably catching the signal here before it was even delivered by our handlers, and the bug is actually with the handler (reporting an exception instead of treating it as a safepoint).

EDIT: apparently on macOS the signal handler happens first (or at least catch_mach_exception_raise does). But that doesn't mean the issue isn't with the handler.


maleadt commented Aug 1, 2023

We're triggering this: https://github.com/JuliaLang/julia/blob/ec8df3da3597d0acd503ff85ac84a5f8f73f625b/src/signals-mach.c#L287-L293

The adopted thread is actually found in jl_all_tls_states, but its current_task is not set. Simply relaxing that check (i.e., moving it down to where current_task is actually used, which is after safepoints are checked) doesn't seem to work, as Julia then hangs when waiting for threads to reach a safepoint during the next GC.


maleadt commented Aug 1, 2023

Alright, I think I'm getting to the root of this. It looks like our list of thread-local states has multiple entries with the same system ID, which messes up the matching during signal handling:

(lldb) p (int)pthread_mach_thread_np(jl_all_tls_states[3]->system_id)
(int) $23 = 21003
(lldb) p (int)pthread_mach_thread_np(jl_all_tls_states[8]->system_id)
(int) $22 = 21003
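To illustrate the failure mode (with a hypothetical data layout, not Julia's actual structs): signal handling matches the faulting thread against the TLS list by system thread ID, so if an exited thread's entry is not wiped and the OS reuses its ID, a linear scan finds the dead entry first, whose current_task is unset, and the handler reports a bus error instead of treating the fault as a safepoint.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. */
typedef struct {
    uint64_t system_id;
    void    *current_task;   /* NULL once the thread has exited */
} tls_state;

static int lookup(tls_state *all, int n, uint64_t id) {
    for (int i = 0; i < n; i++)
        if (all[i].system_id == id)
            return i;
    return -1;
}

int demo(void) {
    int task = 1;
    tls_state all[2] = {
        { 21003, NULL  },    /* exited thread; ID not wiped */
        { 21003, &task },    /* adopted thread reusing the same ID */
    };
    int hit = lookup(all, 2, 21003);            /* finds the dead entry */
    int stale = all[hit].current_task == NULL;  /* 1: wrong match */

    all[0].system_id = 0;                       /* the fix: wipe on exit */
    hit = lookup(all, 2, 21003);                /* now finds the live one */
    return stale && all[hit].current_task != NULL;
}
```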

maleadt added a commit to JuliaLang/julia that referenced this issue Aug 1, 2023
On macOS, we use the system thread ID to match against the list of known
thread local states during signal handling. To prevent picking up the
wrong entry, i.e. from when a thread was previously executing a
different task, make sure to wipe the system ID when a thread exits.

This manifested as the signal handler actually reporting a bus error
when a thread touched safepoint memory during GC, because the matched
thread local state had no current task attached to it.

Fixes JuliaGPU/Metal.jl#225

vchuravy commented Aug 1, 2023

Great sleuthing.

maleadt added a commit to JuliaLang/julia that referenced this issue Aug 10, 2023