Realm: gex CRC MISMATCH #1779

Open
syamajala opened this issue Oct 21, 2024 · 13 comments

Comments

@syamajala
Contributor

I'm seeing the following error at shutdown when running cunumeric on Perlmutter:

[1 - 7f3eef258740]  194.063976 {6}{gex}: CRC MISMATCH: arg0=54 header_size=36 payload_size=10976 exp=fc88a3b5 act=d1f0646a

@eddy16112
Contributor

Could you please get a backtrace?

@syamajala
Contributor Author

I can't seem to get backtraces with a debug build and GASNET_BACKTRACE=1; it just says "GASNet abnormal exit".

@eddy16112
Contributor

Could you please try a RelWithDebInfo build? I would like to see which message triggers the error.

@syamajala
Contributor Author

Here is a stacktrace in debug: http://sapling2.stanford.edu/~seshu/xcsl1028423/backtrace.txt

@qldnfox

qldnfox commented Oct 22, 2024

It says 403 Forbidden for me. Can I get access?

@syamajala
Contributor Author

Try it now.

@syamajala
Contributor Author

This is probably the relevant stack trace:

[5] Thread 70 (Thread 0x7fa2f0ebb740 (LWP 1499908) "python"):
[5] #0  0x00007fa44f15dbbf in wait4 () from /lib64/libc.so.6
[5] #1  0x00007fa44f0d4c37 in do_system () from /lib64/libc.so.6
[5] #2  0x00007fa3a7426964 in gasneti_system_redirected () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #3  0x00007fa3a7426fb7 in gasneti_bt_gdb () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #4  0x00007fa3a742a95e in gasneti_print_backtrace () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #5  0x00007fa3a63efe45 in gasneti_defaultSignalHandler () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[5] #6  <signal handler called>
[5] #7  0x00007fa44f0c6d2b in raise () from /lib64/libc.so.6
[5] #8  0x00007fa44f0c83e5 in abort () from /lib64/libc.so.6
[5] #9  0x00007fa3a6a20bc3 in Realm::XmitSrcDestPair::reserve_pbuf_inline (this=0x559a6ba88f00, hdr_bytes=36, payload_bytes=10976, overflow_ok=true, pktbuf=@0x559a6bac94f8: 0x0, pktidx=@0x559a6bac9500: -1, hdr_base=@0x7fa2f0eb8548: 0x7fa2f0eb85c0, payload_base=@0x7fa2f0eb8550: 0x0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_internal.cc:1258
[5] #10 0x00007fa3a6a2adf7 in Realm::GASNetEXInternal::prepare_message (this=0x559a66cefff0, target=0, target_ep_index=0, msgid=54, header_base=@0x7fa2f0eb8548: 0x7fa2f0eb85c0, header_size=36, payload_base=@0x7fa2f0eb8550: 0x0, payload_size=10976, dest_payload_addr=0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_internal.cc:3927
[5] #11 0x00007fa3a6a1a647 in Realm::GASNetEXMessageImpl::GASNetEXMessageImpl (this=0x7fa2f0eb8540, _internal=0x559a66cefff0, _target=0, _msgid=54, _header_size=36, _max_payload_size=10976, _src_payload_addr=0x0, _src_payload_lines=0, _src_payload_line_stride=0, _dest_payload_addr=0, _dest_ep_index=0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_module.cc:221
[5] #12 0x00007fa3a6a1c547 in Realm::GASNetEXModule::create_active_message_impl (this=0x559a6662e6f0, target=0, msgid=54, header_size=36, max_payload_size=10976, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7fa2f0eb8540, storage_size=256) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/gasnetex/gasnetex_module.cc:670
[5] #13 0x00007fa3a6418fb5 in Realm::Network::create_active_message_impl (target=0, msgid=54, header_size=32, max_payload_size=10976, src_payload_addr=0x0, src_payload_lines=0, src_payload_line_stride=0, storage_base=0x7fa2f0eb8540, storage_size=256) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/network.inl:100
[5] #14 0x00007fa3a67f9986 in Realm::ActiveMessage<Realm::BarrierTriggerMessage, 256ul>::init (this=0x7fa2f0eb8520, _target=0, _max_payload_size=10976) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.inl:53
[5] #15 0x00007fa3a67f7aec in Realm::ActiveMessage<Realm::BarrierTriggerMessage, 256ul>::ActiveMessage (this=0x7fa2f0eb8520, _target=0, _max_payload_size=10976) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.inl:44
[5] #16 0x00007fa3a67f2eba in Realm::BarrierTriggerMessage::send_request (target=0, barrier_id=2305930970166984704, trigger_gen=382, previous_gen=39, first_generation=0, redop_id=1048576, migration_target=-1, base_arrival_count=6, data=0x7f7fe45bc760, datalen=10976) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/barrier_impl.cc:288
[5] #17 0x00007fa3a67f4825 in Realm::BarrierImpl::adjust_arrival (this=0x7f81ac043dd0, barrier_gen=40, delta=-1, timestamp=0, wait_on=..., sender=0, forwarded=false, reduce_value=0x7f824a237d80, reduce_value_size=32, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/barrier_impl.cc:688
[5] #18 0x00007fa3a67f2c09 in Realm::BarrierAdjustMessage::handle_message (sender=0, args=..., data=0x7f824a237d80, datalen=32, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/barrier_impl.cc:249
[5] #19 0x00007fa3a67f694c in Realm::HandlerWrappers::wrap_handler<Realm::BarrierAdjustMessage, Realm::BarrierAdjustMessage::handle_message> (sender=0, header=0x7f824a237d50, payload=0x7f824a237d80, payload_size=32, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.inl:620
[5] #20 0x00007fa3a6a38dc7 in Realm::IncomingMessageManager::do_work (this=0x559a67f48a20, work_until=...) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/activemsg.cc:740
[5] #21 0x00007fa3a67d6663 in Realm::BackgroundWorkManager::Worker::do_work (this=0x7fa2f0eba0d0, max_time_in_ns=-1, interrupt_flag=0x0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/bgwork.cc:600
[5] #22 0x00007fa3a67d4301 in Realm::BackgroundWorkThread::main_loop (this=0x559a6b84b910) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/bgwork.cc:103
[5] #23 0x00007fa3a67d7902 in Realm::Thread::thread_entry_wrapper<Realm::BackgroundWorkThread, &Realm::BackgroundWorkThread::main_loop> (obj=0x559a6b84b910) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/threads.inl:97
[5] #24 0x00007fa3a6986a21 in Realm::KernelThread::pthread_entry (data=0x559a664d9dc0) at /pscratch/sd/s/seshu/cunumeric_build/legion/runtime/realm/threads.cc:854
[5] #25 0x00007fa44f3d46ea in start_thread () from /lib64/libpthread.so.0
[5] #26 0x00007fa44f19449f in clone () from /lib64/libc.so.6

@qldnfox

qldnfox commented Oct 22, 2024

worked! thanks Seshu

@lightsighter
Contributor

The backtrace doesn't look like it is from a CRC check. It looks like it is coming from this error message:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L1252-1259

That's probably a bug in Realm: it is trying to send a medium active message where it needs to switch to sending a long active message. It probably also explains the CRC check failure on the far side, because only some of the payload makes it across in release mode.
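
To make the medium-versus-long distinction concrete, here is a minimal sketch of a size-based choice between the two message categories. It is not Realm's actual logic: the function name `choose_am_category` is hypothetical, and a real runtime also has to consider things this ignores, such as whether a destination address is available for a long message. The numbers come from the backtrace and log output in this thread.

```python
def choose_am_category(payload_bytes: int, max_medium: int) -> str:
    """Pick an active-message category from the payload size alone.

    Hypothetical helper for illustration only; it is not part of Realm
    or GASNet and ignores every constraint other than payload size.
    """
    if payload_bytes < 0 or max_medium < 0:
        raise ValueError("sizes must be non-negative")
    # A payload that fits under the medium limit can go as a medium AM;
    # anything larger has to use a long AM instead.
    return "medium" if payload_bytes <= max_medium else "long"


# Values reported in this issue: payload of 10976 bytes, medium limit of 8192.
print(choose_am_category(10976, 8192))  # long
```

With the values from this issue, the payload is past the medium limit, which matches the abort seen in `reserve_pbuf_inline` in the debug build.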

@eddy16112
Contributor

I think it is the GASNet version of issue #1769. I am surprised that the limit is only 8K:

[5] [5 - 7fa2f0ebb740]  543.639675 {6}{gexxpair}: medium payload too large!  src=5/0 tgt=0/0 max=8192 act=10976
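
For anyone scanning their own logs for the same symptom, a small sketch that pulls the limit and the actual payload size out of a line like the one above. The regular expression is based only on the single log line shown here, so the assumption is that other occurrences use the same `max=... act=...` layout; the function name `parse_medium_overflow` is hypothetical.

```python
import re

# The log line quoted in this thread.
LOG_LINE = (
    "[5] [5 - 7fa2f0ebb740]  543.639675 {6}{gexxpair}: "
    "medium payload too large!  src=5/0 tgt=0/0 max=8192 act=10976"
)


def parse_medium_overflow(line):
    """Return (limit, actual) in bytes, or None if the line does not match."""
    match = re.search(
        r"medium payload too large!.*?\bmax=(\d+)\s+act=(\d+)", line
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


limit, actual = parse_medium_overflow(LOG_LINE)
print(f"limit={limit} actual={actual} over_by={actual - limit}")
# limit=8192 actual=10976 over_by=2784
```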

@lightsighter
Contributor

I agree that is a likely cause of the problem.

@elliottslaughter
Contributor

The medium AM size can be set with GASNET_OFI_MAX_MEDIUM if you are on the ofi conduit: https://gasnet.lbl.gov/dist-ex/ofi-conduit/README

Don't even need to rebuild, it's just an environment variable.
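
A minimal sketch of setting the variable from a Python launcher script, assuming the ofi conduit. `GASNET_OFI_MAX_MEDIUM` is the variable named above; everything else here is an assumption: the helper name `env_with_max_medium` is hypothetical, and rounding up to a power of two is only a convenient way to pick a value above the 10976-byte payload, not something GASNet is known to require. Check the ofi-conduit README for the exact meaning and allowed range of the value.

```python
import os


def env_with_max_medium(required_bytes, base_env=None):
    """Return a copy of the environment with GASNET_OFI_MAX_MEDIUM set.

    The value is the smallest power of two that is >= required_bytes.
    The power-of-two rounding is an assumption made for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    size = 1
    while size < required_bytes:
        size *= 2
    env["GASNET_OFI_MAX_MEDIUM"] = str(size)
    return env


# 10976 is the payload size from the failing BarrierTriggerMessage above.
env = env_with_max_medium(10976, base_env={})
print(env["GASNET_OFI_MAX_MEDIUM"])  # 16384
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` together with whatever launch command the job normally uses; the launch command is site-specific and is not shown here.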

We've seen this issue before: #1449 and #1229 are both variations on this same issue.

@syamajala
Contributor Author

The CRC error goes away in release when GASNET_OFI_MAX_MEDIUM is set, but it's still not shutting down cleanly.

I just see "GASNet abnormal exit". Here is a stack trace:

[13] Thread 1 (Thread 0x7f5ad3659740 (LWP 1843020) "python"):
[13] #0  0x00007f5ad373dbbf in wait4 () from /lib64/libc.so.6
[13] #1  0x00007f5ad36b4c37 in do_system () from /lib64/libc.so.6
[13] #2  0x00007f5a2f0ba654 in gasneti_system_redirected () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #3  0x00007f5a2f0baca7 in gasneti_bt_gdb () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #4  0x00007f5a2f0be64e in gasneti_print_backtrace () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #5  0x00007f5a2e2cca96 in gasneti_error_abort () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #6  0x00007f5a2e2ccb87 in _gasneti_fatalerror () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #7  0x00007f5a2f0b49b2 in gasnetc_ofi_tx_poll () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #8  0x00007f5a2f0b4aec in gasnetc_ofi_poll () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #9  0x00007f5a2f0aa8e0 in gasnetc_AMPoll () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #10 0x00007f5a2f0aadbf in gasnetc_exit () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #11 0x00007f5a2e2d70f1 in gasneti_defaultSignalHandler () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #12 <signal handler called>
[13] #13 0x00007f5ad39c076b in raise () from /lib64/libpthread.so.0
[13] #14 <signal handler called>
[13] #15 0x00007f5ad376d759 in syscall () from /lib64/libc.so.6
[13] #16 0x00007f5a2e04c493 in ofi_intercept_munmap (start=0x7f34c8000000, length=51539607552) at prov/util/src/util_mem_hooks.c:547
[13] #17 0x00007f5a2e6c2d46 in Realm::SharedMemoryInfo::~SharedMemoryInfo() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #18 0x00007f5a2e678011 in Realm::RuntimeImpl::~RuntimeImpl() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #19 0x00007f5a2e67d061 in Realm::Runtime::wait_for_shutdown() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../librealm.so.1
[13] #20 0x00007f5a30279aed in Legion::Internal::Runtime::wait_for_shutdown() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegion.so.1
[13] #21 0x00007f5a317379be in legate::detail::Runtime::finish() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegate.so.24.09.00
[13] #22 0x00007f5a3171358d in legate::finish() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegate.so.24.09.00
[13] #23 0x00007f5a16befd62 in __pyx_f_6legate_4_lib_7runtime_7runtime_7Runtime_finish(__pyx_obj_6legate_4_lib_7runtime_7runtime_Runtime*, int) () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
[13] #24 0x00007f5a16bf663e in __pyx_f_6legate_4_lib_7runtime_7runtime__cleanup_legate_runtime() () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
[13] #25 0x00007f5a16bf0e9b in __pyx_pw_11cfunc_dot_to_py_71__Pyx_CFunc_6legate_4_lib_7runtime_7runtime_void__lParen__rParen_to_py__1wrap(_object*, _object*) () from /pscratch/sd/s/seshu/cunumeric/lib/python3.12/site-packages/legate/_lib/runtime/runtime.cpython-312-x86_64-linux-gnu.so
[13] #26 0x000055fc811e3f9c in atexit_callfuncs (state=0x55fc815261a8 <_PyRuntime+79656>) at /usr/local/src/conda/python-3.12.7/Modules/atexitmodule.c:137
[13] #27 0x000055fc811d18c3 in _PyAtExit_Call (interp=<optimized out>) at /usr/local/src/conda/python-3.12.7/Modules/atexitmodule.c:157
[13] #28 Py_FinalizeEx () at /usr/local/src/conda/python-3.12.7/Python/pylifecycle.c:1918
[13] #29 0x000055fc811dfd40 in Py_RunMain () at /usr/local/src/conda/python-3.12.7/Modules/main.c:715
[13] #30 0x000055fc8119a067 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.12.7/Modules/main.c:767
[13] #31 0x00007f5ad369124d in __libc_start_main () from /lib64/libc.so.6
[13] #32 0x000055fc81199f11 in _start ()
[13] [Inferior 1 (process 1843020) detached]

Will try to get a stacktrace in debug.
