
Freeze in S3D TDB on Marlowe #1788

Open

elliottslaughter opened this issue Oct 31, 2024 · 25 comments

@elliottslaughter
Contributor

I am running S3D's TDB branch on Marlowe and have encountered a freeze on 2 nodes.

Marlowe consists of a 1 SU (Scalable Unit) NVIDIA DGX H100 SuperPOD. One H100 Scalable Unit is 31 NVIDIA DGX H100 servers.
https://docs.marlowe.stanford.edu/specs

Note that this is an NVIDIA machine, so we're talking about an InfiniBand network. I have built Legion with GASNet-EX and the ibv conduit.

There are two sets of backtraces below, taken after the application was frozen for about 10 minutes (and 5 minutes between each set of backtraces):

  • /scratch/eslaught/marlowe_s3d_tdb_2024-10-31/DBO_Test_2/bt2-1
  • /scratch/eslaught/marlowe_s3d_tdb_2024-10-31/DBO_Test_2/bt2-2

Flags for this run include -ll:force_kthreads -lg:inorder 1 -lg:safe_ctrlrepl 1 -lg:no_tracing, so the index launch where the application froze will be visible on the stack.

@lightsighter
Contributor

There's nothing interesting in the backtraces. Do another run with detailed Legion Spy logging and -level dma=1,xplan=1,gpudma=1.
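
Concretely, that amounts to something like the following. This is only a sketch: the srun line and the -logfile pattern are assumptions rather than the actual launch command, and detailed Legion Spy logging also needs a Legion build with LEGION_SPY enabled (which the later comments suggest was the case here).

# sketch: Legion Spy logging plus the requested Realm DMA loggers, one log file per rank
srun -N 2 ./s3d.x \
    -ll:force_kthreads -lg:inorder 1 -lg:safe_ctrlrepl 1 -lg:no_tracing \
    -lg:spy -level dma=1,xplan=1,gpudma=1 \
    -logfile spy_%.log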

@elliottslaughter
Contributor Author

elliottslaughter commented Oct 31, 2024

Logs are here: /scratch/eslaught/marlowe_s3d_tdb_2024-10-31/DBO_Test_2_spy/spy_*.log

I confirmed at the point where I killed the job that the line count from wc -l spy_*.log had not changed for at least 2 minutes (and the job itself had been frozen for about 10 minutes).
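
A trivial way to keep watching for the logs going quiet (just a sketch; any equivalent polling works):

watch -n 30 'wc -l spy_*.log'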

@lightsighter
Contributor

Realm hang. There is a DMA copy that started but did not finish. There are 62361 copies that started running, but only 62360 that completed:

grep -rI ' started ' * | wc
  62361  810693 9659461
grep -rI ' completed ' * | wc
  62360  810680 9784026
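
One way to pull out the request that started but never completed (a sketch that relies on the 'dma request 0x...' wording of these log lines):

# list dma request ids that logged 'started' but have no matching 'completed'
comm -23 \
  <(grep -Ih ' started '   spy_*.log | grep -o 'request 0x[0-9a-f]*' | sort -u) \
  <(grep -Ih ' completed ' spy_*.log | grep -o 'request 0x[0-9a-f]*' | sort -u)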

@lightsighter
Contributor

Here is the bad copy:

spy_4.log:[4 - 15456df99000]   38.878847 {2}{dma}: dma request 0x1545865c7910 created - plan=0x15458670bd70 before=80020010017008f5 after=80020010017008f6
spy_4.log:[4 - 15456df99000]   38.878852 {2}{dma}: dma request 0x1545865c7910 ready - plan=0x15458670bd70 before=80020010017008f5 after=80020010017008f6
spy_4.log:[4 - 15456df99000]   38.878853 {2}{dma}: dma request 0x1545865c7910 started - plan=0x15458670bd70 before=80020010017008f5 after=80020010017008f6

@artempriakhin or @muraj can one of you guys take a look at the logs?

@eddy16112
Contributor

@muraj @artempriakhin I'm not sure if you have access to the machine. Here are the logs related to this bad copy:

[4 - 15456df99000]   38.878811 {2}{xplan}: created: plan=0x15458670bd70 domain=IS:<0,0>..<29,29>,dense srcs=1 dsts=1
[4 - 15456df99000]   38.878812 {1}{xplan}: created: plan=0x15458670bd70 srcs[0]=field(101, inst=4000000000c00023, size=8)
[4 - 15456df99000]   38.878814 {1}{xplan}: created: plan=0x15458670bd70 dsts[0]=field(101, inst=4001c001c0c00023, size=8)
[4 - 15456df99000]   38.878841 {1}{xplan}: analysis: plan=0x15458670bd70 dim_order=[0, 1] xds=1 ibs=0
[4 - 15456df99000]   38.878844 {1}{xplan}: analysis: plan=0x15458670bd70 xds[0]: target=0 inputs=[inst(4000000000c00023:(1e00000000000003:GPU_FB_MEM),0+1)] outputs=[inst(4001c001c0c00023:(1e00070000000003:GPU_FB_MEM),0+1)] channel=12
[4 - 15456df99000]   38.878847 {2}{dma}: dma request 0x1545865c7910 created - plan=0x15458670bd70 before=80020010017008f5 after=80020010017008f6
[4 - 15456df99000]   38.878852 {2}{dma}: dma request 0x1545865c7910 ready - plan=0x15458670bd70 before=80020010017008f5 after=80020010017008f6
[4 - 15456df99000]   38.878853 {2}{dma}: dma request 0x1545865c7910 started - plan=0x15458670bd70 before=80020010017008f5 after=80020010017008f6

Channel 12 is XFER_REMOTE_WRITE, so I suspect this is related to the obcount issue in the GASNet-EX module. @elliottslaughter Could you please try increasing -gex:obcount, or setting -gex:bindcuda 0 to disable GPUDirect?

@lightsighter just curious: why do you issue a copy from node 4 when the source is on node 0 and the destination is on node 7? That does not sound efficient.

@elliottslaughter
Contributor Author

With -gex:bindcuda 0 it runs a little further and then fails with:

[1 - 155222475000]   31.047334 {6}{realm}: invalid event handle: id=154539bc0
s3d.x: /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/runtime_impl.cc:2952: Realm::EventImpl* Realm::RuntimeImpl::get_event_impl(Realm::Event): Assertion `0 && "invalid event handle"' failed.

Backtrace may not be especially helpful, but it's:

[1] #14 0x000015554063a8fe in Realm::RuntimeImpl::get_event_impl (this=<optimized out>, e=...) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/runtime_impl.cc:2952
[1] #15 0x0000155540559575 in Realm::Event::has_triggered_faultaware (poisoned=@0x1545421286bf: false, this=0x154539bc0190) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/event_impl.cc:67
[1] #16 Realm::Event::has_triggered_faultaware (this=0x154539bc0190, poisoned=@0x1545421286bf: false) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/event_impl.cc:61
[1] #17 0x0000155540566f42 in Realm::GenEventImpl::merge_events (wait_for=..., ignore_faults=false) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/utils.h:307
[1] #18 0x000015554264acbd in Realm::Event::merge_events (wait_for=...) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/event.inl:95
[1] #19 Legion::Internal::Runtime::merge_events (events=..., info=0x0) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/legion/runtime.h:5573
[1] #20 Legion::Internal::InnerContext::process_ready_queue (this=<optimized out>) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/legion/legion_context.cc:8789
[1] #21 0x000015554264b02d in Legion::Internal::InnerContext::handle_ready_queue (args=args@entry=0x154539bb58e0) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/legion/legion_context.cc:12300
[1] #22 0x0000155542a1a574 in Legion::Internal::Runtime::legion_runtime_task (args=0x154539bb58e0, arglen=12, userdata=<optimized out>, userlen=<optimized out>, p=...) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/legion/runtime.cc:34807
[1] #23 0x00001555406659e9 in Realm::Task::execute_on_processor (this=0x154539bb5760, p=...) at /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/tasks.cc:326

@lightsighter
Contributor

Backtrace may not be especially helpful, but it's:

Turn off detailed Legion Spy logging and see if it goes away. Also which branch are you on (because it's definitely not the master branch)?

@elliottslaughter
Contributor Author

Oh, right. I forgot this was a Legion Spy build. Thanks.

I'm running onepool because on master I need to raise -lg:eager_alloc_percentage above 50, and then it just becomes impossible to run.

@lightsighter
Contributor

@lightsighter just curious: why do you issue a copy from node 4 when the source is on node 0 and the destination is on node 7? That does not sound efficient.

Legion's dependence analysis often does not happen on the node where the data resides, because the data is used in multiple different places (more than just two nodes). It is a very common pattern for a copy between two nodes to be requested from a third node.

@elliottslaughter
Contributor Author

Ok, after backing out Legion Spy, I can confirm that -gex:bindcuda 0 works.

@lightsighter
Contributor

So this is a duplicate of #1262 then.

@elliottslaughter
Contributor Author

I thought we were following the formula in #1508 (comment)? Is (4 + 2 * gpus/node) * nodes not a conservative estimate for -gex:obcount on a NVIDIA DGX H100 SuperPOD?
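
For concreteness, with 8 GPUs per node (a DGX H100 node), that formula gives:

echo $(( (4 + 2 * 8) * 2 ))   # 2 nodes, 8 GPUs/node -> 40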

@lightsighter
Contributor

It's unclear to me whether that is actually the default in the GASNet-EX module right now, or whether it was just aspirational, i.e. something we could do. That can add up to a lot of memory as you scale, since the number of buffers you need on each node grows as O(N) in the number of nodes.
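
To put rough numbers on that scaling (a sketch, using the same 8-GPUs-per-node assumption as the formula above):

# per-node output-buffer count from the #1508 formula grows linearly with node count
for nodes in 2 16 64 256; do
  echo "$nodes nodes -> $(( (4 + 2 * 8) * nodes )) output buffers per node"
done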

@elliottslaughter
Contributor Author

Just FYI, a run with -gex:obcount $(( num_nodes * 40 )) (and without -gex:bindcuda 0) froze on both 2 and 3 nodes. I guess I'll try increasing it further.

@elliottslaughter
Contributor Author

Running with -gex:obcount $(( num_nodes * 100 )) at 2 nodes instead results in:

*** ERROR (proc 2): snd status=12(transport retry counter exceeded) vendor_err=0x81 qp_num=0xa99 hca=mlx5_3 op=AM dest=(proc:7, qpi:2)
*** FATAL ERROR (proc 2): in gasnetc_snd_reap_one() at gasnet/GASNet-2024.5.0/ibv-conduit/gasnet_core_sndrcv.c:1009: aborting on reap of failed send
*** NOTICE (proc 2): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 2): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=GNU/13.1.0 sys=x86_64-pc-linux-gnu

Is there anything else I can shut off to reduce the required obcount (e.g. newer CUDA memories) without shutting off GPUDirect entirely?

@elliottslaughter
Contributor Author

Ok, running with -gex:obcount $(( num_nodes * 80 )) succeeds at 2 nodes.

At 3 nodes, we run 160 timesteps successfully and then fail with:

[8 - 155220599000]  195.949126 {6}{gpu}: CUDA error reported on GPU 0: unspecified launch failure (CUDA_ERROR_LAUNCH_FAILED)
s3d.x: /projects/m000020/s3d-tdb-onepool-2024-10-29/legion/runtime/realm/cuda/cuda_module.cc:327: bool Realm::Cuda::GPUStream::reap_events(Realm::TimeLimit): Assertion `0' failed.
*** Caught a fatal signal (proc 8): SIGABRT(6)

I notice further down in the log some lines that say:

slurmstepd: error: NVML: Failed to get Compute running process count(15): GPU is lost
slurmstepd: error: NVML: Failed to get usage(15): GPU is lost

Not sure if that's just a reflection of the crash or if "lost" means something more serious happened.

Anyway, this doesn't look immediately related to the obcount issue?

@elliottslaughter
Contributor Author

For comparison, with -gex:bindcuda 0 it runs successfully to completion on both 2 and 3 nodes. (3 nodes is the highest node count I can allocate at this time.)

So perhaps the obcount is related.

I haven't tried to do any NIC binding, and I know there are a lot of NICs on this machine, so maybe that would help.

@elliottslaughter
Contributor Author

I also tried -gex:obcount $(( num_nodes * 60 )), but it hit the same error as #1788 (comment).

@apryakhin
Contributor

I had to do a bit of catching up on this obcount problem before responding with anything meaningful.

As I understand it, obcount controls the number of output buffers for sender-side messages. The Realm GASNet-EX backend allocates a number of these buffers and lets each src/dst pair grab available output buffers to queue up messages that need to be sent. The reason it's done this way is that GASNet tends to copy a message's data into one of its internal buffers first if the message is not sent from memory already registered with the network device. This approach also seems somewhat better than a single circular buffer, since some destinations may be slower than others due to network congestion.

It's possible that we will never completely avoid static tuning of obcount, even if we resolve it for this specific case. Based on the context and ideas from issue #1508, some type of “buffer sharing” would probably be the most future-proof solution in my opinion, although not the easiest. When there are a lot of endpoints on each GPU, a single output buffer stays tied to a given src/dst pair until it fills up; we could instead ask those pairs to close their connections even before the buffers are full and, once their messages complete, return each buffer to an idle pool so it can be grabbed by the next src/dst pair. I'm not suggesting we design the solution here, but I wanted to at least bring it up.

@elliottslaughter, since we're running on InfiniBand and the UCX backend is now fully operational, I'd be open to discussing why we can't use it here. It would be good to understand why we don't see this type of problem with UCX, how it is designed, and whether it's more efficient (@SeyedMir, please correct me here). We did run some benchmarks comparing UCX and GASNet-EX, and from what I remember the performance shouldn't be worse. If it actually is worse, perhaps it would be reasonable to make changes to UCX instead so that it matches up with GASNet.

@syamajala
Contributor

syamajala commented Nov 1, 2024

We still cannot use UCX with S3D because we are not able to turn off the CUDA hijack with Regent. See #1682 and #1782.

@muraj

muraj commented Nov 1, 2024

@syamajala I believe these issues should be fixed in master; can you give it a shot?

@eddy16112
Contributor

@apryakhin If we cannot completely get rid of obcount, I am wondering if we can throw an error telling people to increase it rather than just hanging.

@lightsighter
Contributor

At 3 nodes, we run 160 timesteps successfully and then fail with:

That looks like a real crash. Most likely on the application side, but could potentially be a Realm DMA kernel. I would bet on an application kernel though.

Anyway, this doesn't look immediately related to the obcount issue?

That is not related to the obcount issue.

@elliottslaughter
Contributor Author

@syamajala I believe these issues should be fixed in master; can you give it a shot?

Regent has been fixed in master but the S3D application still has issues currently being tracked in #1782.

It is possible that, since we also crashed in #1788 (comment), we're looking at a genuine application bug that is merely hidden by various settings (CUDA hijack and/or disabling GPUDirect). But if so then I'll probably need help from the original application authors to chase it down.

@elliottslaughter
Contributor Author

@apryakhin If we cannot completely get rid of obcount, I am wondering if we can throw an error telling people to increase it rather than just hanging.

I agree that if this is a detectable condition, it's definitely worth an error, or at least a warning, informing people that we hit it.
