Expected behavior
I'm trying to run a cunumeric example with arrays that do not fit into the memory of a single GPU. I'm starting with the gemm example, which runs without error for problem sizes below the single-GPU memory limit, but it seems to fail when I try to scale beyond a single GPU. For example, this run works fine (frame buffer size of 36250 MB per GPU):
> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 50000
...
Problem Size: M=50000 N=50000 K=50000
Total Iterations: 100
Total Flops: 249997.5 GFLOPS/iter
Total Size: 30000.0 MB
Elapsed Time: 203771.428 ms
Average GEMM: 2037.7142800000001 ms
FLOPS/s: 122685.25693405847 GFLOPS/s
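For reference, the core of the benchmark is roughly the following. This is my own minimal sketch, not the actual examples/gemm.py (which also handles timing, dtype selection, and argument parsing, and whose function names differ):

```python
# Minimal sketch of the benchmark loop (hypothetical; the actual
# examples/gemm.py adds timing, dtype selection, and argument parsing).
import cunumeric as np

def run_gemm(n, iters=100, dtype=np.float32):
    # Three n x n operands; Legate/cunumeric partitions them across GPUs.
    A = np.random.rand(n, n).astype(dtype)
    B = np.random.rand(n, n).astype(dtype)
    C = np.zeros((n, n), dtype=dtype)
    for _ in range(iters):
        np.matmul(A, B, out=C)  # one full GEMM per iteration
    return C

if __name__ == "__main__":
    run_gemm(1000, iters=10)  # small smoke-test size
```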
Observed behavior
Increasing the problem size results in an error such as:
(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
...
[0 - 7f47a9e9b000] 2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000] 2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
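For orientation, here is a quick back-of-envelope of the sizes involved, assuming float32 operands (consistent with the 30000 MB total reported for the n=50000 run):

```python
# Rough size arithmetic for the failing n=60000 run, assuming float32 (4 bytes).
n = 60_000
bytes_per_matrix = n * n * 4                  # 14.4 GB for each of A, B, C
total_working_set = 3 * bytes_per_matrix      # ~43.2 GB, more than one 36 GB frame buffer
ideal_per_gpu = total_working_set / 8         # ~5.4 GB if split evenly over 2 nodes x 4 GPUs
failed_allocation = 3_600_000_000             # from the mapper error above
print(failed_allocation / bytes_per_matrix)   # 0.25: numerically a quarter of one matrix
```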
Example code or instructions
I've set up my environment using the nv-legate/quickstart recipe for Perlmutter, and I'm launching jobs with the quickstart run script. For example:
(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
Redirecting stdout, stderr and logs to /pscratch/sd/d/dmargala/2024/02/15/103417
Submitted: salloc -q interactive_ss11 -C gpu --gpus-per-node 4 --ntasks-per-node 1 -c 128 -J legate -A nstaff -t 60 -N 2 /pscratch/sd/d/dmargala/work/quickstart/legate.slurm legate --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000
salloc: Granted job allocation 21765749
salloc: Waiting for resource configuration
salloc: Nodes nid[200413,200416] are ready for job
Job ID: 21765749
Submitted from: /pscratch/sd/d/dmargala/work/cunumeric
Started on: Thu 15 Feb 2024 10:34:24 AM PST
Running on: nid[200413,200416]
Command: legate --logdir /pscratch/sd/d/dmargala/2024/02/15/103417 --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000
--- Legion Python Configuration ------------------------------------------------
Legate paths:
legate_dir : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
legate_build_dir : None
bind_sh_path : /pscratch/sd/d/dmargala/legate/bin/bind.sh
legate_lib_path : /pscratch/sd/d/dmargala/legate/lib
Legion paths:
legion_bin_path : /pscratch/sd/d/dmargala/legate/bin
legion_lib_path : /pscratch/sd/d/dmargala/legate/lib
realm_defines_h : /pscratch/sd/d/dmargala/legate/include/realm_defines.h
legion_defines_h : /pscratch/sd/d/dmargala/legate/include/legion_defines.h
legion_spy_py : /pscratch/sd/d/dmargala/legate/bin/legion_spy.py
legion_python : /pscratch/sd/d/dmargala/legate/bin/legion_python
legion_prof : /pscratch/sd/d/dmargala/legate/bin/legion_prof
legion_module : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
legion_jupyter_module : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
Versions:
legate_version : 24.01.00.dev+33.g1d0265c
Command:
srun -n 2 --ntasks-per-node 1 /pscratch/sd/d/dmargala/legate/bin/bind.sh --launcher srun -- /pscratch/sd/d/dmargala/legate/bin/legion_python -ll:py 1 -ll:gpu 4 -cuda:skipbusy -ll:util 2 -ll:bgwork 2 -ll:csize 4000 -ll:fsize 36250 -ll:zsize 32 -level openmp=5,gpu=5 -logfile /pscratch/sd/d/dmargala/2024/02/15/103417/legate_%.log -errlevel 4 -lg:eager_alloc_percentage 50 examples/gemm.py -n 60000
Customized Environment:
CUTENSOR_LOG_LEVEL=1
GASNET_MPI_THREAD=MPI_THREAD_MULTIPLE
LEGATE_MAX_DIM=4
LEGATE_MAX_FIELDS=256
LEGATE_NEED_CUDA=1
LEGATE_NEED_NETWORK=1
NCCL_LAUNCH_MODE=PARALLEL
PYTHONDONTWRITEBYTECODE=1
PYTHONPATH=/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages:/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
REALM_BACKTRACE=1
--------------------------------------------------------------------------------
[0 - 7f47a9e9b000] 2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000] 2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
[1 - 7fdd7c0b1000] 2.473897 {5}{cunumeric.mapper}: Mapper cunumeric on Node 1 failed to allocate 3600000000 bytes on memory 1e00010000000004 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 189).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[1 - 7fdd7c0b1000] 2.473923 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 1, process 1705264 (thread 7fdd7c0b1000) - obtaining backtrace
Signal 6 received by process 626675 (thread 7f47a9e9b000) at: stack trace: 17 frames
[0] = raise at unknown file:0 [00007f47ba306d2b]
[1] = abort at unknown file:0 [00007f47ba3083e4]
[2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007f474db5d69e]
[3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007f474db75d30]
[4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007f474db7605f]
[5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007f474db766f5]
[6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007f474db7722c]
[7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007f47bf1c5be7]
[8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007f47bf12509e]
[9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12bbbc]
[10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12c0d3]
[11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007f47bf2d2e96]
[12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007f47bd944cb8]
[13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007f47bd944d55]
[14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007f47bd9432d6]
[15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007f47bd949ae9]
[16] = unknown symbol at unknown file:0 [00007f47ba31d73d]
Signal 6 received by process 1705264 (thread 7fdd7c0b1000) at: stack trace: 17 frames
[0] = raise at unknown file:0 [00007fdd8bd06d2b]
[1] = abort at unknown file:0 [00007fdd8bd083e4]
[2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007fdd3d57569e]
[3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007fdd3d58dd30]
[4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007fdd3d58e05f]
[5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007fdd3d58e6f5]
[6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007fdd3d58f22c]
[7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007fdd90b7abe7]
[8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007fdd90ada09e]
[9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae0bbc]
[10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae10d3]
[11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fdd90c87e96]
[12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fdd8f2f9cb8]
[13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fdd8f2f9d55]
[14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fdd8f2f82d6]
[15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007fdd8f2feae9]
[16] = unknown symbol at unknown file:0 [00007fdd8bd1d73d]
srun: error: nid200413: task 0: Exited with exit code 1
srun: Terminating StepId=21765749.0
srun: error: nid200416: task 1: Exited with exit code 1
Command completed on: Thu 15 Feb 2024 10:34:44 AM PST
Job finished: Thu 15 Feb 2024 10:34:44 AM PST
salloc: Relinquishing job allocation 21765749
salloc: Job allocation 21765749 has been revoked.