
[BUG] gemm example fails with a problem size that does not fit in the memory of a single GPU #1125

Open

dmargala opened this issue Feb 15, 2024 · 0 comments

Software versions

Python      :  3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0]
Platform    :  Linux-5.14.21-150400.24.81_12.0.87-cray_shasta_c-x86_64-with-glibc2.31
Legion      :  v24.01.00.dev-33-g1d0265c
Legate      :  24.01.00.dev+33.g1d0265c
WARNING: Disabling control replication for interactive run
Disable Control Replication
Cunumeric   :  24.01.00.dev+16.gb0738142
Numpy       :  1.26.4
Scipy       :  1.12.0
Numba       :  0.59.0
CTK package :  cuda-version-12.2-he2b69de_2 (conda-forge)
GPU driver  :  525.105.17
GPU devices :
  GPU 0: NVIDIA A100-SXM4-80GB
  GPU 1: NVIDIA A100-SXM4-80GB
  GPU 2: NVIDIA A100-SXM4-80GB
  GPU 3: NVIDIA A100-SXM4-80GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

I'm trying to run a cunumeric example with arrays that do not fit in the memory of a single GPU. I'm starting with the gemm example, which runs without error for problem sizes below the memory limit of a single GPU, but it seems to fail when I try to scale beyond a single GPU. For example, this works fine (with a per-GPU framebuffer memory pool of 36250 MB, set via --fbmem):

> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 50000
...
Problem Size:     M=50000 N=50000 K=50000
Total Iterations: 100
Total Flops:      249997.5 GFLOPS/iter
Total Size:       30000.0 MB
Elapsed Time:     203771.428 ms
Average GEMM:     2037.7142800000001 ms
FLOPS/s:          122685.25693405847 GFLOPS/s
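
For reference, here's the memory math behind my expectation (a rough sketch; I'm assuming the benchmark keeps three square float32 matrices, which matches the "Total Size: 30000.0 MB" reported above for n=50000):

```python
# Rough memory math for examples/gemm.py. Assumptions (not verified against
# the script itself): three square matrices, float32 elements, and
# 1 MB = 1e6 bytes, consistent with the 30000.0 MB printed for n=50000.
def gemm_working_set_mb(n: int, dtype_bytes: int = 4, num_matrices: int = 3) -> float:
    return num_matrices * n * n * dtype_bytes / 1e6

print(gemm_working_set_mb(50_000))  # 30000.0 -> fits in one 36250 MB GPU pool
print(gemm_working_set_mb(60_000))  # 43200.0 -> needs more than one GPU
```

So n=60000 needs roughly 43200 MB in total: too big for any single 36250 MB framebuffer pool, but far below the aggregate of 8 GPUs across 2 nodes, which is why I expected the distributed run to succeed.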

Observed behavior

Increasing the problem size results in an error such as:

(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
...

[0 - 7f47a9e9b000]    2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000]    2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
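
The error message suggests a few knobs. One thing I can try (untested; this just applies the message's own suggestions, reusing the flags from the launch command shown in the full log below, with the eager pool lowered from the quickstart default of 50):

legate --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --nodes 2 --ranks-per-node 1 --eager-alloc-percentage 10 examples/gemm.py -n 60000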

Example code or instructions

I've set up my environment using the nv-legate/quickstart recipe for Perlmutter, and I'm using the quickstart run.sh script to launch the example. For example:

INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
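
For context, the core of what fails amounts to the following (my own minimal sketch, not the actual examples/gemm.py; the mapper failure points at the matmul on gemm.py line 52):

```python
import cunumeric as np

n = 60_000  # n = 50_000 works; n = 60_000 fails to map

# Three n x n float32 matrices (A, B, and the result): ~43.2 GB total for
# n=60000, more than one GPU's 36250 MB pool but well under the
# 2-node x 4-GPU aggregate.
A = np.ones((n, n), dtype=np.float32)
B = np.ones((n, n), dtype=np.float32)

C = A @ B  # cunumeric::MatMulTask aborts here with the allocation failure
```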

Stack traceback or browser console output

(legate) dmargala@perlmutter:login40:/pscratch/sd/d/dmargala/work/cunumeric> INTERACTIVE=1 ../quickstart/run.sh 2 examples/gemm.py -n 60000
Redirecting stdout, stderr and logs to /pscratch/sd/d/dmargala/2024/02/15/103417
Submitted: salloc -q interactive_ss11 -C gpu --gpus-per-node 4 --ntasks-per-node 1 -c 128 -J legate -A nstaff -t 60 -N 2 /pscratch/sd/d/dmargala/work/quickstart/legate.slurm legate --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000
salloc: Granted job allocation 21765749
salloc: Waiting for resource configuration
salloc: Nodes nid[200413,200416] are ready for job
Job ID: 21765749
Submitted from: /pscratch/sd/d/dmargala/work/cunumeric
Started on: Thu 15 Feb 2024 10:34:24 AM PST
Running on: nid[200413,200416]
Command: legate --logdir /pscratch/sd/d/dmargala/2024/02/15/103417 --launcher srun --cpus 1 --sysmem 4000 --gpus 4 --fbmem 36250 --verbose --log-to-file --nodes 2 --ranks-per-node 1 examples/gemm.py -n 60000

--- Legion Python Configuration ------------------------------------------------

Legate paths:
  legate_dir       : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
  legate_build_dir : None
  bind_sh_path     : /pscratch/sd/d/dmargala/legate/bin/bind.sh
  legate_lib_path  : /pscratch/sd/d/dmargala/legate/lib

Legion paths:
  legion_bin_path       : /pscratch/sd/d/dmargala/legate/bin
  legion_lib_path       : /pscratch/sd/d/dmargala/legate/lib
  realm_defines_h       : /pscratch/sd/d/dmargala/legate/include/realm_defines.h
  legion_defines_h      : /pscratch/sd/d/dmargala/legate/include/legion_defines.h
  legion_spy_py         : /pscratch/sd/d/dmargala/legate/bin/legion_spy.py
  legion_python         : /pscratch/sd/d/dmargala/legate/bin/legion_python
  legion_prof           : /pscratch/sd/d/dmargala/legate/bin/legion_prof
  legion_module         : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
  legion_jupyter_module : /pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages

Versions:
  legate_version : 24.01.00.dev+33.g1d0265c

Command:
  srun -n 2 --ntasks-per-node 1 /pscratch/sd/d/dmargala/legate/bin/bind.sh --launcher srun -- /pscratch/sd/d/dmargala/legate/bin/legion_python -ll:py 1 -ll:gpu 4 -cuda:skipbusy -ll:util 2 -ll:bgwork 2 -ll:csize 4000 -ll:fsize 36250 -ll:zsize 32 -level openmp=5,gpu=5 -logfile /pscratch/sd/d/dmargala/2024/02/15/103417/legate_%.log -errlevel 4 -lg:eager_alloc_percentage 50 examples/gemm.py -n 60000

Customized Environment:
  CUTENSOR_LOG_LEVEL=1
  GASNET_MPI_THREAD=MPI_THREAD_MULTIPLE
  LEGATE_MAX_DIM=4
  LEGATE_MAX_FIELDS=256
  LEGATE_NEED_CUDA=1
  LEGATE_NEED_NETWORK=1
  NCCL_LAUNCH_MODE=PARALLEL
  PYTHONDONTWRITEBYTECODE=1
  PYTHONPATH=/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages:/pscratch/sd/d/dmargala/legate/lib/python3.11/site-packages
  REALM_BACKTRACE=1

--------------------------------------------------------------------------------

[0 - 7f47a9e9b000]    2.473642 {5}{cunumeric.mapper}: Mapper cunumeric on Node 0 failed to allocate 3600000000 bytes on memory 1e00000000000003 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 188).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[0 - 7f47a9e9b000]    2.473667 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 0, process 626675 (thread 7f47a9e9b000) - obtaining backtrace
[1 - 7fdd7c0b1000]    2.473897 {5}{cunumeric.mapper}: Mapper cunumeric on Node 1 failed to allocate 3600000000 bytes on memory 1e00010000000004 (of kind GPU_FB_MEM: Framebuffer memory for one GPU and all its SMs) for region requirement 1 of Task cunumeric::MatMulTask[examples/gemm.py:52] (UID 189).
This means Legate was unable to reserve ouf of its memory pool the full amount required for the above operation. Here are some things to try:
* Make sure your code is not impeding the garbage collection of Legate-backed objects, e.g. by storing references in caches, or creating reference cycles.
* Ask Legate to reserve more space on the above memory, using the appropriate --*mem legate flag.
* Assign less memory to the eager pool, by reducing --eager-alloc-percentage.
* If running on multiple nodes, increase how often distributed garbage collection runs, by reducing LEGATE_FIELD_REUSE_FREQ (default: 32, warning: may incur overhead).
* Adapt your code to reduce temporary storage requirements, e.g. by breaking up larger operations into batches.
* If the previous steps don't help, and you are confident Legate should be able to handle your code's working set, please open an issue on Legate's bug tracker.
[1 - 7fdd7c0b1000]    2.473923 {5}{legate}: Legate called abort in /pscratch/sd/d/dmargala/work/legate.core/src/core/mapping/base_mapper.cc at line 804 in function report_failed_mapping
Signal 6 received by node 1, process 1705264 (thread 7fdd7c0b1000) - obtaining backtrace
Signal 6 received by process 626675 (thread 7f47a9e9b000) at: stack trace: 17 frames
  [0] = raise at unknown file:0 [00007f47ba306d2b]
  [1] = abort at unknown file:0 [00007f47ba3083e4]
  [2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007f474db5d69e]
  [3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007f474db75d30]
  [4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007f474db7605f]
  [5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007f474db766f5]
  [6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007f474db7722c]
  [7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007f47bf1c5be7]
  [8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007f47bf12509e]
  [9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12bbbc]
  [10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007f47bf12c0d3]
  [11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007f47bf2d2e96]
  [12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007f47bd944cb8]
  [13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007f47bd944d55]
  [14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007f47bd9432d6]
  [15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007f47bd949ae9]
  [16] = unknown symbol at unknown file:0 [00007f47ba31d73d]
Signal 6 received by process 1705264 (thread 7fdd7c0b1000) at: stack trace: 17 frames
  [0] = raise at unknown file:0 [00007fdd8bd06d2b]
  [1] = abort at unknown file:0 [00007fdd8bd083e4]
  [2] = legate::mapping::BaseMapper::report_failed_mapping(Legion::Mappable const&, unsigned int, Realm::Memory, int, unsigned long) [clone .cold] at unknown file:0 [00007fdd3d57569e]
  [3] = legate::mapping::BaseMapper::map_legate_store(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, legate::mapping::StoreMapping const&, std::set<Legion::RegionRequirement const*, std::less<Legion::RegionRequirement const*>, std::allocator<Legion::RegionRequirement const*> > const&, Realm::Processor, Legion::Mapping::PhysicalInstance&, bool) at unknown file:0 [00007fdd3d58dd30]
  [4] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&)::{lambda(bool)#1}::operator()(bool) const at unknown file:0 [00007fdd3d58e05f]
  [5] = legate::mapping::BaseMapper::map_legate_stores(Legion::Internal::MappingCallInfo*, Legion::Mappable const&, std::vector<legate::mapping::StoreMapping, std::allocator<legate::mapping::StoreMapping> >&, Realm::Processor, std::map<Legion::RegionRequirement const*, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*, std::less<Legion::RegionRequirement const*>, std::allocator<std::pair<Legion::RegionRequirement const* const, std::vector<Legion::Mapping::PhysicalInstance, std::allocator<Legion::Mapping::PhysicalInstance> >*> > >&) at unknown file:0 [00007fdd3d58e6f5]
  [6] = legate::mapping::BaseMapper::map_task(Legion::Internal::MappingCallInfo*, Legion::Task const&, Legion::Mapping::Mapper::MapTaskInput const&, Legion::Mapping::Mapper::MapTaskOutput&) at unknown file:0 [00007fdd3d58f22c]
  [7] = Legion::Internal::MapperManager::invoke_map_task(Legion::Internal::TaskOp*, Legion::Mapping::Mapper::MapTaskInput*, Legion::Mapping::Mapper::MapTaskOutput*, Legion::Internal::MappingCallInfo*) at unknown file:0 [00007fdd90b7abe7]
  [8] = Legion::Internal::SingleTask::invoke_mapper(Legion::Internal::MustEpochOp*) at unknown file:0 [00007fdd90ada09e]
  [9] = Legion::Internal::SingleTask::map_all_regions(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae0bbc]
  [10] = Legion::Internal::PointTask::perform_mapping(Legion::Internal::MustEpochOp*, Legion::Internal::TaskOp::DeferMappingArgs const*) at unknown file:0 [00007fdd90ae10d3]
  [11] = Legion::Internal::Runtime::legion_runtime_task(void const*, unsigned long, void const*, unsigned long, Realm::Processor) at unknown file:0 [00007fdd90c87e96]
  [12] = Realm::Task::execute_on_processor(Realm::Processor) at unknown file:0 [00007fdd8f2f9cb8]
  [13] = Realm::UserThreadTaskScheduler::execute_task(Realm::Task*) at unknown file:0 [00007fdd8f2f9d55]
  [14] = Realm::ThreadedTaskScheduler::scheduler_loop() at unknown file:0 [00007fdd8f2f82d6]
  [15] = Realm::UserThread::uthread_entry() at unknown file:0 [00007fdd8f2feae9]
  [16] = unknown symbol at unknown file:0 [00007fdd8bd1d73d]
srun: error: nid200413: task 0: Exited with exit code 1
srun: Terminating StepId=21765749.0
srun: error: nid200416: task 1: Exited with exit code 1
Command completed on: Thu 15 Feb 2024 10:34:44 AM PST
Job finished: Thu 15 Feb 2024 10:34:44 AM PST
salloc: Relinquishing job allocation 21765749
salloc: Job allocation 21765749 has been revoked.