This repository has been archived by the owner on Nov 1, 2021. It is now read-only.
Hi,
Thanks for releasing this library - it looks awesome! I have the same issue mentioned at #26 when I run allgpu-allreduce, but it occurs even without other jobs running:
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ th allgpu-allreduce.lua
Found 8 GPUs, forking children...
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 336): (9, Bad file descriptor)
stack traceback:
[C]: in function 'client'
...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: in function 'LocalhostTree'
allgpu-allreduce.lua:39: in main chunk
[C]: in function 'dofile'
...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 446): (server timed out waiting for clients to connect)
stack traceback:
[C]: in function 'clients'
...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: in function 'initialServer'
...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:136: in function 'LocalhostTree'
allgpu-allreduce.lua:39: in main chunk
[C]: in function 'dofile'
...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
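To make the pairing in the INFO lines easier to read, they can be grouped per GPU. This is only a minimal sketch in Python, not part of torch-ipc: the helper name `summarize_ipc` and the `SAMPLE_LOG` variable are made up for illustration, and it assumes the log lines keep exactly the format shown above.

```python
import re
from collections import defaultdict

# Matches the torch-ipc INFO lines as they appear in the log above.
IPC_LINE = re.compile(
    r"CUDA IPC (enabled|not possible) between GPU(\d+) and GPU(\d+)"
)


def summarize_ipc(log_text):
    """Group peer GPUs by IPC status for each source GPU found in log_text.

    Hypothetical helper: returns {gpu: {"enabled": [...], "not possible": [...]}}.
    """
    summary = defaultdict(lambda: {"enabled": [], "not possible": []})
    for match in IPC_LINE.finditer(log_text):
        status = match.group(1)
        src, dst = int(match.group(2)), int(match.group(3))
        summary[src][status].append(dst)
    # Sort the peer lists so the output does not depend on log order.
    for peers in summary.values():
        peers["enabled"].sort()
        peers["not possible"].sort()
    return dict(summary)


# The INFO lines copied from the log above.
SAMPLE_LOG = """\
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
"""

summary = summarize_ipc(SAMPLE_LOG)
for gpu in sorted(summary):
    print("GPU%d: enabled=%s not_possible=%s" % (
        gpu, summary[gpu]["enabled"], summary[gpu]["not possible"]))
```

For the log above, this groups GPU0 as IPC-enabled with GPU2 and GPU3 and not possible with GPU4 through GPU7.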
I'm able to run examples/allreduce.lua with the CUDA option, and it doesn't hang.
The second issue is that in both cases, not all of the GPUs allow CUDA IPC. This appears similar to #17 but I don't seem to have APC enabled, based on running commands, like