This repository has been archived by the owner on Nov 1, 2021. It is now read-only.

Issues running examples #40

Closed
juesato opened this issue Dec 8, 2016 · 1 comment

juesato commented Dec 8, 2016

Hi,

Thanks for releasing this library, it looks awesome! I hit the same issue mentioned in #26 when I run allgpu-allreduce, but it occurs even without other jobs running.

t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ th allgpu-allreduce.lua 
Found 8 GPUs, forking children...	
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 336): (9, Bad file descriptor)

stack traceback:
	[C]: in function 'client'
	...orch_rbgk40/install/share/lua/5.1/ipc/DiscoveredTree.lua:15: in function 'LocalhostTree'
	allgpu-allreduce.lua:39: in main chunk
	[C]: in function 'dofile'
	...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x00406670
/home/t-jouesa/torch_installs/torch_rbgk40/install/bin/luajit: ...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: ERROR: (/home/t-jouesa/code/torch-ipc/src/cliser.c, 446): (server timed out waiting for clients to connect)

stack traceback:
	[C]: in function 'clients'
	...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:36: in function 'initialServer'
	...installs/torch_rbgk40/install/share/lua/5.1/ipc/Tree.lua:136: in function 'LocalhostTree'
	allgpu-allreduce.lua:39: in main chunk
	[C]: in function 'dofile'
	...rch_rbgk40/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x00406670

I'm able to run examples/allreduce.lua with the CUDA option, and it doesn't hang.

The second issue is that in both cases, not all of the GPUs allow CUDA IPC. This looks similar to #17, but I don't seem to have ACS enabled: commands like the following print nothing.

t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 07:00.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 08:08.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 08:10.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 0b:00.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 0c:08.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 0c:10.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 21:00.0 -vvvv|grep -i acs
t-jouesa@rbgk40:~/code/torch-ipc/benchmarks$ lspci -s 22:01.0 -vvvv|grep -i acs

juesato commented Dec 12, 2016

Sorry, my bad. The messages make total sense here: GPUs 0-3 are connected to each other, and GPUs 4-7 are connected to each other, but there is no connection between the two groups. Closing.
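
For anyone reading the same kind of log, here is a minimal sketch of how the grouping can be read off the INFO lines. It is plain Python and does not use torch-ipc; the helper name `peer_report` is hypothetical, and the log text is copied from the output earlier in this thread. Only pairs involving GPU0 appear there, so the sketch can say nothing about GPU1 or about pairs that do not involve GPU0.

```python
import re
from collections import defaultdict

# INFO lines copied verbatim from the output earlier in this thread.
LOG = """\
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU7
INFO: torch-ipc: CUDA IPC not possible between GPU7 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU6
INFO: torch-ipc: CUDA IPC not possible between GPU6 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU3
INFO: torch-ipc: CUDA IPC enabled between GPU3 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU0 and GPU2
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU5
INFO: torch-ipc: CUDA IPC not possible between GPU0 and GPU4
INFO: torch-ipc: CUDA IPC not possible between GPU5 and GPU0
INFO: torch-ipc: CUDA IPC not possible between GPU4 and GPU0
INFO: torch-ipc: CUDA IPC enabled between GPU2 and GPU0
"""

# Matches the two message variants that appear in the log above.
PATTERN = re.compile(
    r"CUDA IPC (enabled|not possible) between GPU(\d+) and GPU(\d+)"
)


def peer_report(log):
    """Hypothetical helper: map each source GPU to the peers it can and cannot reach."""
    reachable = defaultdict(set)
    unreachable = defaultdict(set)
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue  # skip anything that is not an IPC status line
        status = match.group(1)
        src, dst = int(match.group(2)), int(match.group(3))
        target = reachable if status == "enabled" else unreachable
        target[src].add(dst)
    return reachable, unreachable


reachable, unreachable = peer_report(LOG)
print("GPU0 can use CUDA IPC with:", sorted(reachable[0]))       # [2, 3]
print("GPU0 cannot use CUDA IPC with:", sorted(unreachable[0]))  # [4, 5, 6, 7]
```

The split it reports for GPU0 (peers 2 and 3 reachable, peers 4 through 7 not) matches the two groups described above.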

juesato closed this as completed Dec 12, 2016