I was trying to run a basic example of tensor multiplication on GPUs.
I first got an error saying that the .contiguous() function is not implemented for CUDALongTensor.
I implemented the function (see implementation below), but I now get an error from the replicate_shares function (crypten/mpc/primitives/replicated.py), specifically: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.11]:60057: Bad address.
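Code that leads to the error
import torch
import crypten
import crypten.communicator as comm
# MultiProcessLauncher comes from the example folder; this import path is an
# assumption based on the file name shown in the traceback.
from multiprocess_launcher import MultiProcessLauncher

def test():
    # Create two random CrypTensors and multiply them
    aa = torch.randn(2)
    bb = torch.randn(2)
    rank = comm.get().get_rank()
    enc_aa = crypten.cryptensor(aa, src=0).to(f"cuda:{rank}")
    enc_bb = crypten.cryptensor(bb, src=0).to(f"cuda:{rank}")
    enc_aa * enc_bb

if __name__ == '__main__':
    launcher = MultiProcessLauncher(3, test)
    launcher.start()
    launcher.join()
    launcher.terminate()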
Complete error output
Process process 0:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/experiments/multi_proc/multiprocess_launcher.py", line 62, in _run_process
run_process_fn()
File "/workspace/experiments/multi_proc/cuda_example_2pc.py", line 182, in test
enc_aa * enc_bb
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 497, in __mul__
return self.mul(tensor)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 338, in autograd_forward
return self.__getattribute__(name)(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/mpc.py", line 339, in binary_wrapper_function
result._tensor = getattr(result._tensor, name)(value, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 428, in mul
return self._arithmetic_function(y, "mul")
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 384, in _arithmetic_function
getattr(protocol, op)(result, y, *args, **kwargs).share.data
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 73, in mul
return __replicated_secret_sharing_protocol("mul", x, y)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 59, in __replicated_secret_sharing_protocol
x_shares, y_shares = replicate_shares([x.share, y.share])
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 28, in replicate_shares
send_req = comm.get().isend(share.contiguous(), dst=next_rank)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/communicator.py", line 234, in logging_wrapper
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/distributed_communicator.py", line 133, in isend
return dist.isend(tensor.data, dst, group=self.main_group)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 742, in isend
return group.send([tensor], group_dst_rank, tag)
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.11]:60057: Bad address
terminate called after throwing an instance of 'gloo::IoException'
what(): [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.11]:61592: Bad address
Process process 2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/experiments/multi_proc/multiprocess_launcher.py", line 62, in _run_process
run_process_fn()
File "/workspace/experiments/multi_proc/cuda_example_2pc.py", line 182, in test
enc_aa * enc_bb
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 497, in __mul__
return self.mul(tensor)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 338, in autograd_forward
return self.__getattribute__(name)(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/mpc.py", line 339, in binary_wrapper_function
result._tensor = getattr(result._tensor, name)(value, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 428, in mul
return self._arithmetic_function(y, "mul")
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 384, in _arithmetic_function
getattr(protocol, op)(result, y, *args, **kwargs).share.data
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 73, in mul
return __replicated_secret_sharing_protocol("mul", x, y)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 59, in __replicated_secret_sharing_protocol
x_shares, y_shares = replicate_shares([x.share, y.share])
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 30, in replicate_shares
recv_req = comm.get().irecv(rep_shares[-1], src=prev_rank)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/communicator.py", line 234, in logging_wrapper
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/distributed_communicator.py", line 139, in irecv
return dist.irecv(tensor.data, src=src, group=self.main_group)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 781, in irecv
return pg.recv([tensor], group_src_rank, tag)
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.11]:23944
Traceback (most recent call last):
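The two tracebacks above stop at opposite ends of the same exchange: process 0 fails in isend(..., dst=next_rank) and process 2 fails in irecv(..., src=prev_rank). A minimal sketch of that neighbor pattern, assuming replicate_shares arranges the parties in a ring (ring_neighbors is a hypothetical helper written for illustration, not a CrypTen function):

```python
from typing import Tuple


def ring_neighbors(rank: int, world_size: int) -> Tuple[int, int]:
    """Return (prev_rank, next_rank) for a party in a ring of world_size parties.

    Assumption: each party sends its share to the next rank and receives
    one from the previous rank, wrapping around at the ends.
    """
    prev_rank = (rank - 1) % world_size
    next_rank = (rank + 1) % world_size
    return prev_rank, next_rank


if __name__ == "__main__":
    # With three processes, as in the launcher call above.
    for rank in range(3):
        prev_rank, next_rank = ring_neighbors(rank, 3)
        print(f"rank {rank}: sends to {next_rank}, receives from {prev_rank}")
```

Under this assumption every process both sends and receives in each call, so a failure on one process's send would be expected to show up as a closed connection on another process.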
Machine setup
Single machine with 3 Tesla V100 GPUs
I am running my code in a Docker container
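One possible explanation, not confirmed here: PyTorch's Gloo backend documents point-to-point send and recv as supported for CPU tensors only, so passing a CUDA tensor to isend could surface as the writev "Bad address" failure. A minimal sketch of staging a tensor through host memory before sending, assuming plain torch.Tensor inputs rather than CrypTen's CUDALongTensor wrapper (stage_on_cpu and isend_via_cpu are hypothetical helper names):

```python
import torch
import torch.distributed as dist


def stage_on_cpu(tensor: torch.Tensor) -> torch.Tensor:
    """Return a contiguous CPU copy of `tensor` (hypothetical helper)."""
    return tensor.detach().cpu().contiguous()


def isend_via_cpu(tensor: torch.Tensor, dst: int):
    """Send `tensor` to rank `dst` through host memory.

    Assumes the default process group is already initialized with the
    Gloo backend. Returns the request and the staged buffer; the caller
    should keep the buffer alive until request.wait() returns.
    """
    staged = stage_on_cpu(tensor)
    request = dist.isend(staged, dst=dst)
    return request, staged
```

The receiving side would need the mirror image: receive into a CPU buffer, wait on the request, then copy the result back to the GPU. This adds a device-to-host copy per message, so it is a diagnostic workaround rather than a performance-oriented fix.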