Replicate Secret Sharing protocol not working on GPUs in docker #490

Open
fra-cap opened this issue Aug 24, 2023 · 0 comments

Description

I am trying to run a basic example of encrypted tensor multiplication on GPUs, with three processes on a single machine.
The first error was that .contiguous() is not implemented for CUDALongTensor.
After I implemented it (see below), the multiplication fails inside replicate_shares (crypten/mpc/primitives/replicated.py) with:
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.11]:60057: Bad address

Code

contiguous() implementation added to CUDALongTensor

# File: crypten/cuda/cuda_tensor.py
def contiguous(self):
    return self._tensor.contiguous()
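
As background on what .contiguous() guarantees: a tensor can be handed to a transport as one flat block only if its strides describe a dense row-major layout, which is presumably why replicate_shares calls it before sending. The sketch below checks that condition in plain Python. The helper is_row_major_contiguous is a hypothetical name used for illustration, not part of CrypTen or PyTorch, and it ignores edge cases such as zero-sized tensors.

```python
def is_row_major_contiguous(shape, strides):
    """Return True if strides (in elements) describe a dense row-major layout."""
    expected = 1
    # Walk the dimensions from innermost to outermost.
    for size, stride in zip(reversed(shape), reversed(strides)):
        # A dimension of size 1 may carry any stride without changing the layout.
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


# A 2x3 tensor stored densely has strides (3, 1).
dense = is_row_major_contiguous((2, 3), (3, 1))
# Its transpose is a 3x2 view with strides (1, 3): same memory, not contiguous.
transposed = is_row_major_contiguous((3, 2), (1, 3))
print(dense, transposed)
```

Here the dense layout passes the check and the transposed view does not, which is the case where a copy into a fresh buffer is needed.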

Code that leads to the error

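import torch
import crypten
import crypten.communicator as comm

# MultiProcessLauncher comes from the CrypTen examples folder; this import
# assumes multiprocess_launcher.py sits next to this script.
from multiprocess_launcher import MultiProcessLauncher


def test():
    # Create two random CrypTensors on this rank's GPU and multiply them.
    aa = torch.randn(2)
    bb = torch.randn(2)
    rank = comm.get().get_rank()
    enc_aa = crypten.cryptensor(aa, src=0).to(f"cuda:{rank}")
    enc_bb = crypten.cryptensor(bb, src=0).to(f"cuda:{rank}")
    enc_aa * enc_bb


if __name__ == "__main__":
    # Launch three processes, one per party.
    launcher = MultiProcessLauncher(3, test)
    launcher.start()
    launcher.join()
    launcher.terminate()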
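
For context, the multiplication above runs through the replicated secret sharing protocol: according to the traceback, replicate_shares sends each party's share to next_rank and receives one from prev_rank before the product is computed. The following is a minimal pure-Python simulation of that pattern, assuming the standard three-party replicated scheme over the integers modulo 2^64. The names share, replicate, and local_product are illustrative. The simulation runs in a single process with no real communication and omits CrypTen's fixed-point encoding and truncation, so it is a sketch of the idea and not CrypTen's actual code.

```python
import random

RING = 2 ** 64  # shares live in the integers modulo 2^64 (assumption)
PARTIES = 3


def share(secret, rng):
    """Split secret into three additive shares that sum to it modulo RING."""
    s0 = rng.randrange(RING)
    s1 = rng.randrange(RING)
    s2 = (secret - s0 - s1) % RING
    return [s0, s1, s2]


def replicate(shares):
    """Simulate the ring exchange: rank i sends its share to rank i + 1,
    so each rank ends up holding its own share and the previous rank's."""
    received = [None] * PARTIES
    for rank in range(PARTIES):
        next_rank = (rank + 1) % PARTIES
        received[next_rank] = shares[rank]
    return [(shares[rank], received[rank]) for rank in range(PARTIES)]


def local_product(x_pair, y_pair):
    """Cross terms a single rank can compute from the two shares it holds."""
    x_own, x_prev = x_pair
    y_own, y_prev = y_pair
    return (x_own * y_own + x_own * y_prev + x_prev * y_own) % RING


rng = random.Random(0)
x, y = 6, 7
x_pairs = replicate(share(x, rng))
y_pairs = replicate(share(y, rng))
z_shares = [local_product(xp, yp) for xp, yp in zip(x_pairs, y_pairs)]
product = sum(z_shares) % RING
print(product)
```

Across the three ranks, the local terms cover all nine cross products of the shares exactly once, so the sum of z_shares reconstructs x * y modulo 2^64. The exchange step in replicate is the part that fails in the report when the shares live on the GPU.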

Complete error output

Process process 0:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/experiments/multi_proc/multiprocess_launcher.py", line 62, in _run_process
    run_process_fn()
  File "/workspace/experiments/multi_proc/cuda_example_2pc.py", line 182, in test
    enc_aa * enc_bb
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 497, in __mul__
    return self.mul(tensor)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 338, in autograd_forward
    return self.__getattribute__(name)(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/mpc.py", line 339, in binary_wrapper_function
    result._tensor = getattr(result._tensor, name)(value, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 428, in mul
    return self._arithmetic_function(y, "mul")
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 384, in _arithmetic_function
    getattr(protocol, op)(result, y, *args, **kwargs).share.data
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 73, in mul
    return __replicated_secret_sharing_protocol("mul", x, y)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 59, in __replicated_secret_sharing_protocol
    x_shares, y_shares = replicate_shares([x.share, y.share])
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 28, in replicate_shares
    send_req = comm.get().isend(share.contiguous(), dst=next_rank)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/communicator.py", line 234, in logging_wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/distributed_communicator.py", line 133, in isend
    return dist.isend(tensor.data, dst, group=self.main_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 742, in isend
    return group.send([tensor], group_dst_rank, tag)
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.11]:60057: Bad address
terminate called after throwing an instance of 'gloo::IoException'
  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:378] writev [172.17.0.11]:61592: Bad address
Process process 2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/experiments/multi_proc/multiprocess_launcher.py", line 62, in _run_process
    run_process_fn()
  File "/workspace/experiments/multi_proc/cuda_example_2pc.py", line 182, in test
    enc_aa * enc_bb
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 497, in __mul__
    return self.mul(tensor)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/cryptensor.py", line 338, in autograd_forward
    return self.__getattribute__(name)(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/mpc.py", line 339, in binary_wrapper_function
    result._tensor = getattr(result._tensor, name)(value, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 428, in mul
    return self._arithmetic_function(y, "mul")
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/arithmetic.py", line 384, in _arithmetic_function
    getattr(protocol, op)(result, y, *args, **kwargs).share.data
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 73, in mul
    return __replicated_secret_sharing_protocol("mul", x, y)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 59, in __replicated_secret_sharing_protocol
    x_shares, y_shares = replicate_shares([x.share, y.share])
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/mpc/primitives/replicated.py", line 30, in replicate_shares
    recv_req = comm.get().irecv(rep_shares[-1], src=prev_rank)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/communicator.py", line 234, in logging_wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/crypten-0.4.0-py3.8.egg/crypten/communicator/distributed_communicator.py", line 139, in irecv
    return dist.irecv(tensor.data, src=src, group=self.main_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 781, in irecv
    return pg.recv([tensor], group_src_rank, tag)
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.11]:23944
Traceback (most recent call last):
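
One possible reading of the traceback, stated as an assumption and not as a confirmed diagnosis: the failing call is a point-to-point isend over the Gloo TCP transport, and the PyTorch distributed documentation lists Gloo send/recv as unsupported for CUDA tensors, so the share may need to be in host memory when it is sent. The sketch below shows that staging idea using duck typing only, so it runs without PyTorch. The names send_via_cpu, FakeTensor, and FakeComm are hypothetical and not part of CrypTen. The helper relies only on the is_cuda attribute and cpu() method that torch.Tensor provides, and on an isend(tensor, dst=...) method shaped like the one in the traceback.

```python
def send_via_cpu(comm, tensor, dst):
    """Copy a CUDA tensor to host memory before a point-to-point send.

    Returns the request handle together with the staged buffer; the caller
    should keep the buffer alive until the request has completed.
    """
    staged = tensor.cpu() if tensor.is_cuda else tensor
    request = comm.isend(staged, dst=dst)
    return request, staged


class FakeTensor:
    """Stand-in exposing only the two members the helper touches."""

    def __init__(self, is_cuda):
        self.is_cuda = is_cuda

    def cpu(self):
        return FakeTensor(is_cuda=False)


class FakeComm:
    """Stand-in communicator that records what it was asked to send."""

    def __init__(self):
        self.sent = []

    def isend(self, tensor, dst):
        self.sent.append((tensor.is_cuda, dst))
        return "request-handle"


comm = FakeComm()
request, staged = send_via_cpu(comm, FakeTensor(is_cuda=True), dst=1)
print(comm.sent)
```

The receiving side would need the mirror image: receive into a CPU buffer, then copy the result to the GPU. Whether this resolves the error reported here is untested.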

Machine setup

Single machine with 3 Tesla V100 GPUs
The code runs inside a Docker container
