
Installation Issue #1

Open
ardianumam opened this issue Aug 8, 2021 · 9 comments

Comments

@ardianumam

Hi,
Thanks for the awesome work. I'm trying to run the code via the non-docker method (because I only have non-sudo access to my server), and I encounter this error when running bash install.sh. Here is the error message.

mv: cannot stat 'build/lib.linux-x86_64-3.6/StructuralLosses': No such file or directory
Makefile:75: recipe for target 'all' failed
make: *** [all] Error 1

What should I do?

I notice that without the above installation, errors are encountered in metrics/evaluation_metrics.py at metrics.StructuralLosses. For now, I only want to run the code with dummy data to see the data flow (the value and shape of each variable) to help me understand the details of the paper, so running the non-optimal (no-CUDA) version is also OK for me. Any suggestion?

Many thanks!

@jw9730
Owner

jw9730 commented Aug 9, 2021

Hi, thank you for your interest in our work.
Can you run the following at the root directory (.../setvae), re-run the installation, and post the entire error message starting from bash install.sh?

mkdir -p metrics/pytorch_structural_losses/build/lib.linux-x86_64-3.6/StructuralLosses

We would also like to know whether all packages were successfully installed during pip install -r requirements.txt.

@ardianumam
Author

Hi,
Thanks for the reply. I ran the mkdir command as mentioned above and it completed fine. Then I set everything up from scratch on my Ubuntu 18 server with a 2080 Ti GPU: (i) created a new Python 3.6 env via conda and (ii) ran pip install -r requirements.txt. In (ii), I only encountered an error due to a deepspeed version incompatibility; by changing deepspeed==0.3.13 to deepspeed in requirements.txt, I could install all the requirements successfully. Then I proceeded to (iii) run bash install.sh, and the full output can be seen here; it also seems OK.

Then I proceeded to run CUDA_VISIBLE_DEVICES=9 bash scripts/mnist.sh and got the error message here. I already changed the batch_size inside batch_size.json to 8 and also 1, but got a similar error. Any suggestion?

Many thanks,
Ardian.

@jw9730
Owner

jw9730 commented Aug 9, 2021

This is strange. Can you run nvidia-smi and tell me how many GPUs are in the server?

By the way, deepspeed internally ignores the environment device specification (CUDA_VISIBLE_DEVICES=9). To restrict it to a single device, in principle you need to make a docker container with restricted visible GPUs. As this solution is unavailable, let's try this:

Can you remove --distributed from the script you are running, change deepspeed train.py ... to python3 train.py ..., and post the result?

If this works, you can proceed with python instead of deepspeed. In this case you need to adjust --batch_size to change the batch size, not the batch_size.json passed to --deepspeed_config.

@ardianumam
Author

I have 9 GPUs in the server.
[nvidia-smi screenshot listing the GPUs]

Here is the output after removing --distributed from the script and changing deepspeed train.py ... to python3 train.py. I'm investigating whether all the tensors and the model are on the GPU (tensor.to(device)). Meanwhile, any other suggestion?

CUDA_VISIBLE_DEVICES=9 bash scripts/mnist.sh
Arguments:
Namespace(activation='relu', batch_size=32, beta=0.01, beta1=0.9, beta2=0.999, bn_mode='eval', cates=['airplane'], d_net='set_transformer', dataset_scale=1.0, dataset_type='mnist', dec_in_layers=0, dec_out_layers=0, deepscale=False, deepscale_config=None, deepspeed=False, deepspeed_config='batch_size.json', deepspeed_mpi=False, denormalized_loss=False, device='cuda', digits=None, dist_backend='nccl', dist_url='tcp://127.0.0.1:9991', distributed=False, dropout_p=0.0, enc_in_layers=0, epochs=200, eval=False, eval_with_train_offset=False, exp_decay=1.0, exp_decay_freq=1, fixed_gmm=False, gpu=None, hidden_dim=64, i_net='elem_mlp', i_net_layers=0, init_dim=32, input_dim=2, isab_inds=16, kl_warmup_epochs=50, ln=True, local_rank=0, log_freq=10, log_name='gen/mnist/camera-ready', lr=0.001, matcher='chamfer', max_grad_norm=5.0, max_grad_threshold=None, max_outputs=400, max_validate_shapes=None, mnist_cache=None, mnist_data_dir='cache/mnist', momentum=0.9, multimnist_cache=None, multimnist_data_dir='cache/multimnist', n_mixtures=4, no_eval_sampling=False, no_validation=False, normalize_per_shape=False, normalize_std_per_axis=False, num_heads=4, num_workers=4, optimizer='adam', rank=0, residual=False, resume=False, resume_checkpoint=None, resume_dataset_mean=None, resume_dataset_std=None, resume_non_strict=False, resume_optimizer=True, save_freq=10, save_val_results=False, scheduler='linear', seed=42, shapenet_data_dir='/data/shapenet/ShapeNetCore.v2.PC15k', slot_att=True, standardize_per_shape=False, te_max_sample_points=2048, threshold=0.0, tr_max_sample_points=2048, train_gmm=False, use_bn=False, val_freq=1000, val_recon_only=False, viz_freq=10, warmup_epochs=0, weight_decay=0.0, world_size=1, z_dim=16, z_scales=[2, 4, 8, 16, 32])
[2021-08-09 15:35:28,105] [INFO] [distributed.py:37:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
--------------------------------------------------------------------------
[[48330,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: iserver

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[2021-08-09 15:35:28,563] [INFO] [distributed.py:89:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=140.113.24.146, master_port=29500
[2021-08-09 15:35:28,563] [INFO] [distributed.py:47:init_distributed] Initializing torch distributed with backend: nccl
number of params: 538914
number of generator params: 282594
Total number of data:60000
Max number of points: (train)342
Total number of data:10000
Max number of points: (test)290
Start epoch: 0 End epoch: 200
Traceback (most recent call last):
  File "train.py", line 223, in <module>
    main()
  File "train.py", line 219, in main
    main_worker(save_dir, args)
  File "train.py", line 166, in main_worker
    train_one_epoch(epoch, model, criterion, optimizer, args, train_loader, avg_meters, logger)
  File "/home/aumam/dev/gan/new_setvae/setvae/engine.py", line 24, in train_one_epoch
    output = model(gt, gt_mask)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 221, in forward
    bup = self.bottom_up(x, x_mask)
  File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 183, in bottom_up
    x = self.input(x)  # [B, N, D]
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Tensor for argument #3 'mat2' is on CPU, but expected it to be on GPU (while checking arguments for addmm)
Done

@jw9730
Owner

jw9730 commented Aug 9, 2021

At line 114 of train.py, you need to send the model and the data to the device, like this: model = model.to(args.device) (and similarly for the batches coming from the dataloaders).

There could be several more errors like this (unmatched device) because we checked our code with deepspeed (with the --distributed flag). Whenever a similar error happens, please add tensor = tensor.cuda() where it occurred as a temporary solution, as you mentioned.

We will shortly add a patch for non-deepspeed use cases.
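For reference, a minimal sketch of the kind of change I mean (illustrative only; the model and batch below are stand-ins, not the actual train.py objects):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in for SetVAE; the real model comes from models/networks.py.
model = nn.Linear(2, 64).to(device)              # move parameters/buffers to the GPU once

# Stand-ins for one batch from the MNIST point-cloud loader.
gt = torch.randn(8, 342, 2)                      # [B, N, input_dim], still on CPU
gt_mask = torch.zeros(8, 342, dtype=torch.bool)

# DataLoader batches live on the CPU, so move each batch before the forward pass.
gt, gt_mask = gt.to(device), gt_mask.to(device)
out = model(gt)                                  # no CPU/GPU mismatch in F.linear anymore
print(out.shape, out.device)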

@ardianumam
Author

Many thanks for your reply :)
I already modified the code so that all the related tensors are on the cuda device, and now I get this error. I already use batch_size=3 (small enough) and there is ~11 GB of unused GPU memory.

Start epoch: 0 End epoch: 200
gt:  cuda:0 torch.Size([3, 342, 2])
gt_mask:  cuda:0 torch.Size([3, 342])
x, x_mask device:  cuda:0 cuda:0
x device:  cuda:0
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 221, in main
    main_worker(save_dir, args)
  File "train.py", line 168, in main_worker
    train_one_epoch(epoch, model, criterion, optimizer, args, train_loader, avg_meters, logger)
  File "/home/aumam/dev/gan/new_setvae/setvae/engine.py", line 25, in train_one_epoch
    output = model(gt, gt_mask)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 226, in forward
    bup = self.bottom_up(x, x_mask)
  File "/home/aumam/dev/gan/new_setvae/setvae/models/networks.py", line 187, in bottom_up
    x = self.input(x)  # [B, N, D]
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`
Done

@jw9730
Owner

jw9730 commented Aug 9, 2021

I honestly have no idea about this one. It seems like an issue with the torch backend, because the problematic module is nn.Linear(), not a custom one.
Can you open a new python3 console and see if the following code works without error?

import torch
import torch.nn as nn
fc = nn.Linear(100, 100).to('cuda')
x = torch.randn(200, 100).to('cuda')
o = fc(x)
print(o.size())

@ardianumam
Author

Yes, I can run the above code without error. It turns out that I installed torch built for cudatoolkit=10 while my nvidia-smi reports CUDA 11. After upgrading torch to the CUDA 11 build, it's OK. However, another error is encountered, because of (a) model.optimizer.zero_grad() and (b) model.backward(loss).

File "train.py", line 168, in main_worker
    train_one_epoch(epoch, model, criterion, optimizer, args, train_loader, avg_meters, logger)
  File "/home/aumam/dev/gan/new_setvae/setvae/engine.py", line 46, in train_one_epoch
    model.optimizer.zero_grad()
  File "/home/aumam/.conda/envs/setvae/lib/python3.6/site-packages/torch/nn/modules/module.py", line 948, in __getattr__
    type(self).__name__, name))
AttributeError: 'SetVAE' object has no attribute 'optimizer'

Usually people just use (c) optimizer.zero_grad() and (d) loss.backward(). However, if I use (c) and (d), I get this error:

File "/home/aumam/dev/gan/new_setvae/setvae/engine.py", line 53, in train_one_epoch
    param_norm = p.grad.data.norm(2)
AttributeError: 'NoneType' object has no attribute 'data'

I guess this is related to the deepspeed style? For my temporary purpose, I just commented out everything related to them so the program runs :D (again, my purpose for now is just to understand the details of the paper via the code).
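For reference, a minimal sketch of the plain-PyTorch step I am using as a workaround (the model and loss below are stand-ins, not the actual engine.py code):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-ins for the real SetVAE model and its loss.
model = nn.Linear(2, 2).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(3, 342, 2, device=device)
loss = model(x).pow(2).mean()

optimizer.zero_grad()   # instead of model.optimizer.zero_grad()
loss.backward()         # instead of model.backward(loss)

# Guard against parameters with no gradient when computing the grad norm,
# which is what triggered the 'NoneType' error in engine.py.
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.data.norm(2).item() ** 2
total_norm = total_norm ** 0.5

optimizer.step()
print(total_norm)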

@jw9730
Owner

jw9730 commented Aug 9, 2021

You are right. These are all related to the deepspeed-style handling of the model, gradients, and optimizer.
I will work on a patch right after the NeurIPS rebuttal period. For now, please comment them out.
My apologies; I apparently had a bad coding style when I was working on the paper.
