How to use multi-training without slurm system? #458

Open
stargolike opened this issue Jun 11, 2024 · 18 comments
@stargolike

stargolike commented Jun 11, 2024

Hello dear developers, I ran this script:

python /root/mace/scripts/run_train.py \
  --name="MACE_model" \
  --train_file="train.xyz" \
  --valid_fraction=0.05 \
  --test_file="test.xyz" \
  --config_type_weights='{"Default":1.0}' \
  --model="MACE" \
  --hidden_irreps='128x0e + 128x1o' \
  --r_max=5.0 \
  --batch_size=10 \
  --energy_key="energy" \
  --forces_key="forces" \
  --max_num_epochs=100 \
  --swa \
  --start_swa=80 \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --restart_latest \
  --device=cuda \

My computer has two 4090 GPUs and I have not installed Slurm, so I get this error:

ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'

How can I solve this problem?

@ilyes319
Contributor

If it is just on a single node you can use the interface in the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multigpu.html, on the single-node section.
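On a single node with two GPUs this boils down to enabling the distributed option and launching run_train.py through torchrun, roughly along the lines of the command used later in this thread (paths illustrative):

torchrun --standalone --nnodes=1 --nproc_per_node=2 mace/cli/run_train.py --config="config.yaml"

with distributed enabled in config.yaml (e.g. distributed: yes).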

@stargolike
Author

If it is just on a single node you can use the interface in the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multigpu.html, on the single-node section.

Thanks for your reply. I followed the tutorial and changed my setup, but I have run into a new problem:

Wed Jun 12 19:38:53 CST 2024
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] 
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] *****************************************
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] *****************************************
ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'
ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'

I'm running:

source activate mace
torchrun --standalone --nnodes=1 --nproc_per_node=2 mace/mace/cli/run_train.py  --config="config.yaml"

and this is the config.yaml:

name: MACE_model
config_type_weights: {"Default":1.0}
model: "MACE"
hidden_irreps: '128x0e + 128x1o'
r_max: 5.0
train_file: train.xyz
test_file: test.xyz
valid_fraction: 0.05
batch_size: 10
energy_key: "energy"
forces_key: "forces"
swa: yes
start_swa: 80
ema: yes
ema_decay: 0.99 
amsgrad: yes
restart_latest: yes
max_num_epochs: 100
device: cuda 
loss: "huber"
distributed: yes

@ilyes319
Contributor

You should comment out the _setup_distr_env(self): function in mace/tools/slurm_distributed.py
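For reference, here is a minimal sketch of what that edit could look like, assuming the class only needs MASTER_ADDR, MASTER_PORT, WORLD_SIZE, LOCAL_RANK and RANK (the attribute names are illustrative, not necessarily the exact ones in mace/tools/slurm_distributed.py). When launching with torchrun those variables are already exported, so the SLURM lookups can simply be replaced with defaults:

import os


class DistributedEnvironment:
    def __init__(self) -> None:
        self._setup_distr_env()
        self.master_addr = os.environ["MASTER_ADDR"]
        self.master_port = os.environ["MASTER_PORT"]
        self.world_size = int(os.environ["WORLD_SIZE"])
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.rank = int(os.environ["RANK"])

    def _setup_distr_env(self) -> None:
        # torchrun --standalone already exports MASTER_ADDR, MASTER_PORT,
        # WORLD_SIZE, LOCAL_RANK and RANK, so only fill in defaults here
        # instead of reading SLURM_JOB_NODELIST and the other SLURM variables.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        os.environ.setdefault("WORLD_SIZE", "1")
        os.environ.setdefault("LOCAL_RANK", "0")
        os.environ.setdefault("RANK", "0")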

@stargolike
Author

stargolike commented Jun 13, 2024

You should comment out the _setup_distr_env(self): function in mace/tools/slurm_distributed.py

I modified slurm_distributed.py in the installed mace package (located at /opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools), and it can now run.
But when I run multi-GPU training on two 4090 GPUs (I want to train a bigger system, so I need more memory), I run into trouble: CUDA out of memory.

[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.43 GiB. GPU � has a total capacity of 23.65 GiB of which 586.25 MiB is free. Including non-PyTorch memory, this process has 23.05 GiB memory in use. Of the allocated memory 18.60 GiB is allocated by PyTorch, and 3.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
E0613 12:40:41.468833 140559193007296 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 9550) of binary: /opt/miniconda/envs/mace/bin/python

I wonder whether multi-GPU training is actually being enabled:

2024-06-13 12:40:09.875 INFO: CUDA version: 12.1, CUDA device: 0
2024-06-13 12:40:11.661 INFO: Using isolated atom energies from training file
2024-06-13 12:40:11.712 INFO: Loaded 768 training configurations from 'train.xyz'
2024-06-13 12:40:11.712 INFO: Using random 5.0% of training set for validation
2024-06-13 12:40:11.941 INFO: Since ASE version 3.23.0b1, using energy_key 'energy' is no longer safe when communicating 
2024-06-13 12:40:12.035 INFO: Loaded 192 test configurations from 'test.xyz'
2024-06-13 12:40:12.035 INFO: Total number of configurations: train=730, valid=38, tests=[Default: 192]
2024-06-13 12:40:12.057 INFO: AtomicNumberTable: (1, 8, 17, 30)
2024-06-13 12:40:12.057 INFO: Atomic energies: [-0.15222862, -0.08918347, -0.07295653, -0.01178265]
2024-06-13 12:40:15.365 INFO: WeightedHuberEnergyForcesStressLoss(energy_weight=1.000, forces_weight=100.000, stress_weight=1.000)
2024-06-13 12:40:17.220 INFO: Average number of neighbors: 52.435900568044566
2024-06-13 12:40:17.221 INFO: Selected the following outputs: {'energy': True, 'forces': True, 'virials': True, 'stress': True, 'dipoles': False}
2024-06-13 12:40:17.466 INFO: Building model
2024-06-13 12:40:17.466 INFO: Hidden irreps: 128x0e + 128x1o

@stargolike
Author

stargolike commented Jun 14, 2024

You should comment out the _setup_distr_env(self): function in mace/tools/slurm_distributed.py

I found #143 and wanted to solve the problem that way, so I reinstalled the version that uses Hugging Face Accelerate, but it also fails to run.

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `2`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.

and it also runs out of memory:

[rank1]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB. GPU � has a total capacity of 23.65 GiB of which 772.25 MiB is free. Including non-PyTorch memory, this process has 22.87 GiB memory in use. Of the allocated memory 19.08 GiB is allocated by PyTorch, and 2.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/input_lbg-19657-13216338/./mace/scripts/run_train.py", line 596, in <module>
[rank0]:     main()
[rank0]:   File "/input_lbg-19657-13216338/./mace/scripts/run_train.py", line 500, in main
[rank0]:     tools.train(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/tools/train.py", line 93, in train
[rank0]:     train_one_epoch(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/tools/train.py", line 234, in train_one_epoch
[rank0]:     _, opt_metrics = take_step(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/tools/train.py", line 262, in take_step
[rank0]:     output = model(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/modules/models.py", line 344, in forward
[rank0]:     forces, virials, stress = get_outputs(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/modules/utils.py", line 135, in get_outputs
[rank0]:     compute_forces(energy=energy, positions=positions, training=training),
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/modules/utils.py", line 26, in compute_forces
[rank0]:     gradient = torch.autograd.grad(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/autograd/__init__.py", line 412, in grad
[rank0]:     result = _engine_run_backward(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.51 GiB. GPU 
W0613 23:39:31.075506 140154674787520 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 45322 closing signal SIGTERM
E0613 23:39:31.139620 140154674787520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 45321) of binary: /opt/miniconda/envs/multi_train/bin/python

I also tested training on 4 GPUs, but a similar problem happened.
I have tried two methods and still cannot get the two 4090 GPUs running. Is this because MACE copies the complete model to each GPU during multi-GPU training?

@svandenhaute

I created a modified train script for this which doesn't use the whole DistributedEnvironment thing.

See here. Essentially it comes down to setting the required environment variables manually:

import argparse
import os

import torch
import torch.distributed

from mace import tools


def main() -> None:
    """
    This script runs the training/fine tuning for mace
    """
    args = tools.build_default_arg_parser().parse_args()
    if args.distributed:
        # Spawn one training process per visible GPU on this node.
        world_size = torch.cuda.device_count()
        import torch.multiprocessing as mp
        mp.spawn(run, args=(args, world_size), nprocs=world_size)
    else:
        run(0, args, 1)


def run(rank: int, args: argparse.Namespace, world_size: int) -> None:
    """
    This script runs the training/fine tuning for mace
    """
    tag = tools.get_tag(name=args.name, seed=args.seed)
    if args.distributed:
        # try:
        #     distr_env = DistributedEnvironment()
        # except Exception as e:  # pylint: disable=W0703
        #     logging.error(f"Failed to initialize distributed environment: {e}")
        #     return
        # world_size = distr_env.world_size
        # local_rank = distr_env.local_rank
        # rank = distr_env.rank
        # if rank == 0:
        #     print(distr_env)
        # torch.distributed.init_process_group(backend="nccl")
        local_rank = rank
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(
            backend='nccl',
            rank=rank,
            world_size=world_size,
        )
    else:
        pass
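Note that this spawn-based variant is meant to be launched as a single plain process (python run_train.py <your_training_args> with --distributed enabled); mp.spawn creates one worker per visible GPU by itself, so wrapping it in torchrun starts the spawner once per GPU and the workers then compete for the same rendezvous port, as the exchange below shows.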

@stargolike
Author

I created a modified train script for this which doesn't use the whole DistributedEnvironment thing. See here.

Hello, I used your method to change the code, but some errors occurred that I don't understand.

[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:12355 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.

and torch also raises this error:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12355 (errno: 98 - Address already in use).

I don't think this is a port problem, because even if I switch to a port that has never been used before, I still get the same error.

@svandenhaute

Sounds like you're starting it twice. Make sure to use python run_train.py <your_training_args> instead of torchrun.

@stargolike
Author

stargolike commented Jun 15, 2024

Sounds like you're starting it twice. Make sure to use python run_train.py <your_training_args> instead of torchrun.

Thanks for your reply.
I changed my command to:

source activate mace
nvidia-smi
python mace/mace/cli/run_train.py  --config="config.yaml"

and it still runs out of CUDA memory:

  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/input_lbg-19657-13263068/mace/mace/cli/run_train.py", line 705, in run
    tools.train(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools/train.py", line 179, in train
    train_one_epoch(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools/train.py", line 289, in train_one_epoch
    _, opt_metrics = take_step(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools/train.py", line 319, in take_step
    output = model(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/modules/models.py", line 391, in forward
    forces, virials, stress = get_outputs(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/modules/utils.py", line 126, in get_outputs
    forces, virials, stress = compute_forces_virials(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/modules/utils.py", line 51, in compute_forces_virials
    forces, virials = torch.autograd.grad(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/autograd/__init__.py", line 412, in grad
    result = _engine_run_backward(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.45 GiB. GPU 

I am running on two 4090 GPUs.

@ilyes319
Contributor

Can you share your new log file? It does not seem to be using the two GPUs.

@stargolike
Author

Can you share your new log file? It does not seem to be using the two GPUs.

Thanks, here is my log file:
1.log

@stargolike
Author

Can you share your new log file? It does not seem to be using the two GPUs.

Hello, I would like to run MACE on a single machine with multiple GPUs (several 4090s), but the other approaches I have tried recently have all failed. Can you explain in detail why the dual-GPU run did not succeed?
Also, can multi-GPU training be divided into three modes: splitting the data but not the model, splitting the model but not the data, or splitting both?
Thank you for your help.

@ShiQiaoL

ShiQiaoL commented Jul 9, 2024

I recently encountered the same problem. When I try to use two GPUs for a job, the CPU is automatically selected for execution instead. Below are screenshots of my slurm.sub and the error message.

[screenshots: slurm.sub and the error messages]

@ilyes319
Contributor

ilyes319 commented Jul 9, 2024

Can you tell me what branch you are using? Note that we only support the official repo and not any modified fork. Also, please share your full log file, not screenshots.

@ShiQiaoL

ShiQiaoL commented Jul 9, 2024

Can you tell me what branch you are using? Note that we only support the official repo and not any modified fork. Also, please share your full log file, not screenshots.

I am using the official branch. Here is my slurm submission script run_train.txt and the corresponding log file for the error slurm-2522.log

@ilyes319
Contributor

ilyes319 commented Jul 9, 2024

Does a single-GPU run work? Have you edited the Slurm config file to match your environment variables?

@stargolike
Author

stargolike commented Jul 9, 2024

Can you tell me what branch you are using? Note that we only support the official repo and not any modified fork. Also, please share your full log file, not screenshots.

I am using the official branch. Here is my slurm submission script run_train.txt and the corresponding log file for the error slurm-2522.log

ShiQiaoL, distributed in your config is not set to True. You should look at this documentation page: multi. I also think you should write your config as a YAML file; it is much clearer.

@stargolike
Author

Does a single-GPU run work? Have you edited the Slurm config file to match your environment variables?

A single GPU works. I revised the file following your suggestion, but my machine does not have Slurm; it is standalone.
