Fail to finetune with several GPU #71

Open

banalg opened this issue Jun 13, 2024 · 3 comments

banalg commented Jun 13, 2024

Hello,

We successfully fine-tuned the Mistral7b_v0.3 Instruct model using a single GPU, but we encountered issues when trying to utilize multiple GPUs.

The successful fine-tuning with one GPU (A10, 24 GB) was achieved with the following settings:

  • Limited the training to 100 steps with a sequence length of 1000
  • Forced the use of a single GPU using the environment variable CUDA_VISIBLE_DEVICES
  • Employed a small training file containing around 150 messages

However, we have not been able to configure the setup to use more than one GPU, which limits our ability to improve training quality and the amount of knowledge the model can absorb.

When using several GPUs, train.py seems to hang at the dist.barrier() call (line 97).
We bypassed this by setting the environment variable NCCL_P2P_DISABLE=1, but then the run hangs around batch = next(data_loader) (line 228).
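
To help isolate the problem, a minimal torch.distributed test like the following (a sketch assuming only PyTorch; the file name nccl_smoke.py is ours) can be launched with the same torchrun command to check whether dist.barrier() also hangs outside of train.py:

# nccl_smoke.py - minimal NCCL sanity check, independent of mistral-finetune
# Run with: CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 nccl_smoke.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Same backend train.py uses; if this barrier also hangs, the problem is in
# the NCCL / GPU setup of the host rather than in the training code.
dist.init_process_group(backend="nccl")
dist.barrier()

# One small collective to exercise GPU-to-GPU communication.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # default op is SUM, so the result should equal the world size
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()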

Thank you for your assistance.

Here are the details of our setup:

  • AWS g5.12xlarge (4× A10, 24 GB each)
  • Ubuntu 22.04 LTS
  • Python 3.10 venv
  • NCCL version 2.19.3+cuda12.3 (we tried several versions without success)
  • Tested as both root and non-root users

Command used to run the training:

CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml

The config file example/config_instruct_v1.yaml:

data:
  instruct_data: "../data/instruct_request_v0.2.json"  # Fill
  data: ""  # Optionally fill with pretraining data
  eval_instruct_data: ""  # Optionally fill

# model
model_id_or_path: "../mistral_models/instruct"  # Change to downloaded path
lora:
  rank: 64

# optim
seq_len: 32768
batch_size: 1
max_steps: 300
optim:
  lr: 6.e-5
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: True
ckpt_freq: 100

save_adapters: True  # save only trained LoRA adapters. Set to `False` to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/data/ft/finetuning_instruct_admin_1"

Logs of train.py:

(.venv2) root@ip-10-10-10-10:/data/ft/mistral-finetune# TORCH_LOGS="all" CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING]
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] *****************************************
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] *****************************************
[2024-06-13 21:57:36,865] torch.distributed.elastic.rendezvous.static_tcp_rendezvous: [INFO] Creating TCPStore as the c10d::Store implementation
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpm3mt26aa
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpm3mt26aa/_remote_module_non_scriptable.py
[2024-06-13 21:57:38,581] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpt13cbb4l
[2024-06-13 21:57:38,582] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpt13cbb4l/_remote_module_non_scriptable.py
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,323] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Run dir: /data/ft/finetuning_instruct_admin_1
[rank0]:[2024-06-13 21:57:39,324] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,555] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[rank1]:[2024-06-13 21:57:39,556] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
NCCL version 2.19.3+cuda12.3
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - TrainArgs: {'batch_size': 1,
'checkpoint': True,
'ckpt_freq': 100,
'data': {'data': '',
'eval_instruct_data': '',
'instruct': {'dynamic_chunk_fn_call': True, 'shuffle': True},
'instruct_data': '../data/instruct_request_v0.2.json',
'shuffle': False},
'eval_freq': 100,
'log_freq': 1,
'lora': {'dropout': 0.0, 'enable': True, 'rank': 64, 'scaling': 2.0},
'max_norm': 1.0,
'max_steps': 300,
'mlflow': {'experiment_name': None, 'tracking_uri': None},
'model_id_or_path': '../mistral_models/instruct',
'no_ckpt': False,
'no_eval': True,
'num_ckpt_keep': 3,
'num_microbatches': 1,
'optim': {'lr': 6e-05, 'pct_start': 0.05, 'weight_decay': 0.1},
'run_dir': '/data/ft/finetuning_instruct_admin_1',
'save_adapters': True,
'seed': 0,
'seq_len': 32768,
'wandb': {'key': None, 'offline': False, 'project': None, 'run_name': None},
'world_size': 2}
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Reloading model from ../mistral_models/instruct/consolidated.safetensors ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Converting model to dtype torch.bfloat16 ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Loaded model on cpu!
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Initializing lora layers ...
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Finished initialization!
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Sharding model over 2 GPUs ...
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - Model sharded!
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - 167,772,160 out of 7,415,795,712 parameters are finetuned (2.26%).
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Loading ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - ../data/instruct_request_v0.2.json loaded and tokenized.
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...

NCCL logs:

tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19202:19202 [0] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19202:19202 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19202 [0] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19202:19202 [0] NCCL INFO NCCL version 2.19.3+cuda12.3
tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19203:19203 [1] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19203:19203 [1] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19203:19203 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ip-10-10-10-10:19202:19217 [0] NCCL INFO NET/Socket : Using [0]ens5:10.10.10.10<0> [1]br-ae5b5c3787a1:172.19.0.1<0> [2]veth9ea7c0d:fe80::b498:75ff:fe18:107d%veth9ea7c0d<0> [3]veth634837c:fe80::5cd1:75ff:fe11:9735%veth634837c<0> [4]vethbf40cbc:fe80::fc73:9aff:fefd:cbec%vethbf40cbc<0> [5]veth8ae4512:fe80::d49b:5fff:fe17:6b0c%veth8ae4512<0> [6]veth3f1c863:fe80::c464:80ff:fe3e:cd9f%veth3f1c863<0> [7]veth7c8f20a:fe80::64c0:e9ff:fec8:be06%veth7c8f20a<0>
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using non-device net plugin version 0
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using network Socket
2 cudaDev 1 nvmlDev 3 busId 1e0 commId 0x1e2c7b29947d4dbf - Init START
2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init START
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
-1/-1/-1->1->0
ip-10-10-10-10:19203:19218 [1] NCCL INFO P2P Chunksip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01/02 : 0 1
ip-10-10-10-10:19202:19217 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ip-10-10-10-10:19202:19217 [0] NCCL INFO P2P Chunksize set to 131072
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct
| 512
16-46-125:19202:19217 [0] NCCL INFO Connected all rings
ip-10-10-10-10:19202:19217 [0] NCCL INFO Connected all trees
ip-10-10-10-10:19202:19217 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-10-10-10-10:19202:19217 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ip-10-10-10-10:19202:19217 [0] NCCL INFO comm 0x88a2220 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init COMPLETE


banalg commented Jun 14, 2024

It's working now. We simply halted the instance for the night, and after restarting it in the morning, the fine-tuning with all 4 GPUs worked. It "fell into working order" on its own, as we usually say in French, but I would prefer to understand why we had issues in the first place. Our instance likely started on different underlying hardware than yesterday. Could you recommend some checks to identify the hardware and software configuration of the host that could affect multi-GPU fine-tuning?
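
For reference, this is roughly the environment check we run before launching multi-GPU jobs now (a sketch; gpu_env_check.py is our own helper and only covers version info and CUDA peer access, not every NCCL-level issue):

# gpu_env_check.py - print GPU / NCCL details relevant to multi-GPU fine-tuning
import torch

print("torch:", torch.__version__,
      "| CUDA:", torch.version.cuda,
      "| NCCL:", ".".join(map(str, torch.cuda.nccl.version())))

n = torch.cuda.device_count()
for i in range(n):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

# CUDA peer-to-peer access between every pair of visible GPUs;
# missing P2P is what NCCL_P2P_DISABLE=1 works around.
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"P2P {i} -> {j}: {torch.cuda.can_device_access_peer(i, j)}")

On the host itself, nvidia-smi topo -m also shows how the visible GPUs are connected, which may differ between placements of the same instance type.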

We'll wait a few days before closing this issue.

@Aniket-J

Did you figure out what has been causing this? We have a similar setup to yours and tried with NCCL_P2P_DISABLE set to 1; however, we're using g4.12xlarge rather than g5.

@SaiKrishnaBala

Is it possible to run these scripts on a Ray cluster as a training job?
