Fail to finetune with several GPU #71

Open

banalg opened this issue Jun 13, 2024 · 3 comments

banalg commented Jun 13, 2024

Hello,

We successfully fine-tuned the Mistral7b_v0.3 Instruct model using a single GPU, but we encountered issues when trying to utilize multiple GPUs.

The successful fine-tuning with one GPU (A10, 24 GB) was achieved with the following settings:

  • Limited the training to 100 steps with a sequence length of 1000
  • Forced the use of a single GPU using the environment variable CUDA_VISIBLE_DEVICES
  • Employed a small training file containing around 150 messages

However, we have not been able to configure the setup to use more than one GPU, which limits our ability to improve training quality and the amount of knowledge the model can absorb.

When using several GPUs, train.py seems to hang at the dist.barrier() call (line 97).
We bypassed this by setting the environment variable NCCL_P2P_DISABLE=1, but then the run hangs around batch = next(data_loader) (line 228).
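
To help isolate the problem, a minimal torch.distributed test like the following (a sketch assuming only PyTorch; the file name nccl_smoke.py is ours) can be launched with the same torchrun command to check whether dist.barrier() also hangs outside of train.py:

# nccl_smoke.py - minimal NCCL sanity check, independent of mistral-finetune
# Run with: CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 nccl_smoke.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Same backend train.py uses; if this barrier also hangs, the problem is in
# the NCCL / GPU setup of the host rather than in the training code.
dist.init_process_group(backend="nccl")
dist.barrier()

# One small collective to exercise GPU-to-GPU communication.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # default op is SUM, so the result should equal the world size
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()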

Thank you for your assistance.

Here are the details of our setup:

  • AWS g5.12xlarge (4× A10, 24 GB each)
  • Ubuntu 22.04 LTS
  • Python 3.10 venv
  • NCCL version 2.19.3+cuda12.3 (we tried several versions without success)
  • Tested as both root and non-root users

Command used to run the training:

CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml

The config file example/config_instruct_v1.yaml:

data:
  instruct_data: "../data/instruct_request_v0.2.json"  # Fill
  data: ""  # Optionally fill with pretraining data
  eval_instruct_data: ""  # Optionally fill

# model
model_id_or_path: "../mistral_models/instruct"  # Change to downloaded path
lora:
  rank: 64

# optim
seq_len: 32768
batch_size: 1
max_steps: 300
optim:
  lr: 6.e-5
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: True
ckpt_freq: 100

save_adapters: True  # save only trained LoRA adapters. Set to `False` to merge LoRA adapter into the base model and save full fine-tuned model

run_dir: "/data/ft/finetuning_instruct_admin_1"

Logs of train.py:

(.venv2) root@ip-10-10-10-10:/data/ft/mistral-finetune# TORCH_LOGS="all" CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING]
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] *****************************************
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] *****************************************
[2024-06-13 21:57:36,865] torch.distributed.elastic.rendezvous.static_tcp_rendezvous: [INFO] Creating TCPStore as the c10d::Store implementation
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpm3mt26aa
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpm3mt26aa/_remote_module_non_scriptable.py
[2024-06-13 21:57:38,581] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpt13cbb4l
[2024-06-13 21:57:38,582] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpt13cbb4l/_remote_module_non_scriptable.py
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,323] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Run dir: /data/ft/finetuning_instruct_admin_1
[rank0]:[2024-06-13 21:57:39,324] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,555] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[rank1]:[2024-06-13 21:57:39,556] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
NCCL version 2.19.3+cuda12.3
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - TrainArgs: {'batch_size': 1,
'checkpoint': True,
'ckpt_freq': 100,
'data': {'data': '',
'eval_instruct_data': '',
'instruct': {'dynamic_chunk_fn_call': True, 'shuffle': True},
'instruct_data': '../data/instruct_request_v0.2.json',
'shuffle': False},
'eval_freq': 100,
'log_freq': 1,
'lora': {'dropout': 0.0, 'enable': True, 'rank': 64, 'scaling': 2.0},
'max_norm': 1.0,
'max_steps': 300,
'mlflow': {'experiment_name': None, 'tracking_uri': None},
'model_id_or_path': '../mistral_models/instruct',
'no_ckpt': False,
'no_eval': True,
'num_ckpt_keep': 3,
'num_microbatches': 1,
'optim': {'lr': 6e-05, 'pct_start': 0.05, 'weight_decay': 0.1},
'run_dir': '/data/ft/finetuning_instruct_admin_1',
'save_adapters': True,
'seed': 0,
'seq_len': 32768,
'wandb': {'key': None, 'offline': False, 'project': None, 'run_name': None},
'world_size': 2}
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Reloading model from ../mistral_models/instruct/consolidated.safetensors ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Converting model to dtype torch.bfloat16 ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Loaded model on cpu!
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Initializing lora layers ...
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Finished initialization!
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Sharding model over 2 GPUs ...
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - Model sharded!
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - 167,772,160 out of 7,415,795,712 parameters are finetuned (2.26%).
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Loading ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - ../data/instruct_request_v0.2.json loaded and tokenized.
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...

NCCL logs:

tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19202:19202 [0] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19202:19202 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19202 [0] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19202:19202 [0] NCCL INFO NCCL version 2.19.3+cuda12.3
tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19203:19203 [1] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19203:19203 [1] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19203:19203 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ip-10-10-10-10:19202:19217 [0] NCCL INFO NET/Socket : Using [0]ens5:10.10.10.10<0> [1]br-ae5b5c3787a1:172.19.0.1<0> [2]veth9ea7c0d:fe80::b498:75ff:fe18:107d%veth9ea7c0d<0> [3]veth634837c:fe80::5cd1:75ff:fe11:9735%veth634837c<0> [4]vethbf40cbc:fe80::fc73:9aff:fefd:cbec%vethbf40cbc<0> [5]veth8ae4512:fe80::d49b:5fff:fe17:6b0c%veth8ae4512<0> [6]veth3f1c863:fe80::c464:80ff:fe3e:cd9f%veth3f1c863<0> [7]veth7c8f20a:fe80::64c0:e9ff:fec8:be06%veth7c8f20a<0>
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using non-device net plugin version 0
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using network Socket
2 cudaDev 1 nvmlDev 3 busId 1e0 commId 0x1e2c7b29947d4dbf - Init START
2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init START
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
-1/-1/-1->1->0
ip-10-10-10-10:19203:19218 [1] NCCL INFO P2P Chunksip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01/02 : 0 1
ip-10-10-10-10:19202:19217 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ip-10-10-10-10:19202:19217 [0] NCCL INFO P2P Chunksize set to 131072
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct
| 512
16-46-125:19202:19217 [0] NCCL INFO Connected all rings
ip-10-10-10-10:19202:19217 [0] NCCL INFO Connected all trees
ip-10-10-10-10:19202:19217 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-10-10-10-10:19202:19217 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ip-10-10-10-10:19202:19217 [0] NCCL INFO comm 0x88a2220 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init COMPLETE


banalg commented Jun 14, 2024

It's working now. We simply halted the instance for the night, and after restarting it in the morning, the fine-tuning with all 4 GPUs worked. It "fell into working order" on its own, as we usually say in French, but I would prefer to understand why we had issues in the first place. Our instance likely started on different underlying hardware than yesterday. Could you recommend some checks to identify the hardware and software configuration of the host that could affect multi-GPU fine-tuning?
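
For reference, this is roughly the environment check we run before launching multi-GPU jobs now (a sketch; gpu_env_check.py is our own helper and only covers version info and CUDA peer access, not every NCCL-level issue):

# gpu_env_check.py - print GPU / NCCL details relevant to multi-GPU fine-tuning
import torch

print("torch:", torch.__version__,
      "| CUDA:", torch.version.cuda,
      "| NCCL:", ".".join(map(str, torch.cuda.nccl.version())))

n = torch.cuda.device_count()
for i in range(n):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")

# CUDA peer-to-peer access between every pair of visible GPUs;
# missing P2P is what NCCL_P2P_DISABLE=1 works around.
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"P2P {i} -> {j}: {torch.cuda.can_device_access_peer(i, j)}")

On the host itself, nvidia-smi topo -m also shows how the visible GPUs are connected, which may differ between placements of the same instance type.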

We'll wait a few days before closing this issue.

@Aniket-J

Did you figure out what has been causing this? We have a similar setup to yours and tried with NCCL_P2P_DISABLE set to 1; however, we're using g4.12xlarge rather than g5.

@SaiKrishnaBala

Is it possible to run these scripts on a Ray cluster as a training job?
