We successfully fine-tuned the Mistral7b_v0.3 Instruct model using a single GPU, but we encountered issues when trying to utilize multiple GPUs.
The successful fine-tuning with one GPU (A10, 24 GB) was achieved with the following settings:
Limited the training to 100 steps with a sequence length of 1000
Forced the use of a single GPU using the environment variable CUDA_VISIBLE_DEVICES
Employed a small training file containing around 150 messages
However, we have not been able to configure the setup to use more than one GPU, which limits our ability to improve training quality and the amount of knowledge we can add to the model.
When using several GPUs, train.py seems to block at the dist.barrier() call (line 97).
We bypassed this by setting the environment variable NCCL_P2P_DISABLE=1, but then we get blocked around batch = next(data_loader) (line 228).
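To help isolate the hang, here is a minimal standalone test (a rough sketch of our own, not part of mistral-finetune) that only initializes the process group and performs a barrier followed by an all_reduce on the same two GPUs. If this also blocks on the instance, the problem is in the NCCL/topology layer rather than in train.py:

```python
# barrier_test.py -- minimal NCCL sanity check (our own sketch, not from mistral-finetune)
# Run with: CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 barrier_test.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Same collective that train.py blocks on for us (its line 97).
    dist.barrier()
    print(f"rank {dist.get_rank()}: passed barrier", flush=True)

    # A real NCCL collective over GPU memory.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce -> {t.item()}", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```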
Thank you for your assistance.
Here are the details of our setup:
AWS g5.12xlarge (4x A10, 24 GB each)
Ubuntu 22.04 LTS
Python 3.10 venv
NCCL version 2.19.3+cuda12.3 (we tried several versions without success)
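For reference, the peer-to-peer capability between the visible GPUs can be inspected directly from Python (a minimal sketch using standard torch.cuda utilities; nothing here is specific to mistral-finetune):

```python
# p2p_check.py -- rough sketch to inspect P2P access between visible GPUs
import torch

n = torch.cuda.device_count()
print(f"CUDA devices visible: {n}")
print(f"NCCL version reported by torch: {torch.cuda.nccl.version()}")
for i in range(n):
    print(f"  [{i}] {torch.cuda.get_device_name(i)}")

# Pairwise peer-access matrix: all-False would explain why
# NCCL_P2P_DISABLE=1 changes the behaviour on this instance.
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"  P2P {i} -> {j}: {torch.cuda.can_device_access_peer(i, j)}")
```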
Command used to run the training
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml
The config file example/config_instruct_v1.yaml
data:
  instruct_data: "../data/instruct_request_v0.2.json" # Fill
  data: "" # Optionally fill with pretraining data
  eval_instruct_data: "" # Optionally fill

# model
model_id_or_path: "../mistral_models/instruct" # Change to downloaded path

lora:
  rank: 64

# optim
seq_len: 32768
batch_size: 1
max_steps: 300
optim:
  lr: 6.e-5
  weight_decay: 0.1
  pct_start: 0.05

# other
seed: 0
log_freq: 1
eval_freq: 100
no_eval: True
ckpt_freq: 100
save_adapters: True # save only trained LoRA adapters. Set to `False` to merge LoRA adapter into the base model and save full fine-tuned model
run_dir: "/data/ft/finetuning_instruct_admin_1"
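A quick way to confirm that the nesting of this file parses as intended (a trivial sketch using PyYAML, run from the mistral-finetune checkout):

```python
# check_config.py -- trivial sketch to confirm the YAML above parses as intended
import pprint

import yaml  # pip install pyyaml

with open("example/config_instruct_v1.yaml") as f:
    cfg = yaml.safe_load(f)

pprint.pprint(cfg)
# Expect nested 'data', 'lora' and 'optim' sections plus top-level scalars.
assert cfg["lora"]["rank"] == 64
assert cfg["seq_len"] == 32768
assert cfg["data"]["instruct_data"].endswith("instruct_request_v0.2.json")
```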
Logs of train.py
(.venv2) root@ip-10-10-10-10:/data/ft/mistral-finetune# TORCH_LOGS="all" CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc-per-node 2 --master_port $RANDOM -m train example/config_instruct_v1.yaml
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING]
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] *****************************************
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-06-13 21:57:36,864] torch.distributed.run: [WARNING] *****************************************
[2024-06-13 21:57:36,865] torch.distributed.elastic.rendezvous.static_tcp_rendezvous: [INFO] Creating TCPStore as the c10d::Store implementation
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpm3mt26aa
[2024-06-13 21:57:38,580] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpm3mt26aa/_remote_module_non_scriptable.py
[2024-06-13 21:57:38,581] torch.distributed.nn.jit.instantiator: [INFO] Created a temporary directory at /data/tmp/tmpt13cbb4l
[2024-06-13 21:57:38,582] torch.distributed.nn.jit.instantiator: [INFO] Writing /data/tmp/tmpt13cbb4l/_remote_module_non_scriptable.py
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 0
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,323] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Run dir: /data/ft/finetuning_instruct_admin_1
[rank0]:[2024-06-13 21:57:39,324] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
args: TrainArgs(data=DataArgs(data='', shuffle=False, instruct_data='../data/instruct_request_v0.2.json', eval_instruct_data='', instruct=InstructArgs(shuffle=True, dynamic_chunk_fn_call=True)), model_id_or_path='../mistral_models/instruct', run_dir='/data/ft/finetuning_instruct_admin_1', optim=OptimArgs(lr=6e-05, weight_decay=0.1, pct_start=0.05), seed=0, num_microbatches=1, seq_len=32768, batch_size=1, max_norm=1.0, max_steps=300, log_freq=1, ckpt_freq=100, save_adapters=True, no_ckpt=False, num_ckpt_keep=3, eval_freq=100, no_eval=True, checkpoint=True, world_size=2, wandb=WandbArgs(project=None, offline=False, key=None, run_name=None), mlflow=MLFlowArgs(tracking_uri=None, experiment_name=None), lora=LoraArgs(enable=True, rank=64, dropout=0.0, scaling=2.0))
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - torch.cuda.device_count: 2
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - CUDA_VISIBLE_DEVICES: 2,3
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - local rank: 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - distributed - INFO - Set cuda device to 1
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - Going to init comms...
[2024-06-13 21:57:39,555] torch.distributed.distributed_c10d: [INFO] Using backend config: {'cuda': 'nccl'}
[rank1]:[2024-06-13 21:57:39,556] torch.distributed.distributed_c10d: [INFO] Using device cuda for object collectives.
NCCL version 2.19.3+cuda12.3
2024-06-13 21:57:39 (UTC) - 0:00:02 - train - INFO - TrainArgs: {'batch_size': 1,
'checkpoint': True,
'ckpt_freq': 100,
'data': {'data': '',
'eval_instruct_data': '',
'instruct': {'dynamic_chunk_fn_call': True, 'shuffle': True},
'instruct_data': '../data/instruct_request_v0.2.json',
'shuffle': False},
'eval_freq': 100,
'log_freq': 1,
'lora': {'dropout': 0.0, 'enable': True, 'rank': 64, 'scaling': 2.0},
'max_norm': 1.0,
'max_steps': 300,
'mlflow': {'experiment_name': None, 'tracking_uri': None},
'model_id_or_path': '../mistral_models/instruct',
'no_ckpt': False,
'no_eval': True,
'num_ckpt_keep': 3,
'num_microbatches': 1,
'optim': {'lr': 6e-05, 'pct_start': 0.05, 'weight_decay': 0.1},
'run_dir': '/data/ft/finetuning_instruct_admin_1',
'save_adapters': True,
'seed': 0,
'seq_len': 32768,
'wandb': {'key': None, 'offline': False, 'project': None, 'run_name': None},
'world_size': 2}
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Reloading model from ../mistral_models/instruct/consolidated.safetensors ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Converting model to dtype torch.bfloat16 ...
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Loaded model on cpu!
2024-06-13 21:57:39 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Initializing lora layers ...
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Finished initialization!
2024-06-13 21:57:40 (UTC) - 0:00:03 - finetune.wrapped_model - INFO - Sharding model over 2 GPUs ...
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - Model sharded!
2024-06-13 21:57:46 (UTC) - 0:00:09 - finetune.wrapped_model - INFO - 167,772,160 out of 7,415,795,712 parameters are finetuned (2.26%).
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Loading ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - ../data/instruct_request_v0.2.json loaded and tokenized.
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...
2024-06-13 21:57:46 (UTC) - 0:00:10 - dataset - INFO - Shuffling ../data/instruct_request_v0.2.json ...
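Both ranks go quiet right after the shuffling messages above. To see exactly where each process is stuck, one option (a diagnostic sketch, not something already in train.py) is to register a faulthandler signal near the top of train.py and then send SIGUSR1 to the hung rank; py-spy dump --pid <PID> would give similar information without modifying the code:

```python
# Diagnostic sketch: add near the top of train.py (not part of the original code).
# Sending SIGUSR1 to a hung rank prints every thread's Python stack to stderr,
# which shows whether it is sitting in dist.barrier() or in next(data_loader).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
# From a shell on the instance: kill -USR1 <pid of the hung rank>
```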
NCCL logs
tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19202:19202 [0] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19202:19202 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19202 [0] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19202:19202 [0] NCCL INFO NCCL version 2.19.3+cuda12.3
tail: /data/var/log/nccl_debug.log: file truncated
ip-10-10-10-10:19203:19203 [1] NCCL INFO cudaDriverVersion 12050
ip-10-10-10-10:19203:19203 [1] NCCL INFO Bootstrap : Using ens5:10.10.10.10<0>
ip-10-10-10-10:19203:19203 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ip-10-10-10-10:19202:19217 [0] NCCL INFO NET/Socket : Using [0]ens5:10.10.10.10<0> [1]br-ae5b5c3787a1:172.19.0.1<0> [2]veth9ea7c0d:fe80::b498:75ff:fe18:107d%veth9ea7c0d<0> [3]veth634837c:fe80::5cd1:75ff:fe11:9735%veth634837c<0> [4]vethbf40cbc:fe80::fc73:9aff:fefd:cbec%vethbf40cbc<0> [5]veth8ae4512:fe80::d49b:5fff:fe17:6b0c%veth8ae4512<0> [6]veth3f1c863:fe80::c464:80ff:fe3e:cd9f%veth3f1c863<0> [7]veth7c8f20a:fe80::64c0:e9ff:fec8:be06%veth7c8f20a<0>
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using non-device net plugin version 0
ip-10-10-10-10:19202:19217 [0] NCCL INFO Using network Socket
2 cudaDev 1 nvmlDev 3 busId 1e0 commId 0x1e2c7b29947d4dbf - Init START
2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init START
ip-10-10-10-10:19202:19217 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
-1/-1/-1->1->0
ip-10-10-10-10:19203:19218 [1] NCCL INFO P2P Chunks
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01/02 : 0 1
ip-10-10-10-10:19202:19217 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
ip-10-10-10-10:19202:19217 [0] NCCL INFO P2P Chunksize set to 131072
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 00 : 0[2] -> 1[3] via SHM/direct/direct
ip-10-10-10-10:19202:19217 [0] NCCL INFO Channel 01 : 0[2] -> 1[3] via SHM/direct/direct
| 512
ip-10-10-10-10:19202:19217 [0] NCCL INFO Connected all rings
ip-10-10-10-10:19202:19217 [0] NCCL INFO Connected all trees
ip-10-10-10-10:19202:19217 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ip-10-10-10-10:19202:19217 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
ip-10-10-10-10:19202:19217 [0] NCCL INFO comm 0x88a2220 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 1d0 commId 0x1e2c7b29947d4dbf - Init COMPLETE
It's working now. We simply halted the instance for the night, and after restarting it in the morning, fine-tuning with all 4 GPUs worked. It just "fell into working" ("tombé en marche", as we say in French), but I would prefer to understand why we had issues in the first place. Our instance most likely came back up on a different physical host than yesterday's. Could you recommend some checks to detect the hardware and software configuration of the host that could impact multi-GPU fine-tuning?
Did you figure out what's been causing this? We have a similar setup to yours and also tried with NCCL_P2P_DISABLE set to 1; however, we're using g4.12xlarge rather than g5.