
Multi-GPU training does not converge #377

Open
liuuu6 opened this issue Mar 13, 2024 · 2 comments

liuuu6 commented Mar 13, 2024

I'm having trouble training a DLRM model with 4 GPUs. When running on the full dataset, the model reaches an AUC of 0.79 after 12 hours of training with 1 GPU, but the AUC only reaches 0.76 when I use 4 GPUs for the same amount of time, and the training loss oscillates heavily. Here are the arguments I used:

torchrun --nproc_per_node=4 dlrm-embbag-sparse.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=2048 --print-time --test-mini-batch-size=16384 --use-gpu --dist-backend=nccl --mlperf-logging --test-freq=409600 --processed-data-file=/criteo/preprocessed/ --nepochs=1 --memory-map --mlperf-bin-shuffle --mlperf-bin-loader --raw-data-file=/criteo/preprocessed/day
(The arguments above are based on dlrm/bench/dlrm_s_criteo_terabyte.sh.)

Below is the loss chart during my 4-GPU training:

[screenshot: loss curve from the 4-GPU run, showing large oscillations]
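
A minimal sketch of one thing worth checking with this launch setup, assuming each rank consumes a full --mini-batch-size under torchrun so the effective global batch is 4x larger than in the 1-GPU run (whether dlrm-embbag-sparse.py instead splits the batch across ranks may differ): if the global batch grows with the number of ranks, the base learning rate is often scaled to match, and leaving it unscaled can show up as slower convergence and a noisier loss. The scaled_lr helper below is hypothetical and not part of the DLRM scripts.

    import os

    # Hypothetical helper (not from the DLRM repo): linearly scale the base
    # learning rate by the number of ranks launched by torchrun. This only
    # applies if each rank processes a full --mini-batch-size, i.e. the
    # effective global batch is mini_batch_size * WORLD_SIZE.
    def scaled_lr(base_lr: float) -> float:
        world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun
        return base_lr * world_size

    if __name__ == "__main__":
        # With --learning-rate=0.1 and --nproc_per_node=4 this prints 0.4.
        print(scaled_lr(0.1))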

hrwleo commented Mar 13, 2024 via email

Qinghe12 commented:

@liuuu6 Could you tell me where you produced the processed-data-file?
I have downloaded the Terabyte dataset (day_0 ... day_23), but it takes too long to preprocess the dataset (I use dlrm_s_criteo_terabyte.sh).
