
Multi-GPU training does not converge #377

Open
liuuu6 opened this issue Mar 13, 2024 · 2 comments

liuuu6 commented Mar 13, 2024

I'm having trouble training a DLRM model with 4 GPUs. When running on the full dataset, the model reaches an AUC of 0.79 after 12 hours of training with 1 GPU, but the AUC only reaches 0.76 when I use 4 GPUs for the same amount of time, and the training loss oscillates heavily. Here are the arguments I used:

torchrun --nproc_per_node=4 dlrm-embbag-sparse.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=2048 --print-time --test-mini-batch-size=16384 --use-gpu --dist-backend=nccl --mlperf-logging --test-freq=409600 --processed-data-file=/criteo/preprocessed/ --nepochs=1 --memory-map --mlperf-bin-shuffle --mlperf-bin-loader --raw-data-file=/criteo/preprocessed/day
(The arguments above are based on dlrm/bench/dlrm_s_criteo_terabyte.sh.)

Below is the loss chart during my 4-GPU training:

[screenshot: loss curve from the 4-GPU run, showing large oscillations]
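
A minimal sketch of one thing worth checking with this launch setup, assuming each rank consumes a full --mini-batch-size under torchrun so the effective global batch is 4x larger than in the 1-GPU run (whether dlrm-embbag-sparse.py instead splits the batch across ranks may differ): if the global batch grows with the number of ranks, the base learning rate is often scaled to match, and leaving it unscaled can show up as slower convergence and a noisier loss. The scaled_lr helper below is hypothetical and not part of the DLRM scripts.

    import os

    # Hypothetical helper (not from the DLRM repo): linearly scale the base
    # learning rate by the number of ranks launched by torchrun. This only
    # applies if each rank processes a full --mini-batch-size, i.e. the
    # effective global batch is mini_batch_size * WORLD_SIZE.
    def scaled_lr(base_lr: float) -> float:
        world_size = int(os.environ.get("WORLD_SIZE", "1"))  # set by torchrun
        return base_lr * world_size

    if __name__ == "__main__":
        # With --learning-rate=0.1 and --nproc_per_node=4 this prints 0.4.
        print(scaled_lr(0.1))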

hrwleo commented Mar 13, 2024 via email

Qinghe12 commented:

@liuuu6 Could you tell me where you produced the processed-data-file?
I have downloaded the Terabyte dataset (day_0 ... day_23), but it takes too long to preprocess the dataset (I use dlrm_s_criteo_terabyte.sh).
