
Speed decrease during training #24

Open
ljb121002 opened this issue Jan 7, 2024 · 1 comment
Comments


ljb121002 commented Jan 7, 2024

We set up the environment and preprocessed the data following the provided instructions. However, while running bash scripts/runs/run_pile_baseline120M.sh, we noticed a sudden drop in speed after a certain number of batches, for example around batch 500 out of 200,000: throughput fell from 2 iterations per second to 6 seconds per iteration, a more than 10-fold slowdown. The same issue occurs in the second stage when running
bash scripts/runs/run_pile_doremi120M.sh. Our setup is 4 A100 nodes with 80 GB of memory each. Did you run into this problem during your training, or do you have any insights into the potential cause? Thanks!

sangmichaelxie (Owner) commented
I think this is due to the on-the-fly caching of the shuffle indices. One test: if you run it a second time with the same seed, do you get the same issue?
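
As a rough way to check this hypothesis, here is a minimal sketch (not taken from this repo) that times repeated shuffles with a fixed seed, assuming the preprocessed data is loaded as an on-disk HuggingFace `datasets` Arrow dataset; the path and seed below are placeholders:

```python
# Minimal sketch: probe whether shuffle-index caching explains the slowdown.
# Assumes an on-disk Arrow dataset; path and seed are placeholders.
import time
from datasets import load_from_disk

ds = load_from_disk("/path/to/preprocessed_pile")  # placeholder path

for run in (1, 2):
    start = time.time()
    # For an on-disk dataset, the first shuffle with a given seed materializes
    # and caches the shuffle indices; a second call with the same seed should
    # reuse that cache and return much faster.
    shuffled = ds.shuffle(seed=1234, load_from_cache_file=True)
    print(f"run {run}: shuffle took {time.time() - start:.1f}s")
```

If the second run (same seed) no longer shows the slowdown, that points to the one-time cost of building and caching the shuffle indices rather than a steady-state throughput problem.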
