We set up the environment and preprocessed the data as described in the provided instructions. However, while running `bash scripts/runs/run_pile_baseline120M.sh`, we noticed a sudden drop in speed after a certain number of batches had been loaded, for example around batch 500 of 200,000. Throughput fell from 2 iterations per second to 6 seconds per iteration, a more than 10-fold slowdown. The same issue occurs in the second stage when running `bash scripts/runs/run_pile_doremi120M.sh`. Our setup comprises 4 A100 nodes with 80 GB of memory each. Have you run into this problem during your own training, or do you have any insight into what might be causing it? Thanks!
I think this is due to on-the-fly caching of the shuffle indices. One way to test this: if you run it a second time with the same seed, do you still see the same slowdown?
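To compare the two runs, it may help to record how long each step waits for its batch and see whether the slow steps disappear on the second run. The following is only a minimal sketch: `timed` and `slow_steps` are hypothetical helpers written for illustration, not part of this repository, and the 10x-over-median threshold is an arbitrary assumption.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def timed(batches: Iterable, clock=time.perf_counter) -> Iterator[Tuple[int, float, object]]:
    """Yield (step, seconds spent waiting for the batch, batch)."""
    it = iter(batches)
    step = 0
    while True:
        start = clock()
        try:
            batch = next(it)  # time only the data-loading wait
        except StopIteration:
            return
        yield step, clock() - start, batch
        step += 1


def slow_steps(wait_times: List[float], factor: float = 10.0) -> List[int]:
    """Return the steps whose wait exceeds `factor` times the median wait."""
    if not wait_times:
        return []
    ordered = sorted(wait_times)
    median = ordered[len(ordered) // 2]
    return [i for i, t in enumerate(wait_times) if t > factor * median]
```

Wrapping the training data iterator in `timed(...)`, collecting the wait times for each run, and passing them to `slow_steps` gives a list of slow steps per run. If the first run flags steps from around 500 onward and a second run with the same seed flags none, that would be consistent with the caching explanation. If both runs show the same pattern, the cause is probably elsewhere.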