Wikikgv2 Model Training is Hanging #11
Comments
Hi, this does not seem to be related to the smore repo. Here is an instruction I found, can you please check?
Hi, can you say more about the details of the environment?
Sure. Here's some basic info. Let me know if you would like anything else.
Hi there, sorry for the inconvenience. Could you please quickly try with only 1 GPU and see if it still hangs? We are trying to figure out whether it is due to GPU-to-GPU communication or the C++ sampler we used.
Sorry, I should have been clearer in my initial comment. The code doesn't hang when running with just one GPU. This only occurs when using multiple GPUs.
We tried one GPU again, and it actually still hangs. Sorry for the inconvenience.
Hi there, Could you please kindly pull the latest code in the wikikgv2 branch, and add |
Hi, we tried your suggestion and now the training doesn't hang! But we hit a similar issue during evaluation. We also noticed your suggestion in another issue and followed it, but evaluation still hangs. I really appreciate your help!
Hi, what's the script you used?
Hi, train_shallow_wikikgv2.sh in the training/vec_scripts folder.
Just to make sure, you used the scripts in train_shallow_wikikgv2.sh and add the |
Actually no, since if we add it we get the unrecognized arguments error.
Have you pulled the most recent commits?
Sorry, we made some mistakes when we merged the code. We have now pulled the latest code and it works for both training and evaluation! Thank you so much!
Hi,
I'm having an issue where the model gets stuck while training. It typically happens early in the first epoch. Below is an example when running smore/training/vec_scripts/train_shallow_wikikgv2.sh (unmodified sans the GPUs) on 4 NVIDIA RTX A6000 50GB GPUs. It hangs forever unless I stop it with a keyboard interrupt. Doing so yields the following traceback (I only post a portion because it's very long and repetitive).
It seems like the problem is in the multiprocessing code, since it hangs when passing messages between processes.
Any help would be appreciated! @hyren
Thanks,
Harry