Training get stuck #10

Open
AprLie opened this issue Aug 23, 2022 · 5 comments

Comments

@AprLie

AprLie commented Aug 23, 2022

Hi,

thanks for developing a useful tool for training on large-scale KGs. However, when I use smore to train models like ComplEx or TransE on wikikgv2, training gets stuck about 50% of the time. The hang occurs in the training step (i.e., after the data has been loaded), and it can happen before or after the checkpoint save steps. Have you encountered this issue?

BTW, I can only find training scripts for TransE and ComplEx, but there are 4 other KGE models. I wonder why they are not trained on wikikgv2. Is there anything I need to pay attention to when writing training scripts for them?

Many thanks, and I look forward to your reply.

@hyren
Collaborator

hyren commented Aug 23, 2022

Hi, thanks for your interest. As for getting stuck, do you mean getting stuck right after data loading and before training, or during training? Any pointers on lines that get stuck / might cause the problem will be extremely helpful for us to check.
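One way to collect such pointers is Python's standard `faulthandler` module, which can print the traceback of every thread while a process appears hung. The sketch below is a minimal illustration, not part of smore: the helper name `enable_hang_diagnostics` and the 600-second default are assumptions chosen for this example.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s=600, log_file=sys.stderr):
    """Hypothetical helper: dump Python tracebacks to help locate a hang.

    log_file must be a real file object (it needs a file descriptor)
    and must stay open for as long as the diagnostics are active.
    """
    # Print tracebacks on fatal signals such as SIGSEGV or SIGBUS.
    faulthandler.enable(file=log_file, all_threads=True)
    # Dump the tracebacks of all threads every timeout_s seconds.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    # Allow an on-demand dump with `kill -USR1 <pid>` where SIGUSR1 exists.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
```

Calling this once at the start of the training entry point (and, if worker processes are spawned, inside each worker) would be enough; the line on which each thread sits when the tqdm bar stops moving then shows up in the periodic dump.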

We provide TransE and ComplEx as example baselines for wikikgv2. We will support RotatE and DistMult later as well.

@AprLie
Author

AprLie commented Aug 23, 2022

Sorry for the confusion. The code gets stuck during training. In most cases, it happens during or after the validation steps (e.g., the tqdm bar stops before reaching the final count, or right after it hits "100%").
image

While trying to collect some cases for you, I ran into another problem. It seems to appear when I train two models on one server (each model is trained on two GPUs, and the two runs do not share any GPU).
image
Update: I restarted the model training (while the other model was still training), and it soon raised the bus error.
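A bus error when several training jobs share one server is often, though not always, a sign that shared memory (`/dev/shm` on Linux) has run out. Nothing in this thread confirms that cause, so the following is only a sketch of how to rule it out before launching a second job. The helper name `shm_usage_gib` is hypothetical.

```python
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Hypothetical helper: report total/used/free space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }
```

Printing `shm_usage_gib()` while the first job is running shows how much shared memory is left for the second one; a value of `free` close to zero would make this explanation more plausible.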

Finally, I would appreciate it if you could tell me what changes I should make to run RotatE and DistMult.

Thanks once more for the reply.

@AprLie
Author

AprLie commented Aug 23, 2022

image

One more case, this time after the validation has finished.

@fxmeng

fxmeng commented Oct 4, 2022

I have encountered all of the same problems that you describe in the original post.

@hyren
Collaborator

hyren commented Oct 9, 2022

Hi, sorry for the late reply. We just pushed a hot fix for the hang during evaluation to the wikikgv2 branch. Could you please pull the latest changes and try again?
