[QUESTION] Why does training speed go down? #158

Open
zouharvi opened this issue Aug 5, 2023 · 5 comments
Labels: question (Further information is requested)

Comments

zouharvi commented Aug 5, 2023

I noticed that comet-train (after encoder fine-tuning kicks in) runs at ~12 it/s at around 30% of the epoch, drops to ~7 it/s at 60%, and to ~6 it/s at 90% of the epoch.

  1. Is this particular to my setup, or has anyone else observed it as well?
  2. If so, is this expected behaviour?

I'm using NVIDIA A10G GPUs and the following software versions:

  • Python - 3.10.9
  • COMET - upstream
  • torch - 2.0.1
  • pytorch-lightning - 1.9.5
  • transformers - 4.29.0
  • numpy - 1.24.3
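
For anyone trying to reproduce the numbers above: here is a minimal sketch (not part of COMET) of a PyTorch Lightning callback that prints training throughput every N batches, assuming you can attach extra callbacks to the Trainer that comet-train builds, e.g. by editing the training script:

```python
import time

import pytorch_lightning as pl


class ThroughputMonitor(pl.Callback):
    """Print it/s every `every_n_batches` training batches (hypothetical helper)."""

    def __init__(self, every_n_batches: int = 200):
        self.every_n_batches = every_n_batches
        self._t0 = None

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0:
            now = time.perf_counter()
            if self._t0 is not None:
                its = self.every_n_batches / (now - self._t0)
                print(f"batch {batch_idx}: {its:.2f} it/s")
            self._t0 = now
```

Logging throughput directly, rather than reading it off the progress bar, makes it easier to see exactly where in the epoch the rate starts to drop.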
zouharvi added the question label on Aug 5, 2023
maxiek0071 commented

Hi zouharvi,

I noticed this behavior as well. I think it has something to do with "Encoder model fine-tuning": after that kicks in, the speed gradually decreases for me from 13.98 it/s to 5.85 it/s by the end of the epoch.

Could someone comment on whether this is expected behavior?

zouharvi (Author) commented Aug 8, 2023

Indeed, without encoder fine-tuning during the first epoch (nr_frozen_epochs=1), this does not happen. Shot in the dark: I wonder if unfreezing the encoder causes some memory leak that leaves gradient-requiring objects on the GPU?
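
One way to test that hypothesis, sketched under the same assumption that extra callbacks can be attached to the Trainer (the callback name and print format are mine, not COMET's): log allocated GPU memory and the number of parameter tensors with requires_grad=True during the epoch. If allocated memory keeps climbing after the encoder is unfrozen, something is being retained across batches.

```python
import torch
import pytorch_lightning as pl


class MemoryMonitor(pl.Callback):
    """Print GPU memory stats every `every_n_batches` batches (hypothetical helper)."""

    def __init__(self, every_n_batches: int = 500):
        self.every_n_batches = every_n_batches

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0 and torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 2**20      # MiB currently allocated
            peak = torch.cuda.max_memory_allocated() / 2**20   # MiB peak so far
            n_grad = sum(p.requires_grad for p in pl_module.parameters())
            print(f"batch {batch_idx}: {alloc:.0f} MiB allocated, "
                  f"{peak:.0f} MiB peak, {n_grad} trainable parameter tensors")
```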

ricardorei (Collaborator) commented

Hmmm, and what happens in the second epoch? I actually never noticed this...

zouharvi (Author) commented

In the second and subsequent epochs it converges to ~5 it/s for me (A10G with batch size 6).


maxiek0071 commented Aug 16, 2023

Hi, I trained two reference-free QE models on in-domain data with 300k segments: one with nr_frozen_epochs=0.3 (as proposed in the config in this repo) and the other with nr_frozen_epochs=1; the rest of the parameters stayed the same.
The True Positive Rate of the predictions is about 10% lower with nr_frozen_epochs=1, so the model whose encoder fine-tuning starts later performs worse.
With nr_frozen_epochs=1 training was indeed faster until the end of the first epoch; after that, the "Encoder model fine-tuning" took place (as intended).
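
For readers skimming the thread: roughly, nr_frozen_epochs is the (possibly fractional) number of epochs during which the encoder stays frozen, so 0.3 unfreezes it about 30% into the first epoch, while 1 keeps it frozen for the whole first epoch. A rough sketch of that interpretation (not COMET's actual code), using the batch size mentioned earlier in the thread purely for illustration:

```python
def unfreeze_step(nr_frozen_epochs: float, steps_per_epoch: int) -> int:
    """Global training step at which the encoder would be unfrozen (sketch)."""
    return int(nr_frozen_epochs * steps_per_epoch)


# 300k segments with batch size 6 -> 50,000 steps per epoch
steps_per_epoch = 300_000 // 6
print(unfreeze_step(0.3, steps_per_epoch))  # 15000: fine-tuning starts ~30% into epoch 1
print(unfreeze_step(1.0, steps_per_epoch))  # 50000: fine-tuning starts only after epoch 1
```

This matches the observations above: training is fast while the encoder is frozen and slows down once fine-tuning kicks in.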
