[QUESTION] Why does training speed go down? #158

Open
zouharvi opened this issue Aug 5, 2023 · 5 comments
Labels: question (Further information is requested)

Comments

zouharvi commented Aug 5, 2023

I noticed that comet-train (after encoder fine-tuning kicks in) runs at ~12 it/s at around 30% of the epoch, drops to ~7 it/s at 60%, and to ~6 it/s at 90% of the epoch.

  1. Is this particular to my setup, or has anyone else observed it as well?
  2. If so, is this expected behaviour?

I'm using NVIDIA A10G GPUs and the following software versions:

  • Python - 3.10.9
  • COMET - upstream
  • torch - 2.0.1
  • pytorch-lightning - 1.9.5
  • transformers - 4.29.0
  • numpy - 1.24.3
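
For anyone trying to reproduce the numbers above: here is a minimal sketch (not part of COMET) of a PyTorch Lightning callback that prints training throughput every N batches, assuming you can attach extra callbacks to the Trainer that comet-train builds, e.g. by editing the training script:

```python
import time

import pytorch_lightning as pl


class ThroughputMonitor(pl.Callback):
    """Print it/s every `every_n_batches` training batches (hypothetical helper)."""

    def __init__(self, every_n_batches: int = 200):
        self.every_n_batches = every_n_batches
        self._t0 = None

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0:
            now = time.perf_counter()
            if self._t0 is not None:
                its = self.every_n_batches / (now - self._t0)
                print(f"batch {batch_idx}: {its:.2f} it/s")
            self._t0 = now
```

Logging throughput directly, rather than reading it off the progress bar, makes it easier to see exactly where in the epoch the rate starts to drop.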
zouharvi added the question label on Aug 5, 2023
maxiek0071 commented

Hi zouharvi,

I noticed this behavior as well. I think it has something to do with "Encoder model fine-tuning": after that kicks in, the speed gradually decreases for me from 13.98 it/s to 5.85 it/s by the end of the epoch.

Could someone comment on whether this is expected behavior?

zouharvi (Author) commented Aug 8, 2023

Indeed, without encoder fine-tuning during the first epoch (nr_frozen_epochs=1), this does not happen. Shot in the dark: I wonder if unfreezing the encoder causes some memory leak that leaves gradient-requiring objects on the GPU?
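
One way to test that hypothesis, sketched under the same assumption that extra callbacks can be attached to the Trainer (the callback name and print format are mine, not COMET's): log allocated GPU memory and the number of parameter tensors with requires_grad=True during the epoch. If allocated memory keeps climbing after the encoder is unfrozen, something is being retained across batches.

```python
import torch
import pytorch_lightning as pl


class MemoryMonitor(pl.Callback):
    """Print GPU memory stats every `every_n_batches` batches (hypothetical helper)."""

    def __init__(self, every_n_batches: int = 500):
        self.every_n_batches = every_n_batches

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0 and torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 2**20      # MiB currently allocated
            peak = torch.cuda.max_memory_allocated() / 2**20   # MiB peak so far
            n_grad = sum(p.requires_grad for p in pl_module.parameters())
            print(f"batch {batch_idx}: {alloc:.0f} MiB allocated, "
                  f"{peak:.0f} MiB peak, {n_grad} trainable parameter tensors")
```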

ricardorei (Collaborator) commented

Hmmm, and what happens in the second epoch? I actually never noticed this...

zouharvi (Author) commented

In the second and subsequent epochs it converges to ~5 it/s for me (A10G with batch size 6).


maxiek0071 commented Aug 16, 2023

Hi, I trained two reference-free QE models on in-domain data with 300k segments: one with nr_frozen_epochs=0.3 (as proposed in the config in this repo) and the other with nr_frozen_epochs=1; the rest of the parameters stayed the same.
The True Positive Rate of the predictions is about 10% lower with nr_frozen_epochs=1, so the model whose encoder fine-tuning starts later performs worse.
With nr_frozen_epochs=1 training was indeed faster until the end of the first epoch; after that, the "Encoder model fine-tuning" took place (as intended).
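
For readers skimming the thread: roughly, nr_frozen_epochs is the (possibly fractional) number of epochs during which the encoder stays frozen, so 0.3 unfreezes it about 30% into the first epoch, while 1 keeps it frozen for the whole first epoch. A rough sketch of that interpretation (not COMET's actual code), using the batch size mentioned earlier in the thread purely for illustration:

```python
def unfreeze_step(nr_frozen_epochs: float, steps_per_epoch: int) -> int:
    """Global training step at which the encoder would be unfrozen (sketch)."""
    return int(nr_frozen_epochs * steps_per_epoch)


# 300k segments with batch size 6 -> 50,000 steps per epoch
steps_per_epoch = 300_000 // 6
print(unfreeze_step(0.3, steps_per_epoch))  # 15000: fine-tuning starts ~30% into epoch 1
print(unfreeze_step(1.0, steps_per_epoch))  # 50000: fine-tuning starts only after epoch 1
```

This matches the observations above: training is fast while the encoder is frozen and slows down once fine-tuning kicks in.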
