Memory Leak in FrechetInceptionDistance if used in training_step #1959
🐛 Bug
If one uses the FrechetInceptionDistance in the training_step of a LightningModule, one can observe an increase in memory consumption due to the backbone InceptionV3 that is not freed afterwards. In the example below, it is approx. 1.2 GB of memory (or 50% more) which does not get freed.

To Reproduce
Run the following code and monitor the memory consumption with nvidia-smi (not the best monitoring tool, but it gives a good general direction and is consistent with the CUDA OOM errors encountered).
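The issue's original script is not reproduced here; the following is a minimal sketch of the setup described above. The toy generator, dataset, image size, and hyperparameters are placeholder assumptions; only the FrechetInceptionDistance usage inside training_step mirrors the report.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from torchmetrics.image.fid import FrechetInceptionDistance


class FIDModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Toy stand-in for a generator; any image-to-image model works here.
        self.generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.fid = FrechetInceptionDistance(feature=2048)

    def training_step(self, batch, batch_idx):
        (real,) = batch
        fake = self.generator(real)
        loss = fake.mean()  # dummy loss, just to have a backward pass
        # FID expects uint8 images in [0, 255] by default.
        real_u8 = (real.clamp(0, 1) * 255).to(torch.uint8)
        fake_u8 = (fake.detach().clamp(0, 1) * 255).to(torch.uint8)
        self.fid.update(real_u8, real=True)
        self.fid.update(fake_u8, real=False)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = TensorDataset(torch.rand(64, 3, 299, 299))
    trainer = pl.Trainer(max_epochs=2, accelerator="auto", devices=1)
    trainer.fit(FIDModule(), DataLoader(data, batch_size=8))
```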
Expected behavior

No/minimal temporary increase in memory consumption, which should be freed as soon as self.inception has finished its forward step: the model should be in eval mode, so no activations or intermediate results are saved; only the internal state of the FrechetInceptionDistance ({fake,real}_features_cov_sum, etc.) is updated. This internal state, however, is pre-allocated (according to my understanding) in __init__ with add_state and should therefore already have its memory allocated.
Environment

Additional context
I recently changed from sampling in the validation_step to sampling at the end of a training epoch in the last training_step. Since then, I have observed the increase in memory consumption (a CUDA OOM error with the same batch_size that previously worked).

Furthermore, using the torch.no_grad() decorator or context manager and/or manually setting self.inception of the metric to eval() did not change anything (sketched below).

The absolute weirdest part is that sometimes the memory gets consumed (typically after debugging for a while) and sometimes it does not.