TorchIO + PyTorch Lightning when using a Queue. #602
-
Hello, this is an important but difficult question. A few hints though: this may explain a few more full volumes staying in memory, although in your example there is no dead time to simulate GPU use, so the "use" of the data from the queue is almost instantaneous. I just realized there is a more plausible explanation. The question is also difficult because, once you add some transforms, more memory will be needed, and how much will depend on the transform.
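For context, a minimal sketch (not part of the original reply) of the kind of `Queue` setup being discussed, with a `sleep()` standing in for the GPU "dead time" mentioned above. The subject paths, patch size and queue parameters are illustrative assumptions, not values taken from this comment.

```python
import time

import torchio as tio
from torch.utils.data import DataLoader

# Illustrative subjects; replace 'subject.nii.gz' with real images.
subjects = [tio.Subject(t1=tio.ScalarImage('subject.nii.gz')) for _ in range(10)]
# Any transform keeps extra copies of the volume alive while patches are extracted.
dataset = tio.SubjectsDataset(subjects, transform=tio.RandomAffine())

queue = tio.Queue(
    dataset,
    max_length=40,           # patches kept in the queue
    samples_per_volume=5,    # patches extracted per loaded volume
    sampler=tio.UniformSampler(patch_size=128),
    num_workers=8,           # each worker holds at least one full volume
)
# The DataLoader wrapping a Queue must use num_workers=0.
loader = DataLoader(queue, batch_size=4, num_workers=0)

for batch in loader:
    time.sleep(1)  # "dead time" simulating a GPU training step
```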
-
I am also not sure if it is the way to go, but I tested another way to monitor memory usage:

```python
import resource
import time

import pytorch_lightning as pl  # assumed import, not shown in the original snippet


class DummyModule(pl.LightningModule):
    def configure_optimizers(self):
        pass

    def training_step(self, *args, **kwargs):
        # pdb.set_trace()  # Use inspect_mem() here.
        time.sleep(1)  # simulate the time a real training step would take
        # ru_maxrss is the peak resident set size, reported in kilobytes on Linux.
        main_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000
        child_memory = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1000
        print(f'max memory peak: {main_memory + child_memory} MB')
```

Now, how much memory does one subject take? When I run my version I see that it starts at 1062 MB and grows with iterations; testing with 50 epochs I see a maximum of 21000 MB, so roughly twice the expected size... The assumption that one worker only needs one full subject in memory may be wrong. I am not sure what exactly happens, but I do see differences when changing `samples_per_volume`: more samples per volume needs a little more memory. I do not see much difference between sleeping 1 second and not sleeping at all in the training step.
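For orientation (not part of the original reply), a quick back-of-the-envelope calculation of what those buffers should cost, using the tensor shapes and counts from the question below and assuming float32 data:

```python
# Rough expected memory, assuming float32 (4 bytes per voxel).
bytes_per_voxel = 4
patch = 128 * 128 * 128 * bytes_per_voxel    # one [1, 128, 128, 128] patch
volume = 181 * 217 * 181 * bytes_per_voxel   # one Colin27 volume [1, 181, 217, 181]

print(f'one patch:  {patch / 1e6:.1f} MB')                        # ~8.4 MB
print(f'one volume: {volume / 1e6:.1f} MB')                       # ~28.4 MB
print(f'40 patches (max_length):  {40 * patch / 1e6:.0f} MB')     # ~336 MB
print(f'108 patches (observed):   {108 * patch / 1e6:.0f} MB')    # ~906 MB
print(f'8 volumes (num_workers):  {8 * volume / 1e6:.0f} MB')     # ~227 MB
```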
-
Hey @fepegar! Sorry for coming back to answer this so late! This post summarizes all the discussion I previously mentioned. I created a small snippet for you to try out (I know you are busy with your PhD, no worries!).

```python
from multiprocessing import Manager

import matplotlib.pyplot as plt
import numpy as np
import psutil
import torchio as tio
from torch.utils.data import DataLoader
from tqdm import tqdm


class SubjectsDataset(tio.SubjectsDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Only change: keep the subjects in a Manager list shared across workers.
        manager = Manager()
        self._subjects = manager.list(self._subjects)


if __name__ == "__main__":
    n_subjects = 10000000
    subjects = [tio.Subject(image=tio.ScalarImage(path="")) for _ in range(n_subjects)]
    data = tio.SubjectsDataset(subjects, load_getitem=False)
    mem_used = [psutil.virtual_memory().used / 1024 ** 3]
    dl = DataLoader(data, batch_size=50, shuffle=True, pin_memory=False, num_workers=8)
    for i, item in tqdm(enumerate(dl), total=n_subjects / dl.batch_size):
        if i % 1000 == 0:
            mem = psutil.virtual_memory()
            mem_used.append(mem.used / 1024 ** 3)
    plt.plot(np.array(mem_used))
    plt.savefig("memory_used.png")
```

To try out the suggestion you can simply change `tio.SubjectsDataset` to the custom `SubjectsDataset` defined at the top of the snippet. Note that I was able to reproduce this on a Linux VM but not on a MacBook. This could happen because Windows and macOS handle multiprocessing with spawn, whereas Linux forks the main process and each worker inherits a copy of the subjects list. Another thing to note is that the number of subjects here is very big; probably no existing medical imaging dataset is that large. I haven't checked whether the same happens with fewer subjects but more images or a custom reader, i.e. more complex subjects instead of the dummy ones in the snippet.
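A one-line usage illustration (not from the original post) of how the swap described above would look, assuming the same `subjects` list:

```python
# Instead of the stock dataset ...
# data = tio.SubjectsDataset(subjects, load_getitem=False)
# ... use the Manager-backed subclass defined in the snippet above:
data = SubjectsDataset(subjects, load_getitem=False)
```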
-
Hi, I wanted to ask about the expected behavior of the `Queue` when it is used with PyTorch Lightning. I am trying to debug some `worker killed...` errors which I suspect come from running out of memory. I created a simple script (not entirely sure it is the right way) to inspect the tensors that are in memory.

From what I understand of the `Queue` behavior, the `Counter` should print at most 40 tensors of shape [1, 128, 128, 128] (corresponding to the `max_length`) and at most 8 tensors of shape [1, 181, 217, 181] (corresponding to the `num_workers` and the shape of the Colin27 images).

When I run `inspect_mem` on the trace I always get 108 tensors of shape [1, 128, 128, 128] and a variable number of tensors of shape [1, 181, 217, 181], sometimes 0, 15, 18 or even 21.

Do these numbers make sense? Is there something in PyTorch Lightning + TorchIO that makes the Queue keep more data in memory?
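The script mentioned above is not included in the thread; purely for illustration, a rough sketch of what such an `inspect_mem` helper might look like, counting live torch tensors by shape with `gc` and a `Counter`:

```python
import gc
from collections import Counter

import torch


def inspect_mem():
    # Count currently allocated torch tensors, grouped by shape.
    shapes = Counter()
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                shapes[tuple(obj.shape)] += 1
        except Exception:
            continue
    for shape, count in shapes.most_common():
        print(f'{count} tensors of shape {list(shape)}')
```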