Replies: 1 comment 1 reply
-
@fsolgui I'm surprised there doesn't appear to be any overlap, but at the same time it's a very simple solution so I wouldn't necessarily expect a large degree of overlap either... is it the same with dataloader memory pinned / unpinned? I'm curious to try the GIL-free Python builds; it looks like torch should be supporting that now / very soon... see if that unlocks any dataloader contention. One thing to note: if your dataloading is really lagging with images, install Pillow-SIMD (https://github.com/uploadcare/pillow-simd). It's one of the best things you can do with timm, since it uses Pillow-based pipelines like many torch codebases. It's a bit of a pain because you constantly have to check whether the SIMD package has been stomped over by the normal package (they have the same name, so the pip dependency resolver will always install the original). I tend to keep separate, stable train envs that I don't touch for this reason.
pip uninstall pillow
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
-
Hi,
### Context
I've been developing a custom prefetcher inspired by the Prefetcher class in timm. My version uses a dataset that generates simple random numbers as seeds for GPU-based image generation via custom CUDA kernels. These generated images are then augmented using GPU-enabled torchvision transforms (e.g., RandomResizedCrop, ColorJitter), and finally passed to the model for training.
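For reference, the overlap pattern I'm trying to achieve is roughly the sketch below. This is a simplified approximation with made-up names (`SideStreamPrefetcher`, `loader`), not my actual code and not timm's exact implementation:

```python
import torch

class SideStreamPrefetcher:
    """Sketch of overlapping H2D copies / GPU preprocessing with compute.

    The next batch is moved to the GPU on a side stream while the model
    consumes the current batch on the default stream.
    """

    def __init__(self, loader, device="cuda"):
        self.loader = loader      # expected to yield (input, target) CPU tensors
        self.device = device      # non_blocking copies need pinned host memory

    def __iter__(self):
        stream = torch.cuda.Stream()   # side stream for data preparation
        batch, target = None, None
        for next_batch, next_target in self.loader:
            with torch.cuda.stream(stream):
                # async copy (+ any GPU-side preprocessing) on the side stream
                next_batch = next_batch.to(self.device, non_blocking=True)
                next_target = next_target.to(self.device, non_blocking=True)
            if batch is not None:
                yield batch, target    # model trains on the previous batch here
            # default stream must not touch the new batch before the copy is done
            torch.cuda.current_stream().wait_stream(stream)
            batch, target = next_batch, next_target
        if batch is not None:
            yield batch, target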
However, I'm observing throughput bottlenecks, and suspect that image generation, augmentation, and model training are not overlapping as expected — i.e., the GPU isn’t being kept busy while data is being prepared.
### Profiling timm
To rule out issues in my own implementation, I took a step back and instrumented the default train.py script from the timm repository with the PyTorch Profiler (exporting traces viewable in chrome://tracing). I made minimal changes to insert the profiling logic (roughly sketched after the list below). From the profiler output (see trace below), I noticed that there is no overlap between:
- The data loading and prefetching pipeline (including normalization and random erasing)
- The model's forward and backward passes
- The DataLoader iteration
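The instrumentation was roughly as follows (a simplified sketch; `loader` and `train_one_step` are placeholders for the actual train.py symbols, not the literal diff I applied):

```python
from torch.profiler import profile, ProfilerActivity

# Profile a handful of training steps on both CPU and CUDA, then dump a
# chrome://tracing-compatible trace to inspect kernel / dataloader overlap.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, targets) in enumerate(loader):
        loss = train_one_step(inputs, targets)   # forward + backward + optimizer
        if step >= 10:
            break

prof.export_chrome_trace("timm_train_trace.json")  # open in chrome://tracing
```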
### Questions
1. Is this behavior expected in timm's current training pipeline? I was under the impression that data loading and prefetching should ideally overlap with model execution to improve throughput.
2. Are there any known limitations in the current pipeline design (e.g., placement of the prefetcher, blocking ops, or stream usage) that could be preventing overlap?
3. I observed the same non-overlapping behavior with both the PyTorch Profiler and NVIDIA Nsight Systems (`nsys profile`). Could these tools be missing the overlap or misrepresenting timing?
4. Interestingly, I observed worse throughput when using the `--no-prefetcher` flag. How can throughput improve with the prefetcher enabled if no actual overlap is occurring?
Thanks in advance for any insight!