Replies: 1 comment 1 reply
-
@fsolgui I'm surprised there doesn't appear to be any overlap, but at the same time it's a very simple solution so I wouldn't necessarily expect a large degree of overlap either... is it the same with dataloader memory pinned / unpinned? I'm curious to try the GIL-free Python builds; it looks like torch should be supporting that now / very soon... see if that unlocks any dataloader contention. One thing to note: if your dataloading is really lagging with images, install Pillow-SIMD (https://github.com/uploadcare/pillow-simd). It's one of the best things you can do with timm, since it uses Pillow-based pipelines like many torch codebases. It's a bit of a pain because you constantly have to check whether the SIMD package has been stomped over by the normal package (they have the same name, so the pip dependency resolver will always install the original). I tend to keep separate, stable train envs that I don't touch for this reason.
pip uninstall pillow
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
-
Hi,
### Context
I've been developing a custom prefetcher inspired by the Prefetcher class in timm. My version uses a dataset that generates simple random numbers as seeds for GPU-based image generation via custom CUDA kernels. These generated images are then augmented using GPU-enabled torchvision transforms (e.g., RandomResizedCrop, ColorJitter), and finally passed to the model for training.
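For reference, the overlap pattern I'm trying to achieve is roughly the sketch below. This is a simplified approximation with made-up names (`SideStreamPrefetcher`, `loader`), not my actual code and not timm's exact implementation:

```python
import torch

class SideStreamPrefetcher:
    """Sketch of overlapping H2D copies / GPU preprocessing with compute.

    The next batch is moved to the GPU on a side stream while the model
    consumes the current batch on the default stream.
    """

    def __init__(self, loader, device="cuda"):
        self.loader = loader      # expected to yield (input, target) CPU tensors
        self.device = device      # non_blocking copies need pinned host memory

    def __iter__(self):
        stream = torch.cuda.Stream()   # side stream for data preparation
        batch, target = None, None
        for next_batch, next_target in self.loader:
            with torch.cuda.stream(stream):
                # async copy (+ any GPU-side preprocessing) on the side stream
                next_batch = next_batch.to(self.device, non_blocking=True)
                next_target = next_target.to(self.device, non_blocking=True)
            if batch is not None:
                yield batch, target    # model trains on the previous batch here
            # default stream must not touch the new batch before the copy is done
            torch.cuda.current_stream().wait_stream(stream)
            batch, target = next_batch, next_target
        if batch is not None:
            yield batch, target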
However, I'm observing throughput bottlenecks, and suspect that image generation, augmentation, and model training are not overlapping as expected — i.e., the GPU isn’t being kept busy while data is being prepared.
### Profiling timm
To rule out issues in my own implementation, I took a step back and instrumented the default train.py script from the timm repository with the PyTorch Profiler (exporting traces viewable in chrome://tracing). I made minimal changes to insert the profiling logic (roughly sketched after the list below). From the profiler output (see trace below), I noticed that there is no overlap between:
- The data loading and prefetching pipeline (including normalization and random erasing)
- The model's forward and backward passes
- The DataLoader iteration
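The instrumentation was roughly as follows (a simplified sketch; `loader` and `train_one_step` are placeholders for the actual train.py symbols, not the literal diff I applied):

```python
from torch.profiler import profile, ProfilerActivity

# Profile a handful of training steps on both CPU and CUDA, then dump a
# chrome://tracing-compatible trace to inspect kernel / dataloader overlap.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (inputs, targets) in enumerate(loader):
        loss = train_one_step(inputs, targets)   # forward + backward + optimizer
        if step >= 10:
            break

prof.export_chrome_trace("timm_train_trace.json")  # open in chrome://tracing
```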
### Questions
1. Is this behavior expected in timm's current training pipeline? I was under the impression that data loading and prefetching should ideally overlap with model execution to improve throughput.
2. Are there any known limitations in the current pipeline design (e.g., placement of the prefetcher, blocking ops, or stream usage) that could be preventing overlap?
3. I observed the same non-overlapping behavior with both the PyTorch Profiler and NVIDIA Nsight Systems (`nsys profile`). Could these tools be missing the overlap or misrepresenting timing?
4. Interestingly, I observed worse throughput when using the `--no-prefetcher` flag. How can throughput improve with the prefetcher enabled if no actual overlap is occurring?
Thanks in advance for any insight!