Training loop profiler #86

lorenzoh · 2021-07-14T20:23:35Z

Using Events as hooks into the training loop, it's possible to create a profiler for training loops that measures not only the time spent executing events but also the time spent in between the events, i.e. in the training loop itself.

This would allow more easily identifying possible performance bottlenecks, like:

waiting on the data iterator (identifiable as time spent between the end of StepEnd and the next StepBegin)
moving data to the GPU during the step (does it matter, or is it just as fast as doing it in the background asynchronously?)
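
As a rough illustration of the first bottleneck, the gap between one StepEnd and the next StepBegin can be read off a timestamped event log. The following is a minimal Python sketch, not FluxTraining.jl code: the log format (pairs of event name and timestamp in seconds) and the function name `data_wait_times` are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def data_wait_times(log: Iterable[Tuple[str, float]]) -> List[float]:
    """Return the gap between each StepEnd and the following StepBegin.

    `log` is a hypothetical sequence of (event_name, timestamp) pairs,
    with timestamps in seconds and in chronological order.
    """
    waits: List[float] = []
    last_end = None  # timestamp of the most recent unmatched StepEnd
    for name, t in log:
        if name == "StepEnd":
            last_end = t
        elif name == "StepBegin" and last_end is not None:
            waits.append(t - last_end)
            last_end = None
    return waits


# Made-up timestamps, chosen only to show the calculation.
example_log = [
    ("StepBegin", 0.0),
    ("StepEnd", 1.0),
    ("StepBegin", 1.5),
    ("StepEnd", 2.5),
    ("StepBegin", 2.75),
]
waits = data_wait_times(example_log)
```

If these gaps are large relative to the step time itself, the loop is probably waiting on the data iterator.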

Thoughts on implementation | This could be implemented as a callback, though you would need two callbacks, one running before all the other callbacks and one running after them (to measure callback times), which is unwieldy. This solution may also not play well with the asynchronous callback scheduler proposed in #85.
The better solution, imo, is to implement a callback execution context that takes the timings before and after it runs the callbacks. It would wrap another callback execution context and delegate to it, so it would also play nicely with the asynchronous callback scheduler, as it would measure only the time spent on the synchronous part.
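
The wrapping idea could look roughly like the sketch below. It is written in Python purely for illustration and does not reflect the FluxTraining.jl API: `SequentialRunner`, `ProfilingRunner`, the `run(event, callbacks)` signature and the injectable `clock` are all hypothetical names assumed for the example.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class SequentialRunner:
    """Hypothetical inner execution context: runs callbacks in order."""

    def run(self, event: str, callbacks: List[Callable[[str], None]]) -> None:
        for cb in callbacks:
            cb(event)


class ProfilingRunner:
    """Wraps another runner and times only the synchronous part of run()."""

    def __init__(self, inner, clock: Callable[[], float] = time.perf_counter):
        self.inner = inner
        self.clock = clock
        # event name -> total seconds spent inside inner.run for that event
        self.callback_time: Dict[str, float] = {}
        # (previous event, event) -> total seconds spent in the loop between them
        self.loop_time: Dict[Tuple[str, str], float] = {}
        # (event, time at which its callbacks finished), or None before the first event
        self._last: Optional[Tuple[str, float]] = None

    def run(self, event: str, callbacks: List[Callable[[str], None]]) -> None:
        start = self.clock()
        if self._last is not None:
            prev_event, prev_stop = self._last
            key = (prev_event, event)
            self.loop_time[key] = self.loop_time.get(key, 0.0) + (start - prev_stop)
        self.inner.run(event, callbacks)  # delegate to the wrapped context
        stop = self.clock()
        self.callback_time[event] = self.callback_time.get(event, 0.0) + (stop - start)
        self._last = (event, stop)
```

Because the wrapper only measures how long the inner `run` call takes to return, an asynchronous inner context would be charged just for the time it blocks the training loop. Time spent waiting on the data iterator would show up under the `("StepEnd", "StepBegin")` key of `loop_time`.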

Interpretation | Events that specify start and stop points like StepBegin and StepEnd could be treated as a layer in the profiling stack. Possibly an existing package for visualizing flamegraphs could be reused to make sense of the profiling data.
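
One way to picture the layering is to treat every `...Begin` event as pushing a frame and every `...End` event as popping it, then attribute the elapsed time to the current stack. The Python sketch below assumes a hypothetical log of (event name, timestamp) pairs and emits semicolon-separated "folded stack" lines, a text format that common flamegraph tools accept; the function names and the choice of microseconds as the unit are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fold_events(log: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Accumulate self time in seconds per stack path, e.g. 'Epoch;Step'."""
    stack: List[str] = []
    self_time: Dict[str, float] = defaultdict(float)
    prev_t = None
    for name, t in log:
        # Charge the time since the previous event to the stack that was active.
        if prev_t is not None and stack:
            self_time[";".join(stack)] += t - prev_t
        if name.endswith("Begin"):
            stack.append(name[: -len("Begin")])
        elif name.endswith("End"):
            frame = name[: -len("End")]
            if not stack or stack[-1] != frame:
                raise ValueError(f"unbalanced event {name!r}")
            stack.pop()
        prev_t = t
    return dict(self_time)


def to_folded_lines(self_time: Dict[str, float]) -> List[str]:
    """Format as 'frame;frame count' lines, with counts in microseconds."""
    return [f"{path} {round(sec * 1e6)}" for path, sec in sorted(self_time.items())]


# Made-up timestamps, chosen only to show the nesting.
example_log = [
    ("EpochBegin", 0.0),
    ("StepBegin", 1.0),
    ("StepEnd", 3.0),
    ("StepBegin", 4.0),
    ("StepEnd", 6.0),
    ("EpochEnd", 7.0),
]
folded = to_folded_lines(fold_events(example_log))
```

Here the time charged to `Epoch` alone is the time spent outside any step, which is where data loading would appear.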

lorenzoh added the enhancement New feature or request label Jul 14, 2021

lorenzoh mentioned this issue Oct 18, 2021

Add training loop profiler #89

Open
