Add a label to ParallelFor that is forwarded to TinyProfiling #3899

DerNils-git · 2024-04-19T12:24:20Z

The most expensive regions of a simulation are (probably) wrapped in an amrex::ParallelFor.
The overhead of wrapping amrex::ParallelFor into a tiny profiling call is admittedly not very high.
But it could simplify quick profiling of AMReX based codes if a label could be added directly to an
amrex::ParallelFor.

Enforcing the label would probably require to much effort for existing codes to adapt to the new interface.
But the ParallelFor could be extended by an optional string as the last argument which would
be forwarded to a the tiny profiling calls if the label is provided.
Current usage:

ParallelFor(box, numcomps,
            [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) { ... });

Extended usage:

ParallelFor(box, numcomps,
            [=] AMREX_GPU_DEVICE (int i, int j, int k, int n) { ... },"MyParallelKernel");

A possible default argument could be

an empty string. Then ParallelFor is only profiled if a label is given explicitly.
a default string. Then all ParallelFor regions wo a label are still profiled and the measurement
always provides insights into the time which is spend in a shared memory kernels.

The text was updated successfully, but these errors were encountered:

WeiqunZhang · 2024-04-24T21:26:11Z

We actually have versions of ParallelFor functions that were added with this in mind.

amrex/Src/Base/AMReX_GpuLaunchFunctsG.H

Line 239 in b752027

    
           void ParallelFor (Gpu::KernelInfo const& info, Box const& box, L const& f) noexcept

Here, Gpu::KernelInfo is defined at

amrex/Src/Base/AMReX_GpuKernelInfo.H

Line 7 in aba9891

class KernelInfo

. We could add a non-explicit constructor to Gpu::KernelInfo. Then we can do

ParallelFor("MyKernelName", box, [=] ....);

But we still need to decide what to do with the string. Note that by default the kernel launch is async. So if we use a TinyProfiler in ParallelFor, we will have to set tiny_profiler.device_synchronize_around_region=true to make the kernel launch synchronous. But that might hurt performance. Alternatively, we can just call nvtx and roctx's RangePush and RangePop. The vendor's profiling tools will get the names.

@DerNils-git Since you mentioned kokkos. Do you know what they do?

DerNils-git · 2024-04-26T07:47:25Z

If I understand the approach of the above ParallelFor correctly, a developer has to decide at the time of implementation whether a kernel should be profiled or not.
In a performance analysis it would be nice to get all kernel information by default and not depend on the developers decision, if a kernel should be profiled or not. I might not necessarily know which kernels are invoked at all during a simulation. So by default getting the information on all invoked kernels at runtime would be nice.

I am only a user of the Kokkos profiling tools. So I can only guess a bit.
Whether profiling tools should be used is be decided at runtime not at compile time. The profiling library has
the option to synchronize kernels, if I interpret the code correctly. But it depends on the profiler whether kernels are synchronized or not. That not a lot of information but looking closer into the implementation details would be a bit to much overhead for now, sorry.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a label to ParallelFor that is forwarded to TinyProfiling #3899

Add a label to ParallelFor that is forwarded to TinyProfiling #3899

DerNils-git commented Apr 19, 2024

WeiqunZhang commented Apr 24, 2024

DerNils-git commented Apr 26, 2024

Add a label to ParallelFor that is forwarded to TinyProfiling #3899

Add a label to ParallelFor that is forwarded to TinyProfiling #3899

Comments

DerNils-git commented Apr 19, 2024

WeiqunZhang commented Apr 24, 2024

DerNils-git commented Apr 26, 2024