
Add kernel execution time measurement using hooks for do_bench #139

Merged
merged 3 commits into triton-lang:main from ienkovich/cpu/timing-hooks on Sep 9, 2024

Conversation

ienkovich (Collaborator) commented:

Our measurements on CPU using do_bench include all the overhead related to Triton kernel call dispatching in Python. My experiments show this overhead is about 50-60 microseconds, which significantly affects the resulting performance numbers. While we are interested in the full execution time, including this overhead, we also want to evaluate the generated kernel's performance on its own, because that's what we optimize in our backend.

To measure the time spent in the CPU kernel launcher, I added an option to our CPUDeviceInterface that uses entry and exit hooks to time kernel execution. This lets us ignore the kernel call dispatching overhead and better matches the measurements done for GPU using GPU events. IIUC, the proton profiler also uses these hooks for timing, so at some point we might enable proton for CPU and use it instead.
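For illustration, here is a minimal sketch of the hook-based timing idea. The class and hook names are hypothetical stand-ins, not the actual CPUDeviceInterface API from this PR:

import time

class HookTimer:
    """Accumulates time between kernel entry and exit hooks, so the
    Python dispatch work that happens before the hooks fire is excluded."""

    def __init__(self):
        self.total_ns = 0
        self._start_ns = None

    def enter_hook(self, metadata):
        # Fires in the launcher right before the kernel body runs.
        self._start_ns = time.perf_counter_ns()

    def exit_hook(self, metadata):
        # Fires right after the kernel body returns.
        self.total_ns += time.perf_counter_ns() - self._start_ns

Registering such a pair of hooks around each launch lets a do_bench-style loop report only in-kernel time, analogous to timing GPU kernels with events.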

This PR also adds more measurement options for the vector-add tutorial, with tiling and better block size tuning, to exercise the new feature.

A minor change in the launcher also significantly improves vector-add timing for small sizes, where we want to run single-threaded and not pay the OMP overhead to run a single kernel instance.

int3 (Collaborator) commented on Sep 5, 2024:

I'm a bit wary of adding new parameters to do_bench that aren't strictly device-independent. I'm also not a fan of shoehorning hook-based timing into an Event interface that was never intended for it... Plus, my Inductor PR adds CPUInterface to PyTorch, so we would eventually have to mirror these changes there.

IMO we should just create a separate do_bench_events function or something, but that would require triton-lang#4496. Let me try and get someone to accept it...

ienkovich (Collaborator, Author) replied:

I see your concerns. This is not something I plan to keep long-term, nor something I'd put into PyTorch. It's more of a temporary feature to measure pure kernel performance, because we have no other way to do it. Having it in a separate function is fine with me, though it would duplicate about 90% of the existing do_bench.

int3 (Collaborator) left a review comment:

ah alright I'm fine with a temporary hack :)

@@ -25,6 +25,7 @@

GPU_BLOCK_SIZE = 1024
CPU_BLOCK_SIZE = 4096
CPU_ST_TRESHOLD = 65536
int3 (Collaborator) commented:

Suggested change
CPU_ST_TRESHOLD = 65536
CPU_ST_THRESHOLD = 65536

also, what does ST stand for?

ienkovich (Collaborator, Author) replied:

It stands for single-thread. There is no point in paying the OMP overhead to parallelize below some input size threshold. I found this value for a particular machine, but of course the exact value varies from machine to machine. Later, I'd like to hand this responsibility to the autotuner; finding the best block size for this kernel depending on the input size might be a nice test for the autotuner on CPU.
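For illustration, a sketch of that idea as a hypothetical wrapper around the tutorial's vector-add kernel (the wrapper and the exact constants are assumptions, not the tutorial's actual code):

import triton
import triton.language as tl

CPU_BLOCK_SIZE = 4096
CPU_ST_THRESHOLD = 65536  # below this size, OMP fan-out costs more than it saves

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y, out, n_elements):
    # Small inputs: one block covers the whole array, the grid collapses
    # to 1x1x1, and the launcher runs single-threaded with no OMP overhead.
    if n_elements <= CPU_ST_THRESHOLD:
        block_size = triton.next_power_of_2(n_elements)
    else:
        block_size = CPU_BLOCK_SIZE
    grid = (triton.cdiv(n_elements, block_size),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=block_size)

An autotuner could make the same decision by including input-size-dependent block sizes in its search space.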

size_t N = gridX * gridY * gridZ;
if (N == 1) {{ // trivial 1x1x1 grid: call the kernel directly, skipping the OMP parallel region
int3 (Collaborator) commented:

As a rough heuristic, if N is small, like 4 or 8, we can simply go with single-core mode. This is a good start.

ienkovich (Collaborator, Author) replied:

It's better to do this tuning through the block size rather than in the launcher: it gives users more intuitive behavior and explicit control of threading through the grid. Also, it's easy to come up with a case with a small number of quite heavy kernel instances, where such a heuristic would be harmful.

ienkovich merged commit 6de10e3 into triton-lang:main on Sep 9, 2024
2 checks passed
ienkovich deleted the ienkovich/cpu/timing-hooks branch on September 9, 2024.
minjang pushed a commit that referenced this pull request Sep 22, 2024
* Add timing measurements using launch hooks for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Avoid OMP for trivial grid in CPU launcher.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add more measurement options for vector-add tutorial.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
minjang pushed commits that referenced this pull request on Oct 22 and Oct 24, 2024.