
Add kernel execution time measurement using hooks for do_bench #139

Merged
merged 3 commits into triton-lang:main from ienkovich/cpu/timing-hooks on Sep 9, 2024

Conversation

ienkovich (Collaborator) commented:

Our measurements on CPU using do_bench include all the overhead related to Triton kernel call dispatching in Python. My experiments show this overhead is about 50-60 microseconds, which significantly affects the resulting performance numbers. While we are interested in the full execution time, including this overhead, we also want to evaluate the generated kernel's performance on its own, because that's what we optimize in our backend.

To measure the time spent in the CPU kernel launcher, I added an option to our CPUDeviceInterface that uses entry and exit hooks to time kernel execution. This lets us ignore the kernel call dispatching overhead and better matches the measurements done for GPU using GPU events. IIUC, the proton profiler also uses these hooks for timing, so at some point we might enable proton for CPU and use it instead.
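For illustration, here is a minimal sketch of the hook-based timing idea. The class and hook names are hypothetical stand-ins, not the actual CPUDeviceInterface API from this PR:

import time

class HookTimer:
    """Accumulates time between kernel entry and exit hooks, so the
    Python dispatch work that happens before the hooks fire is excluded."""

    def __init__(self):
        self.total_ns = 0
        self._start_ns = None

    def enter_hook(self, metadata):
        # Fires in the launcher right before the kernel body runs.
        self._start_ns = time.perf_counter_ns()

    def exit_hook(self, metadata):
        # Fires right after the kernel body returns.
        self.total_ns += time.perf_counter_ns() - self._start_ns

Registering such a pair of hooks around each launch lets a do_bench-style loop report only in-kernel time, analogous to timing GPU kernels with events.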

This PR also adds more measurement options for the vector-add tutorial, with tiling and better block size tuning, to exercise the new feature.

A minor change in the launcher also significantly improves vector-add timing for small sizes, where we want to run single-threaded and not pay the OMP overhead to run a single kernel instance.

int3 (Collaborator) commented on Sep 5, 2024:

I'm a bit wary of adding new parameters to do_bench that aren't strictly device-independent. I'm also not a fan of shoehorning hook-based timing into an Event interface that was never intended for it... Plus, my Inductor PR adds CPUInterface to PyTorch, so we would eventually have to mirror these changes there.

IMO we should just create a separate do_bench_events function or something, but that would require triton-lang#4496. Let me try and get someone to accept it...

ienkovich (Collaborator, Author) replied:

I see your concerns. This is not something I plan to keep long-term, nor something I'd put into PyTorch. It's more of a temporary feature to measure pure kernel performance, because we have no other way to do it. Having it in a separate function is fine with me, though it would duplicate about 90% of the existing do_bench.

int3 (Collaborator) left a review comment:

ah alright I'm fine with a temporary hack :)

@@ -25,6 +25,7 @@

GPU_BLOCK_SIZE = 1024
CPU_BLOCK_SIZE = 4096
CPU_ST_TRESHOLD = 65536
int3 (Collaborator) commented:

Suggested change
CPU_ST_TRESHOLD = 65536
CPU_ST_THRESHOLD = 65536

also, what does ST stand for?

ienkovich (Collaborator, Author) replied:

It stands for single-thread. There is no point in paying the OMP overhead to parallelize below some input size threshold. I found this value for a particular machine, but of course the exact value varies from machine to machine. Later, I'd like to hand this responsibility to the autotuner; finding the best block size for this kernel depending on the input size might be a nice test for the autotuner on CPU.
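For illustration, a sketch of that idea as a hypothetical wrapper around the tutorial's vector-add kernel (the wrapper and the exact constants are assumptions, not the tutorial's actual code):

import triton
import triton.language as tl

CPU_BLOCK_SIZE = 4096
CPU_ST_THRESHOLD = 65536  # below this size, OMP fan-out costs more than it saves

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y, out, n_elements):
    # Small inputs: one block covers the whole array, the grid collapses
    # to 1x1x1, and the launcher runs single-threaded with no OMP overhead.
    if n_elements <= CPU_ST_THRESHOLD:
        block_size = triton.next_power_of_2(n_elements)
    else:
        block_size = CPU_BLOCK_SIZE
    grid = (triton.cdiv(n_elements, block_size),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=block_size)

An autotuner could make the same decision by including input-size-dependent block sizes in its search space.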

size_t N = gridX * gridY * gridZ;
if (N == 1) {{ // trivial 1x1x1 grid: call the kernel directly, skipping the OMP parallel region
int3 (Collaborator) commented:

As a rough heuristic, if N is small, like 4 or 8, we can simply go with single-core mode. This is a good start.

ienkovich (Collaborator, Author) replied:

It's better to do this tuning through the block size rather than in the launcher: it gives users more intuitive behavior and explicit control of threading through the grid. Also, it's easy to come up with a case with a small number of quite heavy kernel instances, where such a heuristic would be harmful.

ienkovich merged commit 6de10e3 into triton-lang:main on Sep 9, 2024
2 checks passed
ienkovich deleted the ienkovich/cpu/timing-hooks branch on September 9, 2024.
minjang pushed a commit that referenced this pull request Sep 22, 2024
* Add timing measurements using launch hooks for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Avoid OMP for trivial grid in CPU launcher.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add more measurement options for vector-add tutorial.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
minjang pushed commits that referenced this pull request on Oct 22 and Oct 24, 2024.