Add kernel execution time measurement using hooks for do_bench #139
Conversation
Signed-off-by: Ilya Enkovich <[email protected]>
Force-pushed from a5f421b to d0002cb
I'm a bit wary of adding new parameters to do_bench. IMO we should just create a separate function.
I see your concerns. This is not something I plan to keep long-term, and not something I'd put into PyTorch. It's more of a temporary feature to measure pure kernel performance, because we have no other way to do it. Having it in a separate function is fine with me, though it would 90% repeat the existing do_bench.
ah alright I'm fine with a temporary hack :)
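For readers following along, here is a minimal sketch of what such a separate, hook-based timing helper could look like. The set_launch_hooks callable is hypothetical and stands in for whatever registration the CPU device interface actually exposes; only time.perf_counter and the do_bench-style median reduction are standard.

```python
# Sketch only: a stand-alone do_bench-like helper that records the time between
# kernel entry/exit hooks instead of timing the whole Python dispatch path.
# `set_launch_hooks` is a hypothetical registration callback, not a real API.
import statistics
import time


def bench_with_hooks(run_kernel, set_launch_hooks, rep=100):
    times = []
    start = [0.0]

    def enter_hook(*args, **kwargs):
        start[0] = time.perf_counter()

    def exit_hook(*args, **kwargs):
        times.append(time.perf_counter() - start[0])

    set_launch_hooks(enter_hook, exit_hook)
    try:
        for _ in range(rep):
            run_kernel()  # full Python call; only the hooked span is recorded
    finally:
        set_launch_hooks(None, None)
    return statistics.median(times) * 1000  # milliseconds, like do_bench
```

Compared with wrapping the whole call in wall-clock timing, this excludes the Python-side dispatch cost discussed in the PR description.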
python/tutorials/01-vector-add.py (Outdated)
@@ -25,6 +25,7 @@
GPU_BLOCK_SIZE = 1024
CPU_BLOCK_SIZE = 4096
CPU_ST_TRESHOLD = 65536
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested change:
- CPU_ST_TRESHOLD = 65536
+ CPU_ST_THRESHOLD = 65536
Also, what does ST stand for?
It stands for single-thread. There is no point in paying the OMP overhead to parallelize until some input-size threshold. I found this value for one particular machine, but of course the exact value varies by machine. Later, I'd like to move this responsibility to the autotuner; finding the best block size for this kernel depending on the input size could be a nice test for the autotuner on CPU.
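As a sketch of what handing this over to the autotuner could look like, modeled on the standard vector-add tutorial kernel; the config values below are illustrative, not measured:

```python
# Sketch: let the autotuner choose BLOCK_SIZE per input size instead of a
# hard-coded single-thread threshold. Config values are illustrative only.
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 1024}),
        triton.Config({"BLOCK_SIZE": 4096}),
        # A block covering the whole (small) input means a single kernel
        # instance, i.e. no OMP fan-out.
        triton.Config({"BLOCK_SIZE": 65536}),
    ],
    key=["n_elements"],  # re-tune whenever the input size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```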
size_t N = gridX * gridY * gridZ;
if (N == 1) {{
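  // Trivial 1x1x1 grid: run the single kernel instance directly and skip
  // the OMP parallel region, avoiding its overhead.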
As a rough heuristic, if N is small, like 4 or 8, we can simply go with single-core mode. This is a good start.
It's better to do this tuning through the block size rather than in the launcher, to provide more intuitive behavior for users and explicit control of threading through the grid. Also, it's easy to come up with a case with a small number of quite heavy kernel instances where such a heuristic would be harmful.
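To illustrate the point about explicit control through the grid, here is a small sketch; the names mirror the tutorial constants from the diff above, and the threshold value is machine-dependent. The caller picks the block size, and the grid derived from it determines how many kernel instances, and hence OMP threads, the launcher runs.

```python
# Sketch: threading is controlled by the grid the caller chooses, not by a
# heuristic inside the launcher. One big block => grid of (1,) => a single
# kernel instance with no OMP fan-out.
import triton

CPU_BLOCK_SIZE = 4096
CPU_ST_THRESHOLD = 65536  # below this, parallelizing is not worth the OMP overhead


def pick_block_and_grid(n_elements):
    if n_elements <= CPU_ST_THRESHOLD:
        block = triton.next_power_of_2(n_elements)  # one instance covers everything
    else:
        block = CPU_BLOCK_SIZE
    return block, (triton.cdiv(n_elements, block),)
```

The tutorial kernel would then be launched as add_kernel[grid](x, y, out, n, BLOCK_SIZE=block), so single-threaded execution for small inputs falls out of the block-size choice rather than a launcher-side check.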
Signed-off-by: Ilya Enkovich <[email protected]>
Force-pushed from d0002cb to 10c5025
* Add timing measurements using launch hooks for CPU.
* Avoid OMP for trivial grid in CPU launcher.
* Add more measurement options for vector-add tutorial.

Signed-off-by: Ilya Enkovich <[email protected]>
Our measurements on CPU using do_bench include all the overhead related to Triton kernel call dispatching in Python. My experiments show this overhead is about 50-60 microseconds, which significantly affects the resulting performance numbers. While we are interested in the full execution time, including this overhead, we also want to evaluate the generated kernel's performance on its own, because that's what we want to optimize in our backend.

To measure the time spent in the CPU kernel launcher, I added an option to our CPUDeviceInterface to use entry and exit hooks to measure kernel execution time. This lets us ignore the kernel call dispatching overhead, and it better matches the measurements done for GPU using GPU events. IIUC, the proton profiler also uses these hooks for timing, so at some point we might enable proton for CPU and use it instead.

This PR also adds more measurement options to the vector-add tutorial, with tiling and better block-size tuning to exercise the new feature.

The minor change in the launcher significantly improves vector-add timing for small sizes, where we want to go single-threaded and don't want to pay the OMP overhead to run a single kernel instance.
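As a rough illustration of the dispatch-overhead point (this is not the actual measurement code from this PR; kernel_time_ms is a hypothetical accessor for whatever the hook-based path reports):

```python
# Sketch: compare end-to-end Python time with hook-measured kernel time to
# expose the per-call dispatch overhead. `kernel_time_ms` is hypothetical.
import time


def report_overhead(run_kernel, kernel_time_ms, rep=100):
    t0 = time.perf_counter()
    for _ in range(rep):
        run_kernel()
    total_ms = (time.perf_counter() - t0) * 1000 / rep  # includes Python dispatch
    hooked_ms = kernel_time_ms()                         # pure kernel body only
    print(f"end-to-end: {total_ms:.3f} ms, kernel: {hooked_ms:.3f} ms, "
          f"dispatch overhead: {total_ms - hooked_ms:.3f} ms")
```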