Bestla Kernels understanding and benchmarking #289
In oneDNN, with low-precision data types, we have support for the u8s8s8 data type. In the BesTLA benchmark infra we can find a couple of classes for low-precision types, including u8s8s32, s8s8s32, and some classes with different clip dtypes - Ref: https://github.com/intel/neural-speed/blob/main/bestla/bestla/ut/bestla_benchmark.cpp
Question: Within BesTLA, do we have support only for s32 output (i.e. u8s8s32/s8s8s32), or do we also have support for s8 output (i.e. u8s8s8/s8s8s8)?
For the BesTLA benchmark, we have instructions here to build and benchmark with the BesTLA kernels (Ref: https://github.com/intel/neural-speed/tree/main/bestla#benchmark)
Question: Are there any specific env variables that need to be set to get the best performance out of the BesTLA kernels?
It depends on the epilogue classes.
No env variable. Better to run the benchmark on one socket.
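Since the answer hinges on the epilogue classes, here is a minimal generic sketch of what an s32-to-s8 requantization epilogue computes after an int8 GEMM. This is illustrative only, not BesTLA's actual epilogue API; the per-tensor `scale` and `zero_point` parameters are assumptions for the example:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Generic illustration: after an int8 GEMM accumulates into int32,
// an s8-output epilogue rescales and saturates each accumulator.
// `scale` and `zero_point` are assumed per-tensor quantization parameters.
inline int8_t requantize_s32_to_s8(int32_t acc, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::lround(acc * scale)) + zero_point;
  return static_cast<int8_t>(std::clamp(q, -128, 127));
}

// Writing back one row: with an s32 epilogue the accumulator is stored
// as-is; with an s8 epilogue each value passes through requantize first.
void store_row_s8(const int32_t* acc, int8_t* dst, int n, float scale, int32_t zp) {
  for (int i = 0; i < n; ++i) dst[i] = requantize_s32_to_s8(acc[i], scale, zp);
}
```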
Thanks @luoyu-intel for the clarification. I have some follow-up questions on the same topic, which looks interesting. I have been using the benchmarking infra provided in the repo, and with it I have benchmarked the BesTLA kernels for u8s8s32. The BesTLA kernels are run with ./bin/bestla_benchmark, which is the benchmark infra being used to get op-level timing for different ISAs.
Question 1: Would like to confirm if I can proceed further with the above script/infra for more op-level analysis?
Neural Speed provides functionality called tensor parallelism, and BesTLA also provides parallelism functionality through its parallel template classes. Question: Is parallelization taken care of by BesTLA, by Neural Speed, or by Neural Speed followed by the BesTLA micro kernels?
What do you mean by "more op-level analysis"?
BesTLA was developed by a tiny group at Intel (~3 people) but covers all ISAs since AVX2, so we are not able to make it as fast as oneDNN on arbitrary devices with arbitrary core counts and arbitrary problem sizes. Our highlight is supporting other low-bit types via C++ templates, like int3, int4, and int5.
TP is done by Neural Speed. To better support Intel's new Xeon CPUs, we will support it inside BesTLA.
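For intuition about this split of responsibilities, here is a minimal sketch of column-wise tensor parallelism, illustrative only and not Neural Speed's implementation: the weight matrix is partitioned along N so each worker computes disjoint output columns and no reduction is needed.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative column-wise TP: split B (KxN) along N into `workers` slices.
// Each worker computes C[:, n0:n1] = A (MxK) * B[:, n0:n1] independently,
// so results land in disjoint columns and no cross-worker reduction occurs.
void tp_gemm(const float* A, const float* B, float* C,
             size_t M, size_t K, size_t N, size_t workers) {
  std::vector<std::thread> pool;
  for (size_t w = 0; w < workers; ++w) {
    size_t n0 = N * w / workers, n1 = N * (w + 1) / workers;
    pool.emplace_back([=] {
      for (size_t i = 0; i < M; ++i)
        for (size_t j = n0; j < n1; ++j) {
          float acc = 0.f;
          for (size_t k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
          C[i * N + j] = acc;
        }
    });
  }
  for (auto& t : pool) t.join();
}
```

Within one socket, BesTLA's parallel template classes then split each worker's slice further across cores.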
Thanks @luoyu-intel for the quick response.
I was referring to running with more arbitrary problem sizes and observing the behavior. So we can continue with the infra below to run arbitrary problem sizes (specifically with the low-bit C++ templates, to observe their impact)? - https://github.com/intel/neural-speed/tree/main/bestla
Question: Do you have any suggestions on devices, core counts, and problem sizes where we can observe BesTLA performing better than oneDNN?
Yes, you can add the problem sizes to the benchmark's source code and then compile and run it. We are not planning to provide benchdnn-like CLI parameters.
I'd suggest working on this Scheduler class. It schedules the problem sizes to each core and does the cache-blocking work. It may have a ~10% performance impact if you optimize the schedule for one problem size.
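As a rough illustration of the decisions such a scheduler makes, here is a generic sketch, not the actual BesTLA Scheduler class: cache blocking picks tile sizes so the working set of A, B, and C tiles fits a target cache level, and tiles are then assigned to cores. All sizes and the heuristic are assumptions for the example.

```cpp
#include <cstddef>

struct Tile { size_t m, n; };

// Generic cache-blocking heuristic (illustrative): halve the n-block until
// one m-block of A, the corresponding B panel, and the C tile fit within
// `cache_bytes`. The element size of 4 assumes f32/s32 data.
Tile pick_tile(size_t M, size_t N, size_t K, size_t cache_bytes) {
  const size_t elt = 4;
  size_t mb = M < 64 ? M : 64;  // fixed m-block for simplicity
  size_t nb = N;
  while (nb > 16 && (mb * K + K * nb + mb * nb) * elt > cache_bytes) nb /= 2;
  return {mb, nb};
}

// Tiles are distributed round-robin across cores; tuning this mapping for
// a known problem size is where a ~10% gain can come from.
size_t owner_core(size_t tile_index, size_t num_cores) {
  return tile_index % num_cores;
}
```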
Sure @luoyu-intel, thanks. Here is my use case: I am trying to run a Llama model from Hugging Face with low-precision data types (int8, int4) through ipex-llm and other libraries. Based on the above discussion:
Question 1: To achieve the best performance with INT8, would you suggest using oneDNN over BesTLA (as the focus here is on other low-precision data types) and comparing against ipex-llm? Question 2: With the INT4 dtype, would you suggest using the BesTLA kernels to get the best performance over ipex-llm?
oneDNN requires an activation reorder in many cases on both CPU and GPU, but benchdnn does not include the reorder process (as far as I remember), so I'm not sure about this. (A standalone reorder-timing sketch follows below.)
I'm not familiar with ipex-llm's int4 performance.
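To illustrate the reorder cost mentioned above, here is a minimal standalone oneDNN snippet that times just an activation conversion from plain f32 to an s8 layout; the sizes are arbitrary, and a real int8 pipeline would also apply quantization scales. The point is that this cost sits outside what benchdnn reports for the matmul itself:

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include "dnnl.hpp"

// Minimal, standalone timing of an activation reorder (f32 -> s8).
int main() {
  using namespace dnnl;
  engine eng(engine::kind::cpu, 0);
  stream strm(eng);

  const memory::dim M = 1024, K = 4096;
  std::vector<float> src_data(M * K, 1.0f);

  memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
  memory::desc dst_md({M, K}, memory::data_type::s8, memory::format_tag::ab);
  memory src_mem(src_md, eng, src_data.data());
  memory dst_mem(dst_md, eng);

  reorder rdr(src_mem, dst_mem);

  auto t0 = std::chrono::steady_clock::now();
  rdr.execute(strm, src_mem, dst_mem);
  strm.wait();
  auto t1 = std::chrono::steady_clock::now();
  std::cout << "reorder: "
            << std::chrono::duration<double, std::milli>(t1 - t0).count()
            << " ms\n";
  return 0;
}
```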
Thanks @luoyu-intel.
I am trying to utilize the int4 kernels from the benchmark's source code and then compile it (https://github.com/intel/neural-speed/tree/main/bestla -> bestla/bestla/ut/bestla_benchmark.cpp).
Epilogue: neural-speed/bestla/bestla/bestla_epilogue.h, line 156 (commit 97c8190)
Question 1: I was looking for an API that does writeback from F32 to 8-bit and 4-bit. Do we have any API that supports this case?
Prologue:
Question 3: As we have direct classes for u8s8s32 and s8s8s32, do we have any class similar to that for INT4?
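As a generic illustration of what such an f32-to-4-bit writeback involves, here is a sketch under stated assumptions, not BesTLA's actual epilogue interface: each float is quantized to a signed 4-bit code with a per-tensor scale, and two codes are packed per byte.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative f32 -> s4 writeback: quantize to [-8, 7] with a per-tensor
// scale, then pack two 4-bit codes per byte (low nibble = even index).
// Assumes n is even; a real kernel would handle tails and per-group scales.
void write_back_f32_to_s4(const float* src, uint8_t* dst, int n, float scale) {
  for (int i = 0; i < n; i += 2) {
    int lo = std::clamp((int)std::lround(src[i] / scale), -8, 7);
    int hi = std::clamp((int)std::lround(src[i + 1] / scale), -8, 7);
    dst[i / 2] = (uint8_t)((lo & 0xF) | ((hi & 0xF) << 4));
  }
}
```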