
Dynamic Kernels Assignments #337

Open
6 of 35 tasks
daniellowell opened this issue Jul 9, 2020 · 5 comments

Comments

@daniellowell
Contributor

daniellowell commented Jul 9, 2020

Purpose

This project exists to minimize our reliance on compile-time parameterization in MIOpen's source kernels. The goal isn't to sacrifice performance, but rather to find ways of reducing the compile-time overhead of the first iteration of neural networks using MIOpen.

Strategy

For some of these kernels the task is pretty straightforward: take the compile-time parameters and move them into runtime parameters. In some cases this can be done without affecting performance. Often, however, not all compile-time parameters can be moved to runtime without seriously affecting performance. In those cases we should identify the parameters that networks change least frequently, so that compiles are minimized. If the remaining compile-time parameters do not significantly reduce the number of compiles, then the kernel should perhaps be converted to assembly code.

Priority Tasks

Data collection

  • Compile time impact for various networks
  • Performance impact of compile time modifications
  • Data collection on solver usage

Structural

Convolution Changes

Priority: HIGH

Non-Convolution Changes

Priority: HIGH

  • Batch Normalization ROCm 3.8 - 3.9
    • Fwd-spatial training (@muralinr)
      • variant 0 ROCm 3.9
      • variant 1 ROCm 3.8
      • variant 2 (Only make minor modifications)
      • variant 3 ROCm 3.8
    • Bwd-spatial training (@muralinr)
      • variant 0 ROCm 3.9
      • variant 1 ROCm 3.8
      • variant 2 (Only make minor modifications)
      • variant 3 ROCm 3.8
    • Spatial Inference: (@daniellowell ) ROCm 3.8

Priority: MEDIUM

  • copyTensor / castTensor / setTensor / scaleTensor ROCm 3.8

    • MIOpenSubTensorOpWithSubTensorKernel.cl
  • subSample / upSample (@alexandraBara) ROCm 3.8

    • MIOpenUtilKernels3.cl
  • TensorOps (@ce1adon) ROCm 3.8

  • Activations (@cderb) ROCm 3.8

    • MIOpenNeuron.cl
  • transpose_NCHW2CNHW / transpose_CNHW2NCHW ROCm 3.8

    • MIOpenUtilKernels4.cl
  • RNN / RNN Update (@ce1adon) ROCm 3.8

  • Pooling

    • Convert those most used to more dynamic ASM or composable kernel techniques (???)
@atamazov
Contributor

If the remaining compile-time parameters do not significantly reduce the number of compiles, then the kernel should perhaps be converted to assembly code.

Because assembling is much faster than OCL/HIP compilation?

@carlushuang
Contributor

carlushuang commented Jul 10, 2020

Basically speaking, if the source code is a .s assembly file, only the assembly phase is needed. If the source is HIP/OCL, it has to go through front-end -> IR -> back-end, so the compile time should be much longer. So converting a static kernel to a dynamic kernel can already save a great deal of time, whether it is dynamic in HIP or in OCL.

The decision to choose [1] ASM-dynamic or [2] HIP/OCL-dynamic should, I think, be based on the following factors:

  • If the performance of [2] is OK (within a 10% drop), then keep [2]
  • If the performance of [2] is too low, e.g. the compiler generates a lot of scratch buffers, choose [1]
  • If compiling [2] is more than ~3 times slower than [1], choose [1]

So from my humble experience, we can have the following preference:

  • memory-bound kernels (BN, pooling, dropout, utility): [2] > [1]
  • compute-bound kernels (conv): [1] > [2]

@atamazov
Contributor

@carlushuang Thanks for the explanations. Just in case: AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3-second logging intervals). Therefore even a linear transformation from HIP/OCL to ASM (without adding any "dynamism") would yield substantial acceleration. Of course, extending the coverage of a kernel (making it more "dynamic" than before) is the preferred way because it also saves space in the binary cache.

@sabreshao

sabreshao commented Jul 29, 2020

AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3 second logging intervals).

May I know the average compilation time for a HIP/OCL/ASM kernel? I saw HIP take 4 s or more and ASM take 100-200 ms. Given that much of the kernel run time is below 100 ms, even if each config runs only once, dynamic kernels still beat static ones.
@daniellowell how do you plan to support mask rcnn/retinanet? Can we run these two with an env var such as MIOPEN_FIND_MODE=Fast to pick up only dynamic kernels?

@daniellowell
Contributor Author

@sabreshao The initial push for this is to support mask-rcnn and retinanet type networks.
