
Dynamic Kernels Assignments #337

Open
6 of 35 tasks
daniellowell opened this issue Jul 9, 2020 · 5 comments

Comments

@daniellowell
Contributor

daniellowell commented Jul 9, 2020

Purpose

This project exists to minimize our reliance on compile-time parameterization in MIOpen's source kernels. The goal isn't to sacrifice performance, but rather to find ways of reducing the compile-time overhead of the first iteration of neural networks using MIOpen.

Strategy

For some of these kernels the task is pretty straightforward: take the compile-time parameters and move them into runtime parameters. In some cases this can be done without affecting performance. Often, however, not all compile-time parameters can be moved to runtime without seriously affecting performance. In those cases we should identify the parameters that networks change least frequently, so that compiles are minimized. If the remaining compile-time parameters do not significantly reduce the number of compiles, then the kernel should perhaps be converted to assembly code.

Priority Tasks

Data collection

  • Compile time impact for various networks
  • Performance impact of compile time modifications
  • Data collection on solver usage

Structural

Convolution Changes

Priority: HIGH

Non-Convolution Changes

Priority: HIGH

  • Batch Normalization ROCm 3.8 - 3.9
    • Fwd-spatial training (@muralinr)
      • variant 0 ROCm 3.9
      • variant 1 ROCm 3.8
      • variant 2 (Only make minor modifications)
      • variant 3 ROCm 3.8
    • Bwd-spatial training (@muralinr)
      • variant 0 ROCm 3.9
      • variant 1 ROCm 3.8
      • variant 2 (Only make minor modifications)
      • variant 3 ROCm 3.8
    • Spatial Inference: (@daniellowell ) ROCm 3.8

Priority: MEDIUM

  • copyTensor / castTensor / setTensor / scaleTensor ROCm 3.8

    • MIOpenSubTensorOpWithSubTensorKernel.cl
  • subSample / upSample (@alexandraBara) ROCm 3.8

    • MIOpenUtilKernels3.cl
  • TensorOps (@ce1adon) ROCm 3.8

  • Activations (@cderb) ROCm 3.8

    • MIOpenNeuron.cl
  • transpose_NCHW2CNHW / transpose_CNHW2NCHW ROCm 3.8

    • MIOpenUtilKernels4.cl
  • RNN / RNN Update (@ce1adon) ROCm 3.8

  • Pooling

    • Convert those most used to more dynamic ASM or composable kernel techniques (???)
@atamazov
Contributor

If the remaining compile-time parameters do not significantly reduce the number of compiles, then the kernel should perhaps be converted to assembly code.

Because assembling is much faster than OCL/HIP compilation?

@carlushuang
Contributor

carlushuang commented Jul 10, 2020

Basically speaking, if the source code is a .s assembly file, only the assembly phase is needed. If the source is HIP/OCL, it has to go through front-end -> IR -> back-end, so the compile time should be much longer. So converting a static kernel to a dynamic kernel can already save a great deal of time, whether it is dynamic in HIP or in OCL.

The decision to choose [1] ASM-dynamic or [2] HIP/OCL-dynamic should, I think, be based on the following factors:

  • If the performance of [2] is OK (within a 10% drop), then keep [2]
  • If the performance of [2] is too low, e.g. the compiler generates a lot of scratch buffers, choose [1]
  • If compiling [2] is more than ~3 times slower than [1], choose [1]

So from my humble experience, we can have the following preference:

  • memory-bound kernels (BN, pooling, dropout, utility): [2] > [1]
  • compute-bound kernels (conv): [1] > [2]

@atamazov
Contributor

@carlushuang Thanks for the explanations. Just in case: AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3-second logging intervals). Therefore even a linear transformation from HIP/OCL to ASM (without adding any "dynamism") would yield substantial acceleration. Of course, extending the coverage of a kernel (making it more "dynamic" than before) is the preferred way because it also saves space in the binary cache.

@sabreshao

sabreshao commented Jul 29, 2020

AFAICS the assembly builds are ~100 times faster than HIP builds and ~15 times faster than OCL builds (you can try auto-tuning and see how many kernels fit into 3 second logging intervals).

May I know the average compilation time for a HIP/OCL/ASM kernel? I saw HIP take 4 s or more and ASM take 100-200 ms. Given that much of the kernel run time is below 100 ms, even if each config runs only once, dynamic kernels still beat static ones.
@daniellowell how do you plan to support mask rcnn/retinanet? Can we run these two with an env var such as MIOPEN_FIND_MODE=Fast to pick up only dynamic kernels?

@daniellowell
Contributor Author

@sabreshao The initial push for this is to support mask-rcnn and retinanet type networks.
