
Porting SIMD plugins #1

Open
yuygfgg opened this issue Aug 17, 2024 · 14 comments
@yuygfgg

yuygfgg commented Aug 17, 2024

I'm also interested in porting VS plugins to macOS and Linux, especially Apple Silicon macOS. However, I've had great difficulty with plugins containing hard-coded SIMD, which fail to compile on non-x86 platforms. Currently I have to modify the code manually to remove all these SIMD optimizations. Do you have any ideas?

Here's an example of the ported plugin: https://github.com/yuygfgg/neo_f3kdb_crossplatform

@Stefan-Olt
Owner

Stefan-Olt commented Aug 17, 2024

Great to hear that you want to port plugins!

For Linux and macOS x86_64 the SIMD optimizations can almost certainly be used as they are (only exception: pure assembly files that hard-code the Windows-style calling convention; those would have to be adapted). I was able to compile the original plugin on Linux without any problems.
For SSE intrinsics like in this plugin there is a simple solution for ARM processors like Apple Silicon: sse2neon: https://github.com/DLTcollab/sse2neon
It translates all the SSE SIMD intrinsics into NEON SIMD instructions. It's of course not the optimal solution: not all SSE instructions map directly to NEON, so some need multiple instructions, while NEON instructions that could improve speed but have no SSE equivalent aren't used. But I still noticed a massive speed improvement; it is used for example in mvtools (note that mvtools also uses hand-written aarch64 assembly taken from x264).
All that's needed is to include the sse2neon.h file; nothing needs to be installed. I would do that with the pre-processor: if it's an ARM/aarch64 platform, include sse2neon, otherwise include the x86 SSE headers.
In general I would not remove any code; the pre-processor can disable code in certain scenarios. This way there is a chance that you can create a patch that is accepted upstream, and you don't have to maintain a separate plugin.
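The include switch described above could look roughly like this (a minimal sketch; the exact architecture macros to test and the location of sse2neon.h depend on your build setup):

```cpp
// Pick the intrinsics header per platform, as suggested above.
// Assumes sse2neon.h has been vendored into the include path.
#if defined(__aarch64__) || defined(_M_ARM64)
  #include "sse2neon.h"   // SSE intrinsics emulated with NEON
#else
  #include <emmintrin.h>  // native SSE2 intrinsics on x86/x86_64
#endif
```

After this switch, the existing `_mm_*` code compiles unchanged on both architectures.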

@yuygfgg
Author

yuygfgg commented Aug 17, 2024

Great to hear that. I'll give it a try right away.

By the way, I'm now aiming to port all frequently used plugins to ARM macOS. It would be great if you had documentation for this project so that others like me could contribute more easily.

@Stefan-Olt
Owner

Stefan-Olt commented Aug 17, 2024

This is currently experimental, my goal is to

  1. Get my Linux/macOS builds included in vsrepo (I already submitted a patch for Linux/macOS support, but it is not yet merged)
  2. Make the process a bit more automatic: currently I create the JSON build definitions by hand; my goal is a script that can update them automatically and create them by analyzing which build system is used (most likely not perfect, and minor adjustments will have to be made)
  3. Improve documentation to help people get more plugins working on Linux/macOS that are currently Windows- or x86_64-focused

Please note that for most plugins porting is not really needed: if they don't include any SIMD, they will either compile directly or need some minor fixes in the build system. I would highly encourage you not to create forks, but rather fixes that can be merged into the plugin repo, so that one plugin can compile on many platforms. In case the plugin is not actively maintained or the author doesn't want to merge the fix, my build tool has the ability to apply a patch.

@yuygfgg
Author

yuygfgg commented Aug 17, 2024

That's exactly what I want. For now I'm keeping these here, simply pasting every command I run.

And for sse2neon, I see it can directly replace the *mmintrin.h headers, but how can I replace intrin.h (the one without a prefix) and x86intrin.h? (solved it myself)

@Stefan-Olt
Owner

I've seen that you tried my vsrepo fork: it seems to work perfectly for you. Its database does not yet include builds for anything other than Windows, so it will tell you that there is no binary available for your platform, which is (unfortunately) fully correct. But you can already use it to install platform-independent scripts like havsfunc.
Maybe you can test that and comment on the pull request I opened, in the hope that it gets merged once more people have tested it: vapoursynth/vsrepo#224

@yuygfgg
Author

yuygfgg commented Aug 17, 2024

It's true that I can install scripts like havsfunc, but I still need to compile the hundreds of dependencies manually. I'm still working on that.

@Stefan-Olt
Owner

Yes, some (very few) of them you can download from the releases here.

@yuygfgg
Author

yuygfgg commented Aug 17, 2024

I've just observed two strange things.

  1. Some of the x86-SIMD-only plugins seem to compile without any modification on ARM, especially those using meson and ninja.
  2. For the plugins I mentioned above, manually porting with sse2neon provides little performance improvement. For example, AddGain runs at 645.99 fps with sse2neon, which is lower than the 671.52 fps without sse2neon.

Also, sse2neon neo_f3kdb runs at 743.19 fps, while the non-SIMD one gets 776.95 fps.

@Stefan-Olt
Owner

  1. Some of the x86 SIMD only plugins seem to compile without any modification on Arm, especially those using meson and ninja.

That's not strange at all. It's good practice to ensure assembly code is only used on the correct platform, and it's not difficult to do: usually you write the C code first, then you figure out which parts take the most time and could benefit from optimization, and you write alternative implementations of those functions in assembly. You use the C pre-processor to enable the optimized code only on the correct platform, and at runtime (on x86) you detect the processor features and select the best implementation to use.
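This pattern can be sketched with a toy kernel (all names here are illustrative, not taken from any particular plugin; the real selection in a plugin would also consult a runtime CPUID check, simplified here to a compile-time test):

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Portable C fallback: add a constant gain to every 8-bit pixel
// (wrapping, exactly like the SSE version below).
static void add_gain_c(uint8_t* dst, const uint8_t* src, int n, uint8_t gain) {
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] + gain);
}

#if defined(__SSE2__)
// Optimized variant, only compiled where SSE2 intrinsics exist.
static void add_gain_sse2(uint8_t* dst, const uint8_t* src, int n, uint8_t gain) {
    const __m128i g = _mm_set1_epi8((char)gain);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(src + i));
        _mm_storeu_si128((__m128i*)(dst + i), _mm_add_epi8(v, g));
    }
    for (; i < n; i++)  // scalar tail for widths not divisible by 16
        dst[i] = (uint8_t)(src[i] + gain);
}
#endif

// Selection point: decides which implementation gets used.
using add_gain_fn = void (*)(uint8_t*, const uint8_t*, int, uint8_t);

static add_gain_fn select_add_gain() {
#if defined(__SSE2__)
    return add_gain_sse2;   // fast path on x86
#else
    return add_gain_c;      // what an unmodified plugin picks on ARM
#endif
}
```

Note that on an ARM build this sketch silently falls back to `add_gain_c`, which is precisely why compiling the SSE files is not enough by itself.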

  2. For these plugins I mentioned above, manually porting with sse2neon provides little performance improvement. For example, AddGain runs at 645.99fps with sse2neon, which is lower than the 671.52fps without sse2neon.

That is strange indeed. This is the result of my znedi3 experiment:

macOS 14 on Apple M1 Max:
nnedi3:                 66 fps
znedi3:                 16 fps
znedi3 (with sse2neon): 68 fps

Ubuntu 22.04 on Ryzen 9 5900X:
nnedi3:                 123 fps
znedi3:                 196 fps

As you can tell, sse2neon gave a massive improvement. nnedi3 does have native ARM NEON assembly, so it is already fast on Apple Silicon. nnedi3 does not have AVX assembly; I assume that's the main reason why znedi3 is faster than nnedi3 on x86.

Are you sure the SSE functions are actually used? Most likely you compiled the SSE parts but never call them, because at the point where the implementation is chosen at runtime, the C implementation is selected on the assumption that SSE is not available on ARM. Those are the points where you have to modify the code.
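Concretely, the runtime selection found in many x86-centric plugins can be adjusted so that the sse2neon-translated SSE path is also taken on aarch64. A hedged sketch, with `cpu_has_sse2` and the `process_*` functions as hypothetical names:

```cpp
// Hypothetical dispatch as commonly found in x86-centric plugins:
//
//     if (cpu_has_sse2())          // false (or absent) on ARM builds,
//         process = process_sse2;  // so the slow C path gets picked
//     else
//         process = process_c;
//
// With sse2neon the SSE2 functions compile to NEON on aarch64, so the
// feature test can simply be forced true there:
#if defined(__aarch64__)
  #define HAVE_SSE2_PATH 1               // sse2neon supplies the intrinsics
#else
  #define HAVE_SSE2_PATH cpu_has_sse2()  // real runtime check on x86
#endif
```

The dispatch condition then becomes `if (HAVE_SSE2_PATH)`, leaving x86 behavior untouched.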

@yuygfgg
Author

yuygfgg commented Aug 18, 2024

I finally found out what happened: my test script set the input video as output 0 and the processed one as output 1, so vspipe simply output the raw video and the plugins weren't even called!

@yuygfgg
Author

yuygfgg commented Aug 18, 2024

I found that using -Ofast -ftree-vectorize -fopenmp gives a much larger speedup than sse2neon. Of course, using both is even better.

-O0 + sse2neon: 40 fps
-Ofast -ftree-vectorize: 400 fps
-Ofast -ftree-vectorize + sse2neon: 590 fps

@yuygfgg
Author

yuygfgg commented Sep 5, 2024

I'm now hosting my macOS ARM plugins at https://github.com/yuygfgg/Macos_vapoursynth_plugins

@Stefan-Olt
Owner

I would not compile with -Ofast: it allows reordering math calculations, which can cause rounding errors to propagate and reduce quality. -O0 is of course bad; it means no optimization at all, just fast compilation. I would recommend -O3 (which includes -ftree-vectorize), as it's the highest optimization level that is still standard-conformant:

-O0: no optimization at all, very fast compilation, good for debugging
-O1: optimizations that only slightly increase compile time
-O2: all optimizations from -O1, plus those that can increase compile time a lot more but do not increase the size of the output binary
-O3: all optimizations from -O2, plus those that could increase the size of the output binary
-Ofast: all optimizations from -O3, plus those that violate the language specification (like reordering math operations, producing different results due to rounding)

The difference between -O3 and -Ofast is probably also not that big (in general the differences get smaller at higher optimization levels); have you tried that?
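The quality concern comes from floating-point addition not being associative, so the reassociation that -Ofast permits can change results. A small self-contained demonstration (values chosen to make the rounding difference visible; not from any plugin):

```cpp
// At 1e8f the spacing between adjacent floats is 8, so adding 3.0f
// rounds back down, while adding 6.0f rounds up to the next float.
// -Ofast (-ffast-math) is allowed to rewrite one form into the other.

float sum_left() {                     // (big + small) + small
    volatile float big = 1e8f, small = 3.0f;
    volatile float t = big + small;    // rounds back to 100000000
    return t + small;                  // rounds back to 100000000 again
}

float sum_right() {                    // big + (small + small)
    volatile float big = 1e8f, small = 3.0f;
    volatile float t = small + small;  // exactly 6
    return big + t;                    // rounds up to 100000008
}
```

Under strict IEEE semantics the two sums differ by 8; -Ofast may legally collapse them into the same expression, and per-pixel rounding drift of this kind is exactly what can degrade output quality.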

@yuygfgg
Author

yuygfgg commented Oct 11, 2024

Yeah, I had realized that.

I'm now using -Ofast only after testing.

-O3 is often a tiny bit slower (for BM3D, 8.56 fps vs 8.13 fps on an M2 Pro).

By the way, BM3D always produces different output with NEON than with C, both with sse2neon (all precision flags on) and with my handwritten NEON version. The difference is larger than the difference between SSE and C.
