Porting SIMD plugins #1
Comments
Great to hear that you want to port plugins! For Linux and macOS x86_64 the SIMD optimizations can almost certainly be used as they are (the only exception: pure assembly files that use the Windows-style calling convention hard-coded; those would have to be adapted). I was able to compile the original plugin on Linux without any problem. |
Great to hear that. I'll give it a try right away. By the way, I'm now aiming to port all frequently used plugins to ARM macOS. It would be great if you had documentation for this project so others like me can contribute more easily. |
This is currently experimental, my goal is to
Please note that for most plugins porting is not really needed: if they don't include any SIMD, they will either be directly compilable or need some minor fixes in the build system. I would highly encourage you not to create forks, but rather fixes that can be merged into the plugin repo, so that the plugin compiles on many platforms. In case the plugin is not actively maintained or the author doesn't want to merge the fix, my build tool has the ability to apply a patch. |
That's exactly what I want. For now I keep these here, simply pasting every command I run.
|
I've seen that you tried my vsrepo fork: |
It's true that I can install scripts like havsfunc, but I still need to manually compile the hundreds of dependencies. I'm still working on that. |
Yes, some (very few) of them you can download from the releases here |
I've just observed 2 strange things.
Also, sse2neon neo_f3kdb runs at 743.19fps, while the non-SIMD one got 776.95fps |
That's not strange at all. It's good practice to ensure assembly code is only used on the correct platform. It's also not difficult to do that: usually you write the C code first, then you figure out which parts need the most time and could be optimized, and you write alternative implementations of those functions in assembly. You use the C preprocessor to enable the optimized code only on the correct platform. At runtime (for x86) you detect the processor features and select the best implementation to use.
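The pattern described above can be sketched roughly as follows. This is only an illustration, not code from any of the plugins discussed: the function names (`add_sat_c`, `add_sat_sse2`, `select_add_sat`) and the macro `HAVE_SSE2_BUILD` are made up for the example, intrinsics stand in for hand-written assembly, and the runtime check uses the GCC/Clang builtin `__builtin_cpu_supports`, so it assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile the SSE2 variant only when the compiler targets x86 with SSE2. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_SSE2_BUILD 1 /* hypothetical macro name */
#else
#define HAVE_SSE2_BUILD 0
#endif

typedef void (*add_sat_fn)(const uint8_t *a, const uint8_t *b,
                           uint8_t *dst, size_t n);

/* Portable reference implementation: written first, compiles everywhere. */
static void add_sat_c(const uint8_t *a, const uint8_t *b,
                      uint8_t *dst, size_t n) {
    assert(a && b && dst);
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum); /* saturate at 255 */
    }
}

#if HAVE_SSE2_BUILD
/* Optimized alternative: 16 pixels per iteration, C code for the tail. */
static void add_sat_sse2(const uint8_t *a, const uint8_t *b,
                         uint8_t *dst, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    add_sat_c(a + i, b + i, dst + i, n - i);
}
#endif

/* Runtime selection: pick the best implementation the CPU supports. */
static add_sat_fn select_add_sat(void) {
#if HAVE_SSE2_BUILD
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return add_sat_sse2;
#endif
    return add_sat_c; /* always available */
}
```

On a non-x86 target the SSE2 function is never compiled, so the file still builds and `select_add_sat()` simply returns the C version.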
That is strange indeed. This is the result of my znedi3 experiment:
As you can tell, sse2neon gave a massive improvement. nnedi3 does have native ARM NEON assembly, therefore it is already fast on Apple Silicon. nnedi3 does not have AVX assembly; I assume that's the main reason why znedi3 is faster than nnedi3 on x86. Are you sure the SSE functions are used? Most likely you have compiled the SSE parts but don't use them, because at the point where the implementation is chosen at runtime the code will use the C implementation, assuming that SSE is not available on ARM. At those points you have to modify the code. |
I finally found out what happened. My test script set the input video as output 0 and the processed one as output 1, so vspipe simply output the raw video and the plugins weren't even called! |
I found that using -Ofast -ftree-vectorize -fopenmp gives a much larger speedup than sse2neon. Of course, using both is even better. -O0 + sse2neon: 40fps |
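To make the point about the runtime selection concrete, here is a minimal sketch of what such a selection function might look like and where a port has to touch it. Everything in it is hypothetical: `USE_SSE2NEON` is an assumed build-time macro (it is not defined by sse2neon itself), and `select_impl` and `impl_name` are invented names. The x86 branch uses the GCC/Clang builtin `__builtin_cpu_supports`.

```c
#include <assert.h>
#include <string.h>

enum impl { IMPL_C, IMPL_SSE2 };

/* ASSUMPTION: the build system defines USE_SSE2NEON when the SSE sources
 * are compiled through sse2neon on ARM. The macro name is hypothetical. */
static enum impl select_impl(void) {
#if defined(__x86_64__) || defined(__i386__)
    /* Original x86 logic: ask the CPU at runtime. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? IMPL_SSE2 : IMPL_C;
#elif defined(USE_SSE2NEON) && defined(__aarch64__)
    /* The branch a port has to add: the translated SSE2 code runs on NEON,
     * which Apple Silicon always provides, so it can be selected directly. */
    return IMPL_SSE2;
#else
    /* Without the branch above, ARM builds silently end up here, even if
     * the SSE sources were compiled in through sse2neon. */
    return IMPL_C;
#endif
}

/* Handy for logging which path was actually chosen. */
static const char *impl_name(enum impl i) {
    assert(i == IMPL_C || i == IMPL_SSE2);
    return i == IMPL_SSE2 ? "sse2" : "c";
}
```

Logging `impl_name(select_impl())` once at plugin load is a cheap way to confirm which path a benchmark is really measuring.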
I'm now hosting my macOS ARM plugins at https://github.com/yuygfgg/Macos_vapoursynth_plugins |
I would not compile with -Ofast: this option allows reordering math calculations, which can cause rounding errors to propagate and reduce quality. -O0 is of course bad; it means no optimization at all, but fast compilation. I would recommend -O3 (this includes -ftree-vectorize), as it's the highest optimization level that is still standard-conformant:
The difference between -O3 and -Ofast is probably also not that big (in general the differences get smaller at higher optimization levels), have you tried that? |
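As a small illustration of why reordering floating-point math matters, the sketch below adds the same four values in two different orders. The function names are made up, and the two orders are written out by hand; this does not show what a particular compiler actually emits under -Ofast, only that the kind of reassociation it permits can change the result.

```c
#include <assert.h>

/* Strict source order: ((a + b) + c) + d. */
static float sum_in_order(float a, float b, float c, float d) {
    return ((a + b) + c) + d;
}

/* The same sum reassociated as (a + c) + (b + d). Mathematically equal,
 * but each float operation rounds, so the result can differ. */
static float sum_reassociated(float a, float b, float c, float d) {
    return (a + c) + (b + d);
}

/* Returns 1 if the two orders disagree for the given inputs. */
static int orders_disagree(float a, float b, float c, float d) {
    assert(a == a && b == b && c == c && d == d); /* reject NaN inputs */
    return sum_in_order(a, b, c, d) != sum_reassociated(a, b, c, d);
}
```

With IEEE-754 single precision and default compiler flags, the inputs `1e8f, 1.0f, -1e8f, 1.0f` give 1.0 in source order (the first `+ 1.0f` is lost to rounding) and 2.0 when reassociated. In a filter that accumulates many pixels, such differences are usually much smaller, but they are the reason the output is not bit-identical.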
Yeah, I only realized after testing that I was using -Ofast. By the way, -O3 is often a tiny bit slower (for BM3D, 8.56fps vs 8.13fps on an M2 Pro). BM3D always outputs differently using NEON and C, with both sse2neon (all precision flags on) and my handwritten NEON version. The difference is larger than that between SSE and C. |
I'm also interested in porting VS plugins to macOS and Linux, especially Apple Silicon macOS. However, I faced great difficulty with plugins that hard-code SIMD, which fail to compile on non-x86 platforms. Currently I have to manually modify the code to remove all these SIMD optimizations. Do you have any ideas?
Here's an example of the ported plugin: https://github.com/yuygfgg/neo_f3kdb_crossplatform