CMake Build Script - Independent Fork Edition #17
|
Now conflicting. Let me see if I can fix utils.h tomorrow. |
|
You can completely delete the original code and replace it with the following code. (This code is stored in […].) The project implements the vector dot product function in multiple places (a long-standing technical debt), and this is my revised version. It is as efficient as the original and is written in a modern style. |
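The snippet this comment refers to is not preserved in the thread. As a rough sketch only, assuming the revised version is built on `std::transform_reduce` (the function discussed later in this conversation), a standard-library dot product could look like the following. The name `dot_product` and its signature are hypothetical and are not taken from the PR.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>

// Hypothetical name and signature; the project's own dot function may differ.
inline double dot_product(const double* a, const double* b, std::size_t n) {
    // Multiplies element pairs and sums the products. The standard lets the
    // additions be regrouped, which leaves room for the compiler to vectorize.
    return std::transform_reduce(a, a + n, b, 0.0,
                                 std::plus<>{}, std::multiplies<>{});
}
```

Calling `dot_product(x, y, n)` on two arrays of length `n` returns the sum of `x[i] * y[i]`; an empty range returns the initial value `0.0`.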
Does it include AVX2/AVX-512 as well? |
|
The compiler will automatically optimize it to make better use of AVX. |
|
will do. |
|
It works now! |
|
Thanks for the feedback! I'm currently working on the relevant parts and will upload the code shortly. |
|
Lemme see if I can resolve conflicts. What does the blending do? |
|
Hold on, deleted the wrong code earlier... I think I did the LPC.h code? |
|
Seems to be faster? Someone compare. Could be the new L1 / L2 blend mode |
|
Yes, I noticed that with the test audio I used, the new code is 2 seconds faster than the original code. |
|
|
Noticing massive speed improvements with -O3. |
|
@slmdev I was able to encode a 57-minute-long (extremely noisy) music file in 40 minutes with --normal mode. See below. Encode: I suggest you compile with -O3, use @inschrift-spruch-raum's vector dot function, or just merge this PR (which is both, with some fixes for Linux.) The song is "The Great Bull God" by Natural Snow Buildings. Compression didn't fare well (73% size compared to original WAV) since the first half section is the noise of clipping guitars (second half is avant-folk music). Thinking of adding a "realtime" option whose only difference is that it's normal mode but it doesn't do adaptive block splitting (thus, semi-constant encoding time.) |
I would be surprised if this code is faster than my hand-written AVX function. |
I can't guarantee that it will be faster than your code under O1 and O2, but it is under O3. This piece of code may seem ordinary, but it is actually the default call of […]. If that is not optimized in the standard library, it will indeed be at a disadvantage under O1 and O2, as in this case. Of course, this code is comparable to the original under O3. Also, I noticed that the project contains multiple implementations with the same functionality as this piece of code. |
|
I think one advantage for this function would probably be optimizations for other architectures' vector instruction sets, namely RISC-V. |
|
I usually compile with -O3 and some other options. My AVX2+FMA code is roughly 2x faster on my laptop (limited by memory throughput) than a plain loop. I would appreciate some benchmarks to prove that GCC/Clang indeed does it better.
Why should this be (again) faster than my hand-written loop? |
The Zig toolchain uses Clang and LLVM.

MSVC lib:
zig build -Dtarget=x86_64-windows-msvc -Doptimize=ReleaseFast -Dcpu=native
.\zig-out\bin\test.exe
dot_scalar:
res:
hand: 2300ns 124467.577710120938718
TR : 1300ns 124467.577710121113341
calc_spow:
res:
hand: 2200ns -14272248.587150927633047
TR : 1300ns -14272248.587150840088725

LibC++:
zig build -Doptimize=ReleaseFast -Dcpu=native
.\zig-out\bin\test.exe
dot_scalar:
res:
hand: 2900ns 832428.052216392592527
TR : 1200ns 832428.052216391894035
calc_spow:
res:
hand: 2200ns 30987676.619986657053232
TR : 1300ns 30987676.619986530393362

I'm not sure where the error comes from, but the standard library should be fine. It's not surprising to see such a difference in efficiency. I've seen the power of compiler auto-optimization in many places. |
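The test program behind these numbers is not shown in the thread. A minimal sketch of a comparable timing harness, under stated assumptions, is given below. All names (`dot_loop`, `dot_tr`, `time_ns`, `random_vector`) are hypothetical, and the plain loop is only a baseline: it is not the hand-written AVX function measured in the "hand" rows.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Plain single-accumulator loop, used here as a baseline only.
inline double dot_loop(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Standard-library version (multiply pairs, then sum), like the "TR" rows.
inline double dot_tr(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0);
}

// Runs f once and returns {elapsed nanoseconds, result of f}.
template <class F>
std::pair<long long, double> time_ns(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    const double result = f();
    const auto stop = std::chrono::steady_clock::now();
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return {static_cast<long long>(ns.count()), result};
}

// Fills a vector with reproducible pseudo-random values in [-1, 1).
inline std::vector<double> random_vector(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> v(n);
    for (double& x : v) x = dist(gen);
    return v;
}
```

A single run measured in nanoseconds is noisy, so a fairer comparison would repeat each call many times and report the minimum or median. Printing the result next to the time, as the output above does, also keeps the compiler from discarding the computation.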
This variant uses parallel execution not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization). |
Please don't overestimate the acceleration brought by parallel strategies. Parallel acceleration in pure computation code is generally useless. Even when N = 100000000, the parallel strategy is only 0.01 seconds faster than the non-parallel one, which is smaller than the time fluctuation caused by random data. Moreover, in my tests, the code in Sac that calls it never has N exceed 128. In this case, creating threads takes longer than the entire calculation, and the program will not run in parallel. If you insist on rejecting it, you can add […].

.\zig-out\bin\test.exe 100000000
dot_scalar:
res:
hand: 59789100ns 10398548.724082605913281
TR : 59322100ns 10398548.724082428961992
calc_spow:
res:
hand: 60762100ns -1070935255.432835578918457
TR : 60503700ns -1070935255.432391881942749 |
Then I can't explain the speedup unless I see the disassembly. https://encode.su/threads/1137-Sac-(State-of-the-Art)-Lossless-Audio-Compression/page11 |
I understand, you are using O3 instead of Ofast. Ofast will bring a series of destructive changes. But in terms of the results, I can accept the difference of four decimal places from the original algorithm. |
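One plausible source of the differences in the trailing digits is summation order: floating-point addition is not associative, and a reduction that accumulates in several lanes adds the same values in a different grouping than a strict left-to-right loop. The sketch below illustrates only that effect. Both function names are hypothetical, and it makes no claim about what the actual Sac code or any particular standard library does.

```cpp
#include <cstddef>
#include <vector>

// Strict left-to-right sum: one accumulator, one fixed order.
inline double sum_sequential(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Same values, accumulated in four independent lanes that are combined at
// the end, similar in spirit to what a vectorized reduction does.
inline double sum_four_lanes(const std::vector<double>& v) {
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        lane[0] += v[i];
        lane[1] += v[i + 1];
        lane[2] += v[i + 2];
        lane[3] += v[i + 3];
    }
    double tail = 0.0;
    for (; i < v.size(); ++i) tail += v[i];  // leftover elements
    return (lane[0] + lane[1]) + (lane[2] + lane[3]) + tail;
}
```

For well-scaled audio data the two orders typically agree to many significant digits, as in the benchmark output above. For an adversarial input such as `{1e16, 1.0, -1e16, 1.0}` they can disagree outright, because the small terms are absorbed at different points.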
I don't use fast-math, but I can confirm that std::transform_reduce indeed produces better code than the simple loop. |
|
Oops |
|
I feel like the branch got too off-topic. This now works with a CMake script with minimal edits to upstream. Feel free to debate on the dot function. Anecdotally, I was able to run --normal in realtime for 50 minutes of audio. Will bench with mainstream function now. |
Same file. @inschrift-spruch-raum's function seems faster. Maybe I forgot O3? |
|
Looks like I configured it wrong? Trying with proper AVX2/FMA |
|
|
After much trial and error, build system is complete |
This is a fork of the base code, latest version designed to be as similar as possible. I had to remove some macros and add an AVX_STATE to global.h, from @inschrift-spruch-raum's fork.
Compiling with AVX2 / AVX512F just requires:
Haven't tested AVX512 since my laptop doesn't support it.
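To check which instruction set a given build actually targets, one option is to look at the macros that GCC, Clang and MSVC predefine when AVX2 or AVX-512F code generation is enabled. The helper below is a hypothetical sketch for that purpose. It is not the AVX_STATE definition from global.h, whose contents are not shown in this thread.

```cpp
#include <string_view>

// Hypothetical helper, not taken from global.h: reports which vector
// instruction set the compiler was allowed to target for this build.
constexpr std::string_view simd_level() {
#if defined(__AVX512F__)
    return "AVX512F";
#elif defined(__AVX2__)
    return "AVX2";
#else
    return "baseline";
#endif
}
```

Printing `simd_level()` at startup is a quick way to spot a build that was configured without the intended AVX2/FMA flags.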
Let me know if I should change anything, please. I'd like this to be in the main branch. @inschrift-spruch-raum seems to want to have their own separate fork, and that's fine. Credits to them, though.