Conversation


@lpn256 lpn256 commented Jun 6, 2025

This is a fork of the latest base code, designed to stay as close to upstream as possible. I had to remove some macros and add an AVX_STATE to global.h, taken from @inschrift-spruch-raum's fork.

Compiling with AVX2 / AVX512F just requires:

cmake .. -DENABLE_AVX256=ON
cmake .. -DENABLE_AVX512=ON

Haven't tested AVX512 since my laptop doesn't support it.
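For reference, a minimal sketch of how such options might be wired up in CMakeLists.txt. The option names ENABLE_AVX256/ENABLE_AVX512 are from this PR, but the flag handling and the meaning of AVX_STATE below are assumptions, not the PR's actual script:

```cmake
# Hypothetical sketch: map the PR's options onto GCC/Clang codegen flags.
option(ENABLE_AVX256 "Build with AVX2 + FMA" OFF)
option(ENABLE_AVX512 "Build with AVX-512F" OFF)

if(ENABLE_AVX256)
  add_compile_options(-mavx2 -mfma)
  add_compile_definitions(AVX_STATE=2)   # assumed meaning of AVX_STATE
elseif(ENABLE_AVX512)
  add_compile_options(-mavx512f)
  add_compile_definitions(AVX_STATE=3)
endif()
```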

Let me know if I should change anything, please. I'd like this to be in the main branch. inschrift-spruch-raum seems to want to keep their own separate fork, and that's fine. Credits to them, though.


lpn256 commented Jun 11, 2025

Now conflicting. Let me see if I can fix utils.h tomorrow.


inschrift-spruch-raum commented Jun 11, 2025

You can completely delete the original code and replace it with the following code.

double dot_scalar(const span_cf64 &v1, const span_cf64 &v2) {
    if (v1.size() != v2.size()) throw std::invalid_argument("invalid_argument");
    return std::transform_reduce(
        v1.begin(), v1.end(),
        v2.begin(),
        0.0,
        std::plus<>(),
        std::multiplies<>()
    );
}

(This code is stored in /common/math.h.)

The codebase implements the vector dot product in multiple places (a long-standing piece of technical debt); this is my consolidated revision.

This version has the same efficiency as the original version and is written in a modern style.


lpn256 commented Jun 11, 2025

> You can completely delete the original code and replace it with the following code. […]

does it include AVX2/512 as well?

@inschrift-spruch-raum

The compiler will automatically optimize it to make better use of AVX.


lpn256 commented Jun 11, 2025

will do.


lpn256 commented Jun 11, 2025

It works now!

@inschrift-spruch-raum

Thanks for the feedback! I'm currently working on the relevant parts and will upload the code shortly.


lpn256 commented Jun 20, 2025

Lemme see if I can resolve conflicts. What does the blending do?


lpn256 commented Jun 20, 2025

Hold on, I deleted the wrong code earlier... I think it was the LPC.h code?


lpn256 commented Jun 20, 2025

Seems to be faster? Someone should compare. Could be the new L1 / L2 blend mode.

@inschrift-spruch-raum

Yes, I noticed that with the test audio I used, the new code is 2 seconds faster than the original code.


lpn256 commented Jun 20, 2025

mothwoman@mothwoman-msi:~/Desktop/Programming/sacmake/build$ sha256sum eusapia.wav eusapia2.wav
1452ab3b42303e838d6a9481935a2bd91340535472530cdd2435ad9dc8f741f9  eusapia.wav
1452ab3b42303e838d6a9481935a2bd91340535472530cdd2435ad9dc8f741f9  eusapia2.wav


lpn256 commented Jul 1, 2025

Noticing massive speed improvements with -O3.


lpn256 commented Jul 1, 2025

@slmdev I was able to encode a 57-minute-long (extremely noisy) music file in 40 minutes with --normal mode. See below.

Encode:

  152898816/152898816: 100.0%
  Timing:  pred 59.61%, enc 29.52%, misc 10.86%
  MD5:     dc3019e7416c7cabec0be8d334192

  611595810->450848254=73.7% (11.795 bps)  1.422x

  Time:    [00:40:38]

I suggest you compile with -O3, use @inschrift-spruch-raum's vector dot function, or just merge this PR (which does both, with some fixes for Linux).

The song is "The Great Bull God" by Natural Snow Buildings. Compression didn't fare well (73.7% of the original WAV size) since the first half is the noise of clipping guitars (the second half is avant-folk music).

Thinking of adding a "realtime" option: normal mode without adaptive block splitting (thus, roughly constant encoding time).


slmdev commented Jul 12, 2025

> You can completely delete the original code and replace it with the following code. […]

I would be surprised if this code is faster than my hand-written AVX function.

@inschrift-spruch-raum

> I would be surprised if this code is faster than my hand-written AVX function.

I can't guarantee that it will be faster than your code under O1 and O2, but it indeed is under O3. This piece of code may seem ordinary, but it is actually the default call of transform_reduce(). In other words, it is heavily used and may be highly optimized in the standard library implementation (which is indeed the case in MSVC).

And if it is not optimized in the standard library, then it will indeed be at a disadvantage under O1 and O2, as in this case. Of course, this piece of code is comparable to the original code under O3.

Also, I noticed that there are multiple implementations of code with the same functionality as this piece of code.


lpn256 commented Jul 12, 2025

I think one advantage of this function would be optimizations for other architectures' vector instruction sets, namely RISC-V.


slmdev commented Jul 12, 2025

I usually compile with -O3 and some other options.

My AVX2+FMA code is roughly 2x faster on my laptop (limited by memory throughput) than a plain loop.
I even added some tricks because I know exactly which arrays are aligned and thus can use different load operators.

I would appreciate some benchmarks to prove that GCC/Clang indeed does it better.
I guess you'll run quickly into the limit of your memory subsystem, so it only looks comparable.

> such as in the case of this. Of course, this piece of code is comparable to the original code under O3.

Why should this be (again) faster than my hand-written loop?


inschrift-spruch-raum commented Jul 12, 2025

> I would appreciate some benchmarks to prove that GCC/Clang indeed does it better.

The Zig toolchain uses Clang and LLVM.

MSVC STL

zig build -Dtarget=x86_64-windows-msvc -Doptimize=ReleaseFast -Dcpu=native
.\zig-out\bin\test.exe
dot_scalar:
  res: 
  hand: 2300ns 124467.577710120938718
  TR :  1300ns 124467.577710121113341
calc_spow:
  res: 
  hand: 2200ns -14272248.587150927633047
  TR :  1300ns -14272248.587150840088725

libc++

zig build -Doptimize=ReleaseFast -Dcpu=native
.\zig-out\bin\test.exe
dot_scalar:
  res: 
  hand: 2900ns 832428.052216392592527
  TR :  1200ns 832428.052216391894035
calc_spow:
  res: 
  hand: 2200ns 30987676.619986657053232
  TR :  1300ns 30987676.619986530393362

Code

I'm not sure where the (floating-point) difference comes from, but the standard library should be fine.

It's not surprising to see such a difference in efficiency. I've seen the power of compiler auto-optimization in many places.


slmdev commented Jul 15, 2025

> return std::transform_reduce(v1.begin(), v1.end(), v2.begin(), 0.0, std::plus<>(), std::multiplies<>());

This variant uses parallel execution, not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization).


inschrift-spruch-raum commented Jul 15, 2025

> This variant uses parallel execution, not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization).

Please don't overestimate the acceleration brought by parallel strategies. Parallel acceleration in pure computation code is generally useless: even when N = 100000000, the parallel strategy is only 0.01 seconds faster than the non-parallel one, which is smaller than the timing fluctuation from random data. Moreover, in my tests, the code in Sac that calls it never has N exceed 128. At that size the time spent creating threads is longer than the entire calculation, and the program will not run in parallel. If you insist on rejecting it, you can add std::execution::seq to the call to make it single-threaded by default. Here is also a result with N = 100000000 and std::execution::seq.

.\zig-out\bin\test.exe 100000000
dot_scalar:
  res: 
  hand: 59789100ns 10398548.724082605913281
  TR :  59322100ns 10398548.724082428961992
calc_spow:
  res: 
  hand: 60762100ns -1070935255.432835578918457
  TR :  60503700ns -1070935255.432391881942749


slmdev commented Jul 15, 2025

> This variant uses parallel execution, not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization).

> Please don't overestimate the acceleration brought by parallel strategies.

Then I can't explain the speedup unless I see the disassembly.
Please benchmark e.g. Sac --high and report results, as I did not see such improvements in my own tests:

https://encode.su/threads/1137-Sac-(State-of-the-Art)-Lossless-Audio-Compression/page11


inschrift-spruch-raum commented Jul 15, 2025

> Then I can't explain the speedup unless I see the disassembly. Please benchmark e.g. Sac --high and report results, as I did not see such improvements in my own tests.

I understand: you are using -O3 instead of -Ofast.

-Ofast brings a series of destructive changes. But in terms of the results, I can accept a difference from the original algorithm at the fourth decimal place.


slmdev commented Jul 15, 2025

> Then I can't explain the speedup unless I see the disassembly. Please benchmark e.g. Sac --high and report results, as I did not see such improvements in my own tests.

> I understand: you are using -O3 instead of -Ofast.
>
> -Ofast brings a series of destructive changes. But in terms of the results, I can accept a difference from the original algorithm at the fourth decimal place.

I don't use fast-math, but I can confirm that std::transform_reduce indeed produces better code than the simple loop:

06:04 -O3
05:14 -O3 + std::transform_reduce for dot and calc_spow
04:41 -O3 -mavx2 -mfma + std::transform_reduce
03:50 -O3 -mavx2 -mfma + AVX2 by hand


lpn256 commented Jul 15, 2025

Oops

@lpn256 lpn256 reopened this Jul 15, 2025

lpn256 commented Jul 15, 2025

I feel like the branch got too off-topic. This now works with a CMake script with minimal edits to upstream. Feel free to keep debating the dot function. Anecdotally, I was able to run --normal in realtime on 50 minutes of audio. Will benchmark with the mainline function now.


lpn256 commented Jul 15, 2025

  152898816/152898816: 100.0%
  Timing:  pred 91.96%, enc 5.88%, misc 2.16%
  MD5:     dc3019e7416c7cabec0be8d334192

  611595810->450848252=73.7% (11.795 bps)  0.282x

  Time:    [03:24:59]

Same file. @inschrift-spruch-raum's function seems faster. Maybe I forgot -O3?


lpn256 commented Jul 15, 2025

Looks like I configured it wrong? Trying again with proper AVX2/FMA.


lpn256 commented Jul 16, 2025

  152898816/152898816: 100.0%
  Timing:  pred 58.50%, enc 30.39%, misc 11.12%
  MD5:     dc3019e7416c7cabec0be8d334192

  611595810->450848252=73.7% (11.795 bps)  1.457x

  Time:    [00:39:39]


lpn256 commented Jul 29, 2025

After much trial and error, the build system is complete.

3 participants