Conversation


@lpn256 lpn256 commented Jun 6, 2025

This is a fork of the latest base code, designed to stay as close to upstream as possible. I had to remove some macros and add an AVX_STATE to global.h, taken from @inschrift-spruch-raum's fork.

Compiling with AVX2 / AVX512F just requires:

cmake .. -DENABLE_AVX256=ON
cmake .. -DENABLE_AVX512=ON

Haven't tested AVX512 since my laptop doesn't support it.
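For reference, a minimal sketch of how such options might be wired up in CMakeLists.txt. The option names ENABLE_AVX256/ENABLE_AVX512 are from this PR, but the flag handling and the meaning of AVX_STATE below are assumptions, not the PR's actual script:

```cmake
# Hypothetical sketch: map the PR's options onto GCC/Clang codegen flags.
option(ENABLE_AVX256 "Build with AVX2 + FMA" OFF)
option(ENABLE_AVX512 "Build with AVX-512F" OFF)

if(ENABLE_AVX256)
  add_compile_options(-mavx2 -mfma)
  add_compile_definitions(AVX_STATE=2)   # assumed meaning of AVX_STATE
elseif(ENABLE_AVX512)
  add_compile_options(-mavx512f)
  add_compile_definitions(AVX_STATE=3)
endif()
```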

Let me know if I should change anything, please. I'd like this to be in the main branch. inschrift-spruch-raum seems to want to keep their own separate fork, and that's fine. Credits to them, though.


lpn256 commented Jun 11, 2025

Now conflicting. Let me see if I can fix utils.h tomorrow.


inschrift-spruch-raum commented Jun 11, 2025

You can completely delete the original code and replace it with the following code.

double dot_scalar(const span_cf64 &v1, const span_cf64 &v2) {
    if (v1.size() != v2.size()) throw std::invalid_argument("invalid_argument");
    return std::transform_reduce(
        v1.begin(), v1.end(),
        v2.begin(),
        0.0,
        std::plus<>(),
        std::multiplies<>()
    );
}

(This code is stored in /common/math.h.)

The codebase implements the vector dot product in multiple places (a long-standing piece of technical debt); this is my consolidated revision.

This version has the same efficiency as the original version and is written in a modern style.


lpn256 commented Jun 11, 2025

> You can completely delete the original code and replace it with the following code. […]

does it include AVX2/512 as well?

@inschrift-spruch-raum

The compiler will automatically optimize it to make better use of AVX.


lpn256 commented Jun 11, 2025

will do.


lpn256 commented Jun 11, 2025

It works now!

@inschrift-spruch-raum

Thanks for the feedback! I'm currently working on the relevant parts and will upload the code shortly.


lpn256 commented Jun 20, 2025

Lemme see if I can resolve conflicts. What does the blending do?


lpn256 commented Jun 20, 2025

Hold on, I deleted the wrong code earlier... I think it was the LPC.h code?


lpn256 commented Jun 20, 2025

Seems to be faster? Someone should compare. Could be the new L1 / L2 blend mode.

@inschrift-spruch-raum

Yes, I noticed that with the test audio I used, the new code is 2 seconds faster than the original code.


lpn256 commented Jun 20, 2025

mothwoman@mothwoman-msi:~/Desktop/Programming/sacmake/build$ sha256sum eusapia.wav eusapia2.wav
1452ab3b42303e838d6a9481935a2bd91340535472530cdd2435ad9dc8f741f9  eusapia.wav
1452ab3b42303e838d6a9481935a2bd91340535472530cdd2435ad9dc8f741f9  eusapia2.wav


lpn256 commented Jul 1, 2025

Noticing massive speed improvements with -O3.


lpn256 commented Jul 1, 2025

@slmdev I was able to encode a 57-minute-long (extremely noisy) music file in 40 minutes with --normal mode. See below.

Encode:

  152898816/152898816: 100.0%
  Timing:  pred 59.61%, enc 29.52%, misc 10.86%
  MD5:     dc3019e7416c7cabec0be8d334192

  611595810->450848254=73.7% (11.795 bps)  1.422x

  Time:    [00:40:38]

I suggest you compile with -O3, use @inschrift-spruch-raum's vector dot function, or just merge this PR (which does both, with some fixes for Linux).

The song is "The Great Bull God" by Natural Snow Buildings. Compression didn't fare well (73.7% of the original WAV size) since the first half is the noise of clipping guitars (the second half is avant-folk music).

Thinking of adding a "realtime" option: normal mode without adaptive block splitting (thus, roughly constant encoding time).


slmdev commented Jul 12, 2025

> You can completely delete the original code and replace it with the following code. […]

I would be surprised if this code is faster than my hand-written AVX function.

@inschrift-spruch-raum

> I would be surprised if this code is faster than my hand-written AVX function.

I can't guarantee that it will be faster than your code under O1 and O2, but it indeed is under O3. This piece of code may seem ordinary, but it is actually the default call of transform_reduce(). In other words, it is heavily used and may be highly optimized in the standard library implementation (which is indeed the case in MSVC).

And if it is not optimized in the standard library, then it will indeed be at a disadvantage under O1 and O2, as in this case. Of course, this piece of code is comparable to the original code under O3.

Also, I noticed that there are multiple implementations of code with the same functionality as this piece of code.


lpn256 commented Jul 12, 2025

I think one advantage of this function would be optimizations for other architectures' vector instruction sets, namely RISC-V.


slmdev commented Jul 12, 2025

I usually compile with -O3 and some other options.

My AVX2+FMA code is roughly 2x faster on my laptop (limited by memory throughput) than a plain loop.
I even added some tricks because I know exactly which arrays are aligned and thus can use different load operators.

I would appreciate some benchmarks to prove that GCC/Clang indeed does it better.
I guess you'll run quickly into the limit of your memory subsystem, so it only looks comparable.

> such as in the case of this. Of course, this piece of code is comparable to the original code under O3.

Why should this be (again) faster than my hand-written loop?


inschrift-spruch-raum commented Jul 12, 2025

> I would appreciate some benchmarks to prove that GCC/Clang indeed does it better.

The Zig toolchain uses Clang and LLVM.

MSVC STL

zig build -Dtarget=x86_64-windows-msvc -Doptimize=ReleaseFast -Dcpu=native
.\zig-out\bin\test.exe
dot_scalar:
  res: 
  hand: 2300ns 124467.577710120938718
  TR :  1300ns 124467.577710121113341
calc_spow:
  res: 
  hand: 2200ns -14272248.587150927633047
  TR :  1300ns -14272248.587150840088725

libc++

zig build -Doptimize=ReleaseFast -Dcpu=native
.\zig-out\bin\test.exe
dot_scalar:
  res: 
  hand: 2900ns 832428.052216392592527
  TR :  1200ns 832428.052216391894035
calc_spow:
  res: 
  hand: 2200ns 30987676.619986657053232
  TR :  1300ns 30987676.619986530393362

Code

I'm not sure where the (floating-point) difference comes from, but the standard library should be fine.

It's not surprising to see such a difference in efficiency. I've seen the power of compiler auto-optimization in many places.


slmdev commented Jul 15, 2025

> return std::transform_reduce(v1.begin(), v1.end(), v2.begin(), 0.0, std::plus<>(), std::multiplies<>());

This variant uses parallel execution, not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization).


inschrift-spruch-raum commented Jul 15, 2025

> This variant uses parallel execution, not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization).

Please don't overestimate the acceleration brought by parallel strategies. Parallel acceleration in pure computation code is generally useless: even when N = 100000000, the parallel strategy is only 0.01 seconds faster than the non-parallel one, which is smaller than the timing fluctuation from random data. Moreover, in my tests, the code in Sac that calls it never has N exceed 128. At that size the time spent creating threads is longer than the entire calculation, and the program will not run in parallel. If you insist on rejecting it, you can add std::execution::seq to the call to make it single-threaded by default. Here is also a result with N = 100000000 and std::execution::seq.

.\zig-out\bin\test.exe 100000000
dot_scalar:
  res: 
  hand: 59789100ns 10398548.724082605913281
  TR :  59322100ns 10398548.724082428961992
calc_spow:
  res: 
  hand: 60762100ns -1070935255.432835578918457
  TR :  60503700ns -1070935255.432391881942749


slmdev commented Jul 15, 2025

> This variant uses parallel execution, not SIMD. Threads are usually occupied by other parts of the program (e.g. optimization).

> Please don't overestimate the acceleration brought by parallel strategies.

Then I can't explain the speedup unless I see the disassembly.
Please benchmark e.g. Sac --high and report results, as I did not see such improvements in my own tests:

https://encode.su/threads/1137-Sac-(State-of-the-Art)-Lossless-Audio-Compression/page11


inschrift-spruch-raum commented Jul 15, 2025

> Then I can't explain the speedup unless I see the disassembly. Please benchmark e.g. Sac --high and report results, as I did not see such improvements in my own tests.

I understand: you are using -O3 instead of -Ofast.

-Ofast brings a series of destructive changes. But in terms of the results, I can accept a difference from the original algorithm at the fourth decimal place.


slmdev commented Jul 15, 2025

> Then I can't explain the speedup unless I see the disassembly. Please benchmark e.g. Sac --high and report results, as I did not see such improvements in my own tests.

> I understand: you are using -O3 instead of -Ofast.
>
> -Ofast brings a series of destructive changes. But in terms of the results, I can accept a difference from the original algorithm at the fourth decimal place.

I don't use fast-math, but I can confirm that std::transform_reduce indeed produces better code than the simple loop:

06:04 -O3
05:14 -O3 + std::transform_reduce for dot and calc_spow
04:41 -O3 -mavx2 -mfma + std::transform_reduce
03:50 -O3 -mavx2 -mfma + AVX2 by hand


lpn256 commented Jul 15, 2025

Oops

@lpn256 lpn256 reopened this Jul 15, 2025

lpn256 commented Jul 15, 2025

I feel like the branch got too off-topic. This now works with a CMake script with minimal edits to upstream. Feel free to keep debating the dot function. Anecdotally, I was able to run --normal in realtime on 50 minutes of audio. Will benchmark with the mainline function now.


lpn256 commented Jul 15, 2025

  152898816/152898816: 100.0%
  Timing:  pred 91.96%, enc 5.88%, misc 2.16%
  MD5:     dc3019e7416c7cabec0be8d334192

  611595810->450848252=73.7% (11.795 bps)  0.282x

  Time:    [03:24:59]

Same file. @inschrift-spruch-raum's function seems faster. Maybe I forgot -O3?


lpn256 commented Jul 15, 2025

Looks like I configured it wrong? Trying again with proper AVX2/FMA.


lpn256 commented Jul 16, 2025

  152898816/152898816: 100.0%
  Timing:  pred 58.50%, enc 30.39%, misc 11.12%
  MD5:     dc3019e7416c7cabec0be8d334192

  611595810->450848252=73.7% (11.795 bps)  1.457x

  Time:    [00:39:39]


lpn256 commented Jul 29, 2025

After much trial and error, the build system is complete.

3 participants