Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: replace sample_autogen code with variadic function template #13928

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Swiftb0y
Copy link
Member

This replaces the ginormous autogenerated file a few function templates that do the same (sometimes even faster). The primary goal was to make the header smaller to potentially speedup builds (as sample.h which includes sample_autogen.h is included very often) by reducing the header size. In reality, only copyN* (where $N = 1, 2, 3$) are used in practice so I'm also fine if we just strip everything else from the file. Apart from that the function templates (while a bit difficult to read for sure) so anything the autogenerated code did while being very slightly faster as well.

Builds on my end (and can likely be made to build everywhere).

I would appreciate some opinion on how we want to proceed. I definitely want to throw out the functions we currently don't need and I'm open to whether we want to replace them with the variadic function template or not.

@Swiftb0y
Copy link
Member Author

okay, so this technically works. The build failure on MSVC looks like a compiler bug to me because it works on MSVC v19.30 (VS17) but not 19.29 (VS16).

@daschuer
Copy link
Member

that do the same (sometimes even faster).

Do you know hwere the gain for faster execution is?

to potentially speedup builds

Are you sure about this? From my experience templates solw down the build.

@Swiftb0y
Copy link
Member Author

Do you know hwere the gain for faster execution is?

Not entirely, I didn't bother digging into differences in the assembly but I did run the benchmarks to verify this didn't cause performance regressions

Benchmark Results

Old:

----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_Copy2WithGain/64         2388 ns         1500 ns            1
BM_Copy2WithGain/512         553 ns          435 ns            1
BM_Copy2WithGain/4096       1045 ns         1026 ns            1
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_Copy2WithRampingGain/64         2836 ns         1905 ns            1
BM_Copy2WithRampingGain/512         863 ns          720 ns            1
BM_Copy2WithRampingGain/4096       1603 ns         1527 ns            1

New:

----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_Copy2WithGain/64         2451 ns         1582 ns            1
BM_Copy2WithGain/512         503 ns          459 ns            1
BM_Copy2WithGain/4096       1042 ns          978 ns            1
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_Copy2WithRampingGain/64         2630 ns         1424 ns            1
BM_Copy2WithRampingGain/512         549 ns          468 ns            1
BM_Copy2WithRampingGain/4096       1127 ns         1070 ns            1

Are you sure about this? From my experience templates solw down the build.

So I've just made some benchmark and I found that the compilation times are very similar for in the low parameter count case. It gets worse exponentally worse on GCC for more parameters, but increases linearly on clang.
In practice, we really don't need this giant unfolded code and I would simply like to suggest all the unneeded implementations and just use a couple handwritten ones. I'm also not a huge fan of the templates either, nevertheless are they better than the autogenerated code IMO. We should only reach for autogenerated code if the compilers built-in facilities are not powerful enough (which they are in this case). In practice keeping this autogenerated code makes it infeasible to refactor which is why I really want to get rid of it.

It may be worth experimenting with removing the if (src.gain == 0) cascade code because thats the primary source of the template complexity and also the reason for the many instantiations on bigger inputs. Regardless, these bigger versions are not used in code anyways so its questionable how much sense it makes to argue based on the asymptotic big $N$ case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants