Skip to content

20x Slowdown in hypergeom.cdf #923

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mborland opened this issue Jan 23, 2023 · 24 comments · Fixed by #930
Closed

20x Slowdown in hypergeom.cdf #923

mborland opened this issue Jan 23, 2023 · 24 comments · Fixed by #930

Comments

@mborland
Copy link
Member

See: scipy/scipy#16079

Minimal example with benchmarks

It's interesting that the slowdown is reproduced by C++11 vs C++14. I am hoping this will not evolve into a larger issue as we move to C++14 as the minimum standard.

@mborland
Copy link
Member Author

Here are the values using current develop with different compilers:

Apple Clang 14 w/libc++ using C++11 on M1

./hypergeom1  0.59s user 0.00s system 67% cpu 0.880 total

Apple Clang 14 w/libc++ using C++14 on M1

./hypergeom2  11.84s user 0.02s system 97% cpu 12.108 total

GCC 12 w/libstdc++ using C++11 on M1

./hypergeom3  0.70s user 0.01s system 72% cpu 0.983 total

GCC 12 w/libstdc++ using C++14 on M1

./hypergeom4  12.36s user 0.03s system 97% cpu 12.745 total

GCC 12 w/libstdc++ using C++17 on M1

./hypergeom5  12.44s user 0.03s system 98% cpu 12.660 total

GCC 12 w/libstdc++ using C++20 on M1

./hypergeom6  12.72s user 0.03s system 97% cpu 13.123 total

@mborland
Copy link
Member Author

mborland commented Jan 23, 2023

And here are values on x86-64 (i7-1255U) using Ubuntu 20.04.5:

GCC 10 w/libstdc++ using C++11

real	0m0.619s
user	0m0.618s
sys	0m0.000s

GCC 10 w/libstdc++ using C++14

real	0m6.886s
user	0m6.881s
sys	0m0.004s

GCC 10 w/libstdc++ using C++17

real	0m5.766s
user	0m5.766s
sys	0m0.000s

The x86-64 slowdown is 10x instead of 20x on M1.

Edit: Using Boost 1.77 and 1.75 yield the same results.

@ckormanyos
Copy link
Member

Hmmm...

Couple ideas.

  • Try to rule out optimization level? Try -Os and maybe try -O2 (which is not so radical)?
  • I've seen slowdowns with constexpr previously, but a factor of 20 seems huge.
  • the clang++ compiler allows you to actually specify stuff like -fconstexpr-steps=8 or also -fconstexpr-depth=8, which I expect wil lead to compilation errors, but may give you insight into what is or is not all getting a constexpr hit?

@ckormanyos
Copy link
Member

Vague recolection about something chrono-esque going on in there.

@jzmaddock
Copy link
Collaborator

This is weird (and rather worrying), just stepping through the code I see nothing that would obviously change with std version - there's no constexpr stuff in there either @ckormanyos. It is very computationally intensive though, lot's of table lookup of the prime numbers too. Ah... I wonder if this could be a thread safety thing doing table lookup? Some hidden locking going on maybe?

@jzmaddock
Copy link
Collaborator

Cygwin x86:

C++11:

real 0m0.734s
user 0m0.640s
sys 0m0.000s

c++14:

real 0m18.668s
user 0m17.734s
sys 0m0.016s

Wow!

@mborland mborland linked a pull request Jan 23, 2023 that will close this issue
@jzmaddock
Copy link
Collaborator

Reproduced with PDF as well as CDF (the former being called by the latter internally).

@jzmaddock
Copy link
Collaborator

It looks like it might be related to prime number lookup - which uses std::array internally - were there any changes between c++11 and 14? Also noted that boost::math::prime is NOT expanded inline in C++14 at least even though it's a fairly trivial function.

@ckormanyos
Copy link
Member

ckormanyos commented Jan 23, 2023

uses std::array internally - were there any changes between c++11 and 14

There was something seemingly trivial involving the member at() (which may throw). It became constexpr, but that seems irrelevant?

@mborland
Copy link
Member Author

It looks like it might be related to prime number lookup - which uses std::array internally - were there any changes between c++11 and 14? Also noted that boost::math::prime is NOT expanded inline in C++14 at least even though it's a fairly trivial function.

The only thing that would have changed for std::array between C++11 and C++14 would be aggregate initialization rules. I doubt that would be in-play here.

@mborland
Copy link
Member Author

If you remove all of the contextual constexpr-ness from prime and make it all follow the C++11 static const and inline you get:

C++11

./hypergeom1  0.69s user 0.01s system 74% cpu 0.947 total

C++14

./hypergeom1  0.65s user 0.01s system 81% cpu 0.810 total

C++17

./hypergeom1  0.65s user 0.01s system 67% cpu 0.991 total

@jzmaddock
Copy link
Collaborator

Yup, just found that too: it's the constexpr std::array's that are totally killing things.

@jzmaddock
Copy link
Collaborator

No change by making them C style arrays.

@mborland
Copy link
Member Author

Yup, just found that too: it's the constexpr std::array's that are totally killing things.

Maybe because the storage isn't treated as static? We don't get that as an option until C++23

@ckormanyos
Copy link
Member

Maybe because the storage isn't treated as static

Sometimes with big tables such as static const std::array<std::uint16_t, 3458> a3 I've seen compilers make a lot of intermediate code in order to handle the constexpr-ness. I do not know if this is a compelling factor in this case.

This is a pending issue that I have on microcontroller code in completely different domains. I tend to avoid constexpr talbes with more than 8-16 elements.

@mborland mborland changed the title 20x Slowdown in hypergeom.cdf BREAKING CHANGE: Fix for Scipy Issue 16079 Jan 23, 2023
@mborland mborland changed the title BREAKING CHANGE: Fix for Scipy Issue 16079 20x Slowdown in hypergeom.cdf Jan 23, 2023
@mborland
Copy link
Member Author

For posterity here's a comparison of the code generated between the C++11 and C++14 version

https://godbolt.org/z/zds83adMd

@ckormanyos
Copy link
Member

ckormanyos commented Jan 24, 2023

For posterity here's a comparison of the code generated between the C++11 and C++14 version

Thanks Matt.

Is there a Boost.Math/Multiprecision coding rule emerging?

For me it's actually more than posterity. I'm wondering what the take-away is here?

  • Use constexpr std::array<> with caution?
  • Use only small constexpr std::array<>s with dimension limited to what?
  • Don't use constexpr std::array<>?
  • Use constexpr std::array<> if (and only if) the use case is known to be unequivocally and exclusively constexpr in its context?

I know that we can't hash this all out here, but I've encountered (and forgotten) and encountered (and forgotten) this problem a few times.

Cc: @jzmaddock and @mborland

@mborland
Copy link
Member Author

That's a good question. I guess I have falsely assumed that a constexpr std::array<> didn't have to be regenerated every time the function was called. In the liked PR I should be able to get a wrapper class working that would allow us to use static constexpr std::array<> with C++14 instead of C++23. It's kind of an ugly solution but it would be the best of both worlds for the end user. It helps that we are moving to C++14 in the next release anyway because it would definitely break C++11 constexpr guarantees.

jzmaddock added a commit that referenced this issue Jan 27, 2023
Big slowdown in evaluation of constexpr tables.
Fixes: #923
@valiko-ua
Copy link

@mborland After reading this: https://quuxplusone.github.io/blog/2022/07/08/inline-constexpr/
I wonder if we should mark those arrays as static constexpr inline?
Won't we have multiple copies of those arrays in every compilation unit if we don't mark them inline?

@jzmaddock
Copy link
Collaborator

These aren't global variables - they're members of a class template, so I think and certainly hope, they should be merged at link time ever since C++98. Wouldn't hurt to check though!

@valiko-ua
Copy link

These aren't global variables - they're members of a class template, so I think and certainly hope, they should be merged at link time ever since C++98. Wouldn't hurt to check though!

I mean arrays a1, a2, a3 - they are global variables declared in namespace boost::math::detail, outside of function template prime.

@jzmaddock
Copy link
Collaborator

I mean arrays a1, a2, a3 - they are global variables declared in namespace boost::math::detail, outside of function template prime.

No more, they're class members:

@valiko-ua
Copy link

Oops, sorry, I looked at the linked PR at the top.

@valiko-ua
Copy link

Does it make sense to move array values to some macros (A1_VALUES, A2_VALUES, A3_VALUES) to avoid repetition and reduce file size?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants