long #include times for axom headers #872
Comments
Thanks for writing this up @samuelpmishLLNL! Some comments:
As an aside, are the nvcc timings compile-only, or compile+link? It might be worthwhile to drill down to see where we're spending all this time.
@kanye-quest FWIW, we've worked through the remaining issues to get RAJA to build and run properly with clang-cuda support: LLNL/RAJA#1293. We're wrestling with some CMake export issues between RAJA and camp at the moment in other PRs; those need to get resolved before we can merge the others.
On the CPU, link-time optimization handles this, and I would guess that CUDA's LTO does the same, but I haven't tested it. The functions in question are essentially one-liners (after conditional compilation), so if the link-time optimizer can't inline them, it would have to be practically useless!
I'll reiterate: I am not advocating for abandoning RAJA/Umpire, or for writing an axom replacement. I'm confident that many algorithms in axom benefit significantly (in terms of performance and code clarity) from using RAJA for complex tasks, and should absolutely use it in those cases, as those benefits clearly outweigh the cost (compile time). However, I'm saying that if ArrayBase needs a way to fill arrays, it might be worth considering doing that by writing 5 lines of code, as opposed to using a much more sophisticated tool. Here, the benefit is saving a few lines of code, and the associated cost is a ~30x slowdown in compile time. By analogy with how the standard library splits algorithms across separate headers, axom could similarly separate the expensive-to-compile pieces into separate headers to let users opt in to the features they want (clearly, this is the goal of breaking up axom into sub-packages in the first place).
Compile-only. Statically linking with axom takes another ~250ms (which is insignificant as far as I'm concerned). The main takeaway is to reflect on one of the core C++ design philosophies: "Don't make users pay for what they don't use." In this case, it seems axom is making its users pay a significant compile-time cost for a lot of features that, in practice, they likely will not use. I understand that it is conceptually convenient to have a single header that just grabs a lot of other stuff so you don't have to think about which features reside in which headers, but that convenience is not free: it comes with a surprisingly significant compile-time cost, and I think it would be preferable to let users opt in to only the features they care about.
I'm working on serac and noticing that a lot of our simple tests take a surprisingly long time to compile. We noticed a similar problem a while back, where Inlet in particular had a really bizarre combinatorial explosion of template instantiations that took a long time to compile, but this issue relates to `axom/core.hpp`, not Inlet.

When I started to write a new small CUDA test, I noticed that the trivial executable below takes 0.5s to compile with `nvcc` on my machine, while the version that adds `#include <axom/core.hpp>` surprisingly takes 5.5s.
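The original snippets didn't survive this page's formatting; a minimal reconstruction, assuming an empty `main`:

```cpp
// trivial.cu -- compiles in ~0.5s with nvcc
int main() { return 0; }
```

and, with the axom header added:

```cpp
// trivial_axom.cu -- the same program takes ~5.5s to compile
// once axom/core.hpp is included
#include <axom/core.hpp>
int main() { return 0; }
```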
So, just the act of `#include`ing a header from axom added a whopping 5 seconds (?!) to my compile time.

I profiled the trivial example above w/ `#include <axom/core.hpp>` via clang's `-ftime-trace` flag (which only works for C++, not CUDA), and the data is here (it can be opened in `chrome://tracing`). It revealed a number of things:

~25% of that time is spent on the `#include <immintrin.h>` from `BitUtilities.hpp`. The declarations in `BitUtilities.hpp` are not function templates and do not depend on the intrinsics defined in `immintrin.h`, so the implementation could be moved into a separate file that #includes `immintrin.h` and is only compiled once. (Since these particular functions have `__host__ __device__` annotations, this requires separable compilation, but I believe axom is already using that feature.)
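A sketch of what that split might look like; the function name is illustrative, and the use of axom's `AXOM_HOST_DEVICE` macro and include path is an assumption, not axom's exact code:

```cpp
// BitUtilities.hpp -- declarations only; <immintrin.h> no longer appears here
#include <cstdint>
#include "axom/core/Macros.hpp"  // for AXOM_HOST_DEVICE (path assumed)

namespace axom { namespace utilities {
AXOM_HOST_DEVICE int popCount(std::uint64_t word);
}}  // namespace axom::utilities

// BitUtilities.cpp -- compiled exactly once; the heavy include lives here.
// (Defining a __host__ __device__ function out-of-line requires CUDA
// separable compilation, as noted above.)
#include <immintrin.h>

AXOM_HOST_DEVICE int axom::utilities::popCount(std::uint64_t word)
{
#if defined(__CUDA_ARCH__)
  return __popcll(word);                          // device-side intrinsic
#else
  return static_cast<int>(_mm_popcnt_u64(word));  // host-side intrinsic
#endif
}
```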
~30% of that time is spent on `Determinants.hpp` and `LU.hpp`, which seem like unusual "core" features; most of this time is spent #including umpire stuff for memory allocation. Like above, it seems these allocation/deallocation calls can be abstracted in a way that moves the implementation (and the heavy includes) out of the header file, e.g. by hiding the umpire calls behind a small allocation function (see the sketch below).
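The original "instead of / do" snippets aren't preserved on this page; a minimal sketch of the idea, with hypothetical names (`allocateBytes`/`deallocateBytes` are not axom's actual API):

```cpp
// Before (in the header): every includer pays for the umpire headers.
//   #include "umpire/ResourceManager.hpp"
//   T* data = static_cast<T*>(umpire::ResourceManager::getInstance()
//                 .getAllocator(allocatorID).allocate(n * sizeof(T)));

// After -- MemoryOps.hpp: a declaration is all the header needs.
#include <cstddef>
namespace axom {
void* allocateBytes(std::size_t bytes, int allocatorID);
void deallocateBytes(void* ptr, int allocatorID);
}

// After -- MemoryOps.cpp: compiled once; the umpire includes are hidden here.
#include "umpire/ResourceManager.hpp"

void* axom::allocateBytes(std::size_t bytes, int allocatorID)
{
  return umpire::ResourceManager::getInstance()
      .getAllocator(allocatorID).allocate(bytes);
}

void axom::deallocateBytes(void* ptr, int allocatorID)
{
  umpire::ResourceManager::getInstance()
      .getAllocator(allocatorID).deallocate(ptr);
}
```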
A large share of that time is spent on `ArrayBase.hpp`, 97% of which goes toward `for_all.hpp`. I only see two uses of `for_all` in that header, and they are for filling an array with a single value. I understand that it's convenient to reuse `for_all` here, but a simple kernel definition like the one sketched below accomplishes the same outcome, doesn't impact the compilation time at all (still 0.5s after adding it to the trivial example), and is only a few lines of code.
~8% of that time is spent on `Utilities.hpp`, which includes heavy headers like `<random>`, but the related functions don't actually need to be in the header (e.g. `random_real`; a sketch follows).
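For instance, a minimal sketch of moving such a function out of the header (the real `random_real` signature and implementation may differ):

```cpp
// Utilities.hpp -- declaration only; <random> moves out of the header
double random_real(double lo, double hi);

// Utilities.cpp -- compiled once
#include <random>

double random_real(double lo, double hi)
{
  static std::mt19937_64 gen{std::random_device{}()};
  std::uniform_real_distribution<double> dist(lo, hi);
  return dist(gen);
}
```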
~3% of that time is spent on `Timer.hpp` and its `#include <chrono>`. I don't see any part of `Timer`'s interface that needs to know about `chrono`, so a PImpl version of this class can move the `#include <chrono>` and the implementation out of the header, as sketched below.
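A sketch of that PImpl refactor (member names are illustrative, not axom's actual `Timer`):

```cpp
// Timer.hpp -- no <chrono> in sight; only a forward-declared impl
#include <memory>

class Timer
{
public:
  Timer();
  ~Timer();  // must be defined where TimerImpl is a complete type
  void start();
  void stop();
  double elapsedSeconds() const;

private:
  struct TimerImpl;  // defined in Timer.cpp
  std::unique_ptr<TimerImpl> m_impl;
};

// Timer.cpp -- <chrono> and the implementation are compiled once
#include "Timer.hpp"
#include <chrono>

struct Timer::TimerImpl
{
  std::chrono::steady_clock::time_point begin, end;
};

Timer::Timer() : m_impl(new TimerImpl) {}
Timer::~Timer() = default;
void Timer::start() { m_impl->begin = std::chrono::steady_clock::now(); }
void Timer::stop()  { m_impl->end = std::chrono::steady_clock::now(); }
double Timer::elapsedSeconds() const
{
  return std::chrono::duration<double>(m_impl->end - m_impl->begin).count();
}
```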
The common theme is to avoid putting big #includes and implementations in headers unless necessary.