
long #include times for axom headers #872

Open · samuelpmishLLNL opened this issue Jul 19, 2022 · 4 comments

@samuelpmishLLNL (Contributor)

I'm working on serac and noticing that a lot of our simple tests take a surprisingly long time to compile.

We noticed a similar problem a while back: Inlet in particular had a bizarre combinatorial explosion of template instantiations that took a long time to compile. This issue, however, relates to axom/core.hpp, not Inlet.

When I started to write a new small cuda test, I noticed that the trivial executable,

int main() { return 0; }

takes 0.5s to compile with nvcc on my machine, while

#include <axom/core.hpp>
int main() { return 0; }

surprisingly takes 5.5s.

So, just the act of #include-ing a header from axom added a whopping 5 seconds (?!) to my compile time.


I profiled the trivial example above with #include <axom/core.hpp> via clang's -ftime-trace flag (this only works for C++, not CUDA), and the data is here (it can be opened in chrome://tracing). It revealed a number of things:

  • ~25% of that time is spent on the #include <immintrin.h> from BitUtilities.hpp. The declarations in BitUtilities.hpp are not function templates and do not depend on the intrinsics defined in immintrin.h, so the implementations could be moved into a separate file that #includes immintrin.h and is compiled only once. (Since these particular functions have __host__ __device__ annotations, this requires separable compilation, but I believe axom is already using that feature.)
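A minimal sketch of that split, assuming a popCount-style bit utility (the names, signatures, and AXOM_HOST_DEVICE placement here are illustrative, not Axom's actual declarations):

//--------------------------------
// BitUtilities.hpp -- declarations only; immintrin.h no longer appears here
#include <cstdint>

namespace axom { namespace utilities {
// counts the set bits in a 64-bit word
AXOM_HOST_DEVICE int popCount(std::uint64_t word);
}}  // namespace axom::utilities

//--------------------------------
// BitUtilities.cpp -- the heavy header is included here and compiled once
#include "BitUtilities.hpp"
#include <immintrin.h>

AXOM_HOST_DEVICE int axom::utilities::popCount(std::uint64_t word)
{
#if defined(__CUDA_ARCH__)
  return __popcll(word);                          // CUDA device intrinsic
#elif defined(__POPCNT__)
  return static_cast<int>(_mm_popcnt_u64(word));  // x86 POPCNT intrinsic
#else
  int count = 0;                                  // portable fallback
  for(; word != 0; word &= word - 1) { ++count; }
  return count;
#endif
}
//--------------------------------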

  • ~30% of that time is spent on Determinants.hpp and LU.hpp, which seem like unusual "core" features; most of this time goes to #including umpire headers for memory allocation. As above, these allocation/deallocation calls could be abstracted in a way that moves the implementation (and the heavy includes) out of the header file, e.g.

instead of

//--------------------------------
// src/axom/core/memory_management.hpp
template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  const std::size_t numbytes = n * sizeof(T);
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance(); // now I have to #include all the umpire stuff in this header
  umpire::Allocator allocator = rm.getAllocator(allocID);
  return static_cast<T*>(allocator.allocate(numbytes));
#else
  AXOM_UNUSED_VAR(allocID);
  return static_cast<T*>(std::malloc(numbytes));
#endif
}
//--------------------------------

do

//--------------------------------
// src/axom/core/memory_management.hpp 

void * allocate(std::size_t num_bytes, int allocID);

template <typename T>
inline T* allocate(std::size_t n, int allocID) noexcept
{
  return static_cast<T *>(allocate(n * sizeof(T), allocID));
}

//--------------------------------
// src/axom/core/memory_management.cpp

#include <big/umpire/headers.hpp> // only included in the .cpp file, compiled once
#include <cstdlib>                // for std::malloc

void * allocate(std::size_t num_bytes, int allocID) {
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  return rm.getAllocator(allocID).allocate(num_bytes);
#else
  AXOM_UNUSED_VAR(allocID);
  return std::malloc(num_bytes);
#endif
}
//--------------------------------
  • 15% of that time is spent on ArrayBase.hpp, 97% of which goes toward for_all.hpp. I only see two uses of for_all in that header, and they are for filling an array with a single value. I understand that it's convenient to reuse for_all here, but a simple kernel definition like
template <typename T>
__global__ void fill(T* ptr, T value, size_t n) {
  size_t tid = threadIdx.x + blockIdx.x * blockDim.x;
  if (tid < n) { ptr[tid] = value; }
}

accomplishes the same outcome, doesn't impact the compilation time at all (still 0.5s after adding this to the trivial example), and is only a few lines of code.
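For completeness, here is a sketch of how a host-side wrapper might launch that kernel (the wrapper name and launch configuration are hypothetical, not Axom code):

// hypothetical host-side launcher for the fill kernel above;
// the block size is an illustrative choice
template <typename T>
void fill_on_device(T* ptr, T value, std::size_t n)
{
  constexpr unsigned blockSize = 256;
  const unsigned numBlocks = static_cast<unsigned>((n + blockSize - 1) / blockSize);
  fill<<<numBlocks, blockSize>>>(ptr, value, n);
}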

  • ~8% of that time is spent on Utilities.hpp, which includes heavy headers like <random>, but the related functions (e.g. random_real) don't actually need to be defined in the header.

  • ~3% of that time is spent on Timer.hpp and its #include <chrono>. I don't see any part of Timer's interface that needs to know about chrono, so a PImpl version of this class could move the #include <chrono> and the implementation out of the header.
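A minimal PImpl sketch of that idea (the method names are modeled loosely on a timer interface; this is not Axom's actual Timer declaration):

//--------------------------------
// Timer.hpp -- interface only; <chrono> no longer appears here
#include <memory>

class Timer
{
public:
  Timer();
  ~Timer();  // defined in Timer.cpp, where Impl is a complete type
  void start();
  void stop();
  double elapsedTimeInSec() const;

private:
  struct Impl;                   // holds the std::chrono time points
  std::unique_ptr<Impl> m_impl;
};

//--------------------------------
// Timer.cpp -- <chrono> is included here and compiled once
#include "Timer.hpp"
#include <chrono>

struct Timer::Impl
{
  std::chrono::steady_clock::time_point m_start, m_stop;
};

Timer::Timer() : m_impl(new Impl) {}
Timer::~Timer() = default;
void Timer::start() { m_impl->m_start = std::chrono::steady_clock::now(); }
void Timer::stop() { m_impl->m_stop = std::chrono::steady_clock::now(); }

double Timer::elapsedTimeInSec() const
{
  return std::chrono::duration<double>(m_impl->m_stop - m_impl->m_start).count();
}
//--------------------------------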


The common theme is to avoid putting big #includes and implementations in headers, unless necessary.

@publixsubfan (Contributor)

Thanks for writing this up @samuelpmishLLNL! Some comments:

  • immintrin.h: This does seem like a large independent contributor to compile times that would benefit from a compilation firewall. However, I am worried about the impact to inlineability -- we expect these to reduce down to their corresponding CUDA intrinsics within a kernel. Does LTO reliably take care of this for us?

  • memory_management.hpp/Umpire: Fully agree on this one; inlining shouldn't be a concern for an operation as expensive as memory allocation. It seems that a considerable portion of Umpire's compile-time cost (~40%) comes from including standard headers like <memory>, <unordered_map>, and <algorithm>.

  • for_all.hpp: It seems like the entirety of this header's compile-time cost is actually from RAJA.hpp. I believe Axom currently requires that RAJA and Umpire both be enabled for CUDA support, and I don't know if providing our own GPU kernel abstraction independent of RAJA is a road we want to go down. In any case, I think for_all.hpp is independently included in the axom/core.hpp unified header.

  • random/chrono: These look like simple enough changes with minimal downside.


As an aside, are the nvcc timings compile-only, or compile+link? It might be worthwhile to drill down to see where we're spending all this time.

  • On the clang host-only build, what's the timing for the overall compile+link? Can you switch out GNU ld for lld? (-fuse-ld=lld on the link line IIRC)
  • I'm unsure of where Axom is w.r.t. clang-cuda support, but if -ftime-trace works with it, we could see where we're spending compile time on the device-code side.

@rhornung67 (Member)

@kanye-quest FWIW, we've worked through the remaining issues to get RAJA to build and run properly with clang-cuda support: LLNL/RAJA#1293. We're wrestling with some CMake export issues between RAJA and camp at the moment in other PRs; those need to be resolved before we can merge the others.

@samuelpmishLLNL (Contributor, Author)

> immintrin.h: This does seem like a large independent contributor to compile times that would benefit from a compilation firewall. However, I am worried about the impact to inlineability -- we expect these to reduce down to their corresponding CUDA intrinsics within a kernel. Does LTO reliably take care of this for us?

On the CPU, -flto does give you the inlining you want: https://godbolt.org/z/9x6jrorjs

I would guess that CUDA's LTO does the same, but I haven't tested. The functions in question are essentially one-liners (after conditional compilation), so if their link time optimizer can't inline that, it would have to be practically useless!

> for_all.hpp: It seems like the entirety of this header's compile-time cost is actually from RAJA.hpp. I believe Axom currently requires that RAJA and Umpire both be enabled for CUDA support, and I don't know if providing our own GPU kernel abstraction independent of RAJA is a road we want to go down. In any case, I think for_all.hpp is independently included in the axom/core.hpp unified header.

I'll reiterate: I am not advocating for abandoning RAJA/Umpire, or writing an axom replacement. I'm confident that many algorithms in axom benefit significantly (in terms of performance and code clarity) by using RAJA for complex tasks, and should absolutely use it in those cases, as those benefits clearly outweigh the costs (compile time).

However, I'm saying that if ArrayBase needs a way to fill arrays, it might be worth considering doing that by writing 5 lines of code, as opposed to using a much more sophisticated tool. Here, the benefit is saving a few lines of code, and the associated cost is a 30x slowdown on the #include time. To me, this is an unattractive tradeoff.

By analogy, algorithms like std::sort, std::transform, etc. are incredibly useful pieces of the standard library, but you have to explicitly #include <algorithm> to use them, because <algorithm> is notoriously expensive to include. The standard library could have made its container headers like <vector>, <array>, etc. automatically include <algorithm>, but they don't, because that would incur a significant cost the user would have no way to avoid.

Axom could similarly separate the expensive-to-compile pieces into their own headers to let users opt in to exactly the features they want (clearly, this is the goal of breaking axom up into sub-packages in the first place).
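Purely as an illustration, opt-in usage could then look something like the following (the fine-grained header paths are hypothetical, not Axom's actual layout):

// hypothetical fine-grained includes; these paths are illustrative only
#include <axom/core/memory.hpp>  // thin allocate/deallocate wrappers
#include <axom/core/Timer.hpp>   // PImpl'd timer, no <chrono> in the interface

// heavyweight, RAJA-backed machinery is pulled in only where it is needed:
// #include <axom/core/for_all.hpp>

int main() { return 0; }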

> As an aside, are the nvcc timings compile-only, or compile+link? It might be worthwhile to drill down to see where we're spending all this time.

Compile-only. Statically linking with axom takes another ~250ms (which is insignificant as far as I'm concerned).


The main takeaway is to reflect on one of the core C++ design philosophies:

"Don't make users pay for what they don't use"

And in this case, it seems axom is making its users pay a significant compile-time cost for many features that, in practice, they will likely never use.

I understand that it is conceptually convenient to provide a single header that simply grabs everything else, so users don't have to think about which features reside in which headers, but that convenience is not free. It comes with a surprisingly significant compile-time cost, and I think it would be preferable to let users opt in to only the features they care about.

@rhornung67 (Member)

  • Reorg memory stuff as described
  • Break apart core numerics stuff?
  • Set up a meeting ("bug fix" day?) @rhornung67
