Building with high parallelism and CUDA support results in sporadic build failures #1313

Open
amarshall opened this issue Jul 24, 2023 · 5 comments


@amarshall

Building with 48 threads, 19 of 50 sequential builds failed (38% failure rate). I am building via the nixpkgs derivation, but I don’t see any reason why this would be specific to that build environment. Building without CUDA produced no failures in 50 runs.
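
A minimal sketch of how such a failure rate could be measured, for anyone who wants to reproduce it. The function name `failure_rate` and the command passed to it are placeholders, not part of nixpkgs or OpenSubdiv; substitute whatever invokes a clean build in your environment.

```python
import subprocess
import sys


def failure_rate(cmd, runs):
    """Run cmd `runs` times sequentially; return (failures, fraction failed)."""
    failures = 0
    for _ in range(runs):
        # A non-zero exit status counts as a failed build.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures, failures / runs


# Placeholder command that always succeeds; replace with the real build.
failed, rate = failure_rate([sys.executable, "-c", "pass"], runs=3)
print(f"{failed} of 3 runs failed ({rate:.0%})")
```

With the numbers reported here, 19 failures in 50 runs gives 19 / 50 = 0.38, which formats as 38%.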

My guess is that there’s an implicit dependency somewhere. I spent a little time trying to find it but did not succeed (I’m not very proficient with CMake).

I have seen at least two different failures:

CMake Error at /nix/store/0dv0ylafnx7cdajyv9ahbpqrniblixq1-cmake-3.26.4/share/cmake-3.26/Modules/FindCUDA/make2cmake.cmake:48 (file):
  file failed to open for reading (No such file or directory):

    /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.NVCC-depend


CMake Error at osd_static_gpu_generated_cudaKernel.cu.o.Release.cmake:236 (message):
  Error generating
  /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/./osd_static_gpu_generated_cudaKernel.cu.o


make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o] Error 1

and

Error copying file (if different) from "/build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.depend.tmp" to "/build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o.depend".
CMake Error at osd_static_gpu_generated_cudaKernel.cu.o.Release.cmake:246 (message):
  Error generating
  /build/source/build/opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/./osd_static_gpu_generated_cudaKernel.cu.o


make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/osd_static_gpu_generated_cudaKernel.cu.o] Error 1
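
One detail is common to both logs: the failing rule is reported from osd_dynamic_gpu.dir/build.make, yet the object it generates lives under osd_static_gpu.dir. That would be consistent with two targets sharing one generated file, though this is only a guess. The sketch below is an illustration, not part of the build: it extracts the two target names from a make error line so that such mismatches can be spotted when scanning many logs. The regular expression and the function name `rule_owner_and_output` are assumptions based on the line format shown above.

```python
import re

# Matches lines of the form:
#   make[2]: *** [<dir>/CMakeFiles/<owner>.dir/build.make:<n>: <dir>/CMakeFiles/<output>.dir/...] Error 1
MAKE_ERROR = re.compile(
    r"\*\*\* \[.*?CMakeFiles/(?P<owner>[^/]+)\.dir/build\.make:\d+: "
    r".*?CMakeFiles/(?P<output>[^/]+)\.dir/"
)


def rule_owner_and_output(line):
    """Return (target owning the rule, target dir of the output), or None."""
    match = MAKE_ERROR.search(line)
    if match is None:
        return None
    return match.group("owner"), match.group("output")


line = (
    "make[2]: *** [opensubdiv/CMakeFiles/osd_dynamic_gpu.dir/build.make:77: "
    "opensubdiv/CMakeFiles/osd_static_gpu.dir/osd/"
    "osd_static_gpu_generated_cudaKernel.cu.o] Error 1"
)
owner, output = rule_owner_and_output(line)
# prints: osd_dynamic_gpu osd_static_gpu MISMATCH
print(owner, output, "MISMATCH" if owner != output else "ok")
```
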
@davidgyu
Member

Filed as internal issue #OSD-426

@davidgyu
Member

Interesting. We haven't seen that before.
Can you tell us more about your system configuration: OS, Compiler, GPU, Driver version, CUDA version?

@amarshall
Author

amarshall commented Jul 25, 2023

Hi! Thanks for the reply.

  • OS is NixOS @ NixOS/nixpkgs@9ca7856 (Linux Kernel 6.1)
  • OpenSubdiv source @ v3.5.0
  • GCC 12.3.0 (note that -DCUDA_HOST_COMPILER points to a different GCC, 11.4.0), CMake 3.26.4
  • CUDA toolkit 11.8.0
  • CPU is AMD 3960X (24-core, 48-threads), 192 GB RAM
  • GPU is 3080 Ti with driver 535.86.05 (however I think this should not matter, as I don’t believe the GPU is used during build)
Log output of configure stage + build flags

Note that I have manually wrapped the cmake flags to make them easier to read.

@nix { "action": "setPhase", "phase": "configurePhase" }
configuring
fixing cmake files...
cmake flags:
  -DCMAKE_FIND_USE_SYSTEM_PACKAGE_REGISTRY=OFF
  -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF
  -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON
  -DCMAKE_BUILD_TYPE=Release
  -DBUILD_TESTING=OFF
  -DCMAKE_INSTALL_LOCALEDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/locale
  -DCMAKE_INSTALL_LIBEXECDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/libexec
  -DCMAKE_INSTALL_LIBDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/lib
  -DCMAKE_INSTALL_DOCDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/doc/OpenSubdiv
  -DCMAKE_INSTALL_INFODIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/info
  -DCMAKE_INSTALL_MANDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/share/man
  -DCMAKE_INSTALL_OLDINCLUDEDIR=/nix/store/1np3p9y42nv1m06ywspgqj20r5p41xla-opensubdiv-3.5.0-dev/include
  -DCMAKE_INSTALL_INCLUDEDIR=/nix/store/1np3p9y42nv1m06ywspgqj20r5p41xla-opensubdiv-3.5.0-dev/include
  -DCMAKE_INSTALL_SBINDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/sbin
  -DCMAKE_INSTALL_BINDIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/bin
  -DCMAKE_INSTALL_NAME_DIR=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0/lib
  -DCMAKE_POLICY_DEFAULT_CMP0025=NEW
  -DCMAKE_OSX_SYSROOT=
  -DCMAKE_FIND_FRAMEWORK=LAST
  -DCMAKE_STRIP=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/strip
  -DCMAKE_RANLIB=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/ranlib
  -DCMAKE_AR=/nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/ar
  -DCMAKE_C_COMPILER=gcc
  -DCMAKE_CXX_COMPILER=g++
  -DCMAKE_INSTALL_PREFIX=/nix/store/aw2139d316dfcan625spblpib2449b33-opensubdiv-3.5.0
  -DNO_TUTORIALS=1
  -DNO_REGRESSION=1
  -DNO_EXAMPLES=1
  -DNO_METAL=1
  -DGLEW_INCLUDE_DIR=/nix/store/55n26bd7l2jdxj8fkh688nrv290d3hp8-glew-2.2.0-dev/include
  -DGLEW_LIBRARY=/nix/store/55n26bd7l2jdxj8fkh688nrv290d3hp8-glew-2.2.0-dev/lib
  -DOSD_CUDA_NVCC_FLAGS=--gpu-architecture=compute_37
  -DCUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin/cc
  -DNO_OPENCL=1
  -DCUDA_TOOLKIT_ROOT_DIR=/nix/store/vxw61j9ff7d5jdq2cwy1bh4q5j82jvy5-cudatoolkit-11.8.0
  -DCUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin
  -DCMAKE_CUDA_HOST_COMPILER=/nix/store/m3lj9k2f39yplgr81pv9j1p13p3mq0pz-gcc-wrapper-11.4.0/bin
-- The C compiler identification is GNU 12.3.0
-- The CXX compiler identification is GNU 12.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /nix/store/x7n44lfys59k5ajj9w1fkxw5391cnn5v-gcc-wrapper-12.3.0/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Compiling OpenSubdiv version v3_5_0
-- Using cmake version 3.26.4
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Could NOT find TBB (missing: TBB_INCLUDE_DIR TBB_LIBRARIES) (Required is at least version "4.0")
-- Found OpenGL: /nix/store/xibw0p5bj2z3a566mannk3vflb9f5fph-libGL-1.6.0/lib/libOpenGL.so   
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Found CUDA: /nix/store/vxw61j9ff7d5jdq2cwy1bh4q5j82jvy5-cudatoolkit-11.8.0 (found suitable version "11.8", minimum required is "4.0") 
-- Found X11: /nix/store/gz38plw089ri9k2lh7gzhh58ydhb3rv1-xorgproto-2023.2/include   
-- Looking for XOpenDisplay in /nix/store/igp21718s3sa932z7baqnhlc72v0zl0z-libX11-1.8.6/lib/libX11.so;/nix/store/4s3wrg560496dx3qx8gnvvjqz4hc9222-libXext-1.3.5/lib/libXext.so
-- Looking for XOpenDisplay in /nix/store/igp21718s3sa932z7baqnhlc72v0zl0z-libX11-1.8.6/lib/libX11.so;/nix/store/4s3wrg560496dx3qx8gnvvjqz4hc9222-libXext-1.3.5/lib/libXext.so - found
-- Looking for gethostbyname
-- Looking for gethostbyname - found
-- Looking for connect
-- Looking for connect - found
-- Looking for remove
-- Looking for remove - found
-- Looking for shmat
-- Looking for shmat - found
-- Could NOT find GLFW (missing: GLFW_INCLUDE_DIR GLFW_LIBRARIES) (Required is at least version "3.0.0")
-- Could NOT find PTex (missing: PTEX_INCLUDE_DIR PTEX_LIBRARY) (Required is at least version "2.0")
-- Could NOT find ZLIB (missing: ZLIB_LIBRARY ZLIB_INCLUDE_DIR) (Required is at least version "1.2")
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE) (Required is at least version "1.8.4")
-- Could NOT find Docutils (missing: RST2HTML_EXECUTABLE DOCUTILS_VERSION) (Required is at least version "0.9")
-- Found Python: /nix/store/9c03r86hcdn43dm3hsgjirifvyzfkhwh-python3-3.10.12/bin/python3.10 (found version "3.10.12") found components: Interpreter 
CMake Warning at CMakeLists.txt:430 (message):
  TBB was not found : support for TBB parallel compute kernels will be
  disabled in Osd.  If your compiler supports TBB directives, please refer to
  the FindTBB.cmake shared module in your cmake installation.


CMake Warning at CMakeLists.txt:619 (message):
  Ptex was not found : the OpenSubdiv Ptex example will not be available.  If
  you do have Ptex installed and see this message, please add your Ptex path
  to FindPTex.cmake in /build/source/cmake or set it through the
  PTEX_LOCATION cmake command line argument or environment variable.


CMake Warning at documentation/CMakeLists.txt:52 (message):
  Doxyen was not found : support for Doxygen automated API documentation is
  disabled.


-- Configuring done (3.6s)
-- Generating done (0.0s)
CMake Warning:
  Manually-specified variables were not used by the project:

    BUILD_TESTING
    CMAKE_EXPORT_NO_PACKAGE_REGISTRY
    CMAKE_POLICY_DEFAULT_CMP0025
    GLEW_LIBRARY


-- Build files have been written to: /build/source/build
cmake: enabled parallel building
cmake: enabled parallel installing
@nix { "action": "setPhase", "phase": "buildPhase" }
building
build flags: -j48 SHELL=/nix/store/a7f7xfp9wyghf44yv6l6fv9dfw492hd3-bash-5.2-p15/bin/bash

(Remainder of logs omitted)

@davidgyu
Member

Thanks for the additional information!

@bonsairobo

I just hit this failure when building nixpkgs. The build succeeded on retry. Just making it known that the workaround is not a silver bullet.
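
Since a retry succeeded here, a retry loop is one stopgap until the underlying race is found. The sketch below is only an illustration: `build_with_retries` is a hypothetical helper, the command is a placeholder for the real build invocation, and retrying hides the failure rather than fixing it.

```python
import subprocess
import sys


def build_with_retries(cmd, max_attempts=3):
    """Run cmd until it exits with status 0.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
    return None


# Placeholder command standing in for the real build invocation.
attempt = build_with_retries([sys.executable, "-c", "pass"])
print("succeeded on attempt", attempt)
```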
