Intermittent CUDA_ERROR_FILE_NOT_FOUND: file not found errors #679

Open
matthiasdiener opened this issue May 31, 2022 · 10 comments

@matthiasdiener
Member

Happens in eager and lazy mode.

Setting CUDA_CACHE_DISABLE=1 or CUDA_CACHE_DIR to a node-local FS does not appear to fix this.

Maybe an fsync() is missing somewhere in the stack?

See e.g. https://github.com/illinois-ceesd/testing/runs/6662629174?check_suite_focus=true#step:3:3438

*** Running parallel example (2 ranks): ../examples/poiseuille-mpi.py
rank 0: sent all mesh partitions
rank 1: received local mesh (size = 142547)
../examples/poiseuille-mpi.py:178: DeprecationWarning: EagerDGDiscretization is deprecated and will go away in 2022. Use the base DiscretizationCollection with grudge.op instead.
  discr = EagerDGDiscretization(
../examples/poiseuille-mpi.py:178: DeprecationWarning: EagerDGDiscretization is deprecated and will go away in 2022. Use the base DiscretizationCollection with grudge.op instead.
  discr = EagerDGDiscretization(
building face restriction: start
building face restriction: start
building face restriction: done
building face restriction: done
bdry comm rank 0 comm begin
bdry comm rank 1 comm begin
build program: kernel 'einsum3to2_kernel' was part of a lengthy source build resulting from a binary cache miss (0.89 s)
build program: kernel 'einsum3to2_kernel' was part of a lengthy source build resulting from a binary cache miss (0.89 s)
CUDA_ERROR_FILE_NOT_FOUND: file not found
bdry comm rank 0 comm end
ERROR:  One or more process (first noticed rank 1) terminated with signal 6
*** Example ../examples/poiseuille-mpi.py failed.

cc @MTCam

@inducer
Contributor

inducer commented Sep 7, 2022

  • Does this happen if the pocl kernel cache is not on a networked file system?
  • Can we perhaps have pocl check whether the file is there, and if not, have it wait a second or two? (There's a presumption here that this is likely a pocl issue... but I don't know that for sure.)
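
The "wait a second or two" idea, purely as an illustrative sketch (not pocl code; the function name and retry policy here are made up):

/* Illustrative only: wait briefly for the cached PTX file to appear before
 * giving up, in case another process is still in the middle of writing it. */
#include <unistd.h>

static int wait_for_file(const char *path, int max_tries)
{
  for (int i = 0; i < max_tries; ++i) {
    if (access(path, R_OK) == 0)
      return 0;                 /* file is there */
    sleep(1);                   /* wait a second and look again */
  }
  return -1;                    /* still missing after max_tries seconds */
}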

@MTCam
Member

MTCam commented Sep 22, 2022

  • Does this happen if the pocl kernel cache is not on a networked file system?

Setting POCL_CACHE_DIR to a host-local, and even a rank-local, directory does not seem to have an effect on the frequency with which this error occurs. I have tried with and without these and other cache directory settings, with no apparent change.

This error has become far more frequent with recent code. Currently we don't seem to be able to run any production-relevant driver beyond 2 ranks due to this error (or related ones). Running combozzle, or even a simple mixture case, on 4 ranks will currently almost certainly produce one (or more) of these messages:

CUDA_ERROR_FILE_NOT_FOUND: file not found
CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid
pocl-cuda: failed to generate PTX

@inducer
Contributor

inducer commented Sep 23, 2022

The pocl kernel compilation code looks pretty racy:

https://github.com/pocl/pocl/blob/25dd411b0b3f3bd3901d255a7d49bb4f362f6a21/lib/CL/devices/cuda/pocl-cuda.c#L967-L975

(E.g. the existence check races with the write.)
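
Roughly, the pattern there is: check whether the cached PTX file exists, generate and write it if not, then cuModuleLoad it, so a second process can see the file before it has been completely written. A sketch (not the actual pocl code) of the usual way to make the write side safe, by publishing the file atomically:

/* Sketch only: build the PTX in a temporary file in the same directory,
 * fsync it, and rename() it into place. rename() is atomic on POSIX file
 * systems, so readers either see no file or a complete one, never a
 * partially written one. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

static int write_ptx_atomically(const char *ptx_path, const char *ptx, size_t len)
{
  char tmp_path[4096];
  snprintf(tmp_path, sizeof tmp_path, "%s.tmp.%ld", ptx_path, (long)getpid());

  int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
  if (fd < 0)
    return -1;

  if (write(fd, ptx, len) != (ssize_t)len || fsync(fd) != 0) {
    close(fd);
    unlink(tmp_path);
    return -1;
  }
  close(fd);

  /* Atomically publish the finished file under its final cache name. */
  if (rename(tmp_path, ptx_path) != 0) {
    unlink(tmp_path);
    return -1;
  }
  return 0;
}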

@matthiasdiener
Member Author

The pocl kernel compilation code looks pretty racy:

https://github.com/pocl/pocl/blob/25dd411b0b3f3bd3901d255a7d49bb4f362f6a21/lib/CL/devices/cuda/pocl-cuda.c#L967-L975

I think we can't really change the filename of the PTX file (e.g., generate a unique random filename), since that would defeat caching, but wrapping the pocl_ptx_gen and cuModuleLoad operations in a flock might help.
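
A minimal sketch of that flock idea (not the actual pocl code; generate_ptx() is a hypothetical stand-in for the pocl_ptx_gen call, error handling is omitted, and flock may not be dependable on networked file systems):

/* Sketch: serialize PTX generation and module loading with flock(2). A lock
 * file next to the cached PTX is held exclusively while one process
 * generates the file; other processes block in flock() and then find the
 * finished PTX already in the cache. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <cuda.h>

void generate_ptx(const char *ptx_path);  /* hypothetical stand-in */

static CUresult load_ptx_locked(const char *ptx_path, CUmodule *module)
{
  char lock_path[4096];
  snprintf(lock_path, sizeof lock_path, "%s.lock", ptx_path);

  int lock_fd = open(lock_path, O_CREAT | O_RDWR, 0644);
  flock(lock_fd, LOCK_EX);            /* serialize generate + load */

  if (access(ptx_path, R_OK) != 0)
    generate_ptx(ptx_path);           /* only the first process generates */

  CUresult res = cuModuleLoad(module, ptx_path);

  flock(lock_fd, LOCK_UN);
  close(lock_fd);
  return res;
}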

However, this race condition does not seem to explain why we still see these errors when running with a rank-local pocl cache.

@MTCam
Member

MTCam commented Sep 28, 2022

I no longer experience this issue since inducer/arraycontext#198. Close?

@matthiasdiener
Member Author

I no longer experience this issue since inducer/arraycontext#198. Close?

I think the intermittent errors from the initial message still happen sometimes, right?

@MTCam
Member

MTCam commented Sep 28, 2022

I think the intermittent errors from the initial message still happen sometimes, right?

I have not encountered this error at all (on Lassen) since using:

export LOOPY_NO_CACHE=1
export CUDA_CACHE_DISABLE=1

@inducer
Contributor

inducer commented Sep 28, 2022

Are both of those essential? I.e., specifically, can you do without the LOOPY_NO_CACHE=1?

@matthiasdiener
Member Author

I think we haven't seen this issue since setting rank-local POCL_CACHE_DIRs. Should we close this?
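
(For reference, "rank-local POCL_CACHE_DIR" means something along the lines of the sketch below: each MPI rank gets its own pocl kernel cache directory, set before pocl/OpenCL is initialized. The directory layout here is made up, and in the actual Python drivers this is done via the environment rather than in C.)

/* Sketch: give each MPI rank its own pocl kernel cache directory so that
 * ranks never share cache files. POCL_CACHE_DIR must be set before any
 * OpenCL/pocl call creates the cache. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char cache_dir[256];
  snprintf(cache_dir, sizeof cache_dir, "/tmp/pocl-cache-rank%d", rank);
  mkdir(cache_dir, 0700);                       /* ignore EEXIST for brevity */
  setenv("POCL_CACHE_DIR", cache_dir, 1);

  /* ... initialize OpenCL and run the application here ... */

  MPI_Finalize();
  return 0;
}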

@MTCam
Copy link
Member

MTCam commented Apr 21, 2023

Are both of those essential? I.e., specifically, can you do without the LOOPY_NO_CACHE=1?

After not using LOOPY_NO_CACHE for a while, this problem has not cropped back up, but I still have CUDA_CACHE_DISABLE=1 set all the time. We'd better hold on to this issue for a little while longer; this error may come back. I am not sure it is affected at all by our rank- or node-local caching strategies.

matthiasdiener self-assigned this May 23, 2024