Intermittent CUDA_ERROR_FILE_NOT_FOUND: file not found errors #679

Open
matthiasdiener opened this issue May 31, 2022 · 10 comments

@matthiasdiener
Member

Happens in eager and lazy mode.

Setting CUDA_CACHE_DISABLE=1 or CUDA_CACHE_DIR to a node-local FS does not appear to fix this.

Maybe an fsync() is missing somewhere in the stack?

See e.g. https://github.com/illinois-ceesd/testing/runs/6662629174?check_suite_focus=true#step:3:3438

*** Running parallel example (2 ranks): ../examples/poiseuille-mpi.py
rank 0: sent all mesh partitions
rank 1: received local mesh (size = 142547)
../examples/poiseuille-mpi.py:178: DeprecationWarning: EagerDGDiscretization is deprecated and will go away in 2022. Use the base DiscretizationCollection with grudge.op instead.
  discr = EagerDGDiscretization(
../examples/poiseuille-mpi.py:178: DeprecationWarning: EagerDGDiscretization is deprecated and will go away in 2022. Use the base DiscretizationCollection with grudge.op instead.
  discr = EagerDGDiscretization(
building face restriction: start
building face restriction: start
building face restriction: done
building face restriction: done
bdry comm rank 0 comm begin
bdry comm rank 1 comm begin
build program: kernel 'einsum3to2_kernel' was part of a lengthy source build resulting from a binary cache miss (0.89 s)
build program: kernel 'einsum3to2_kernel' was part of a lengthy source build resulting from a binary cache miss (0.89 s)
CUDA_ERROR_FILE_NOT_FOUND: file not found
bdry comm rank 0 comm end
ERROR:  One or more process (first noticed rank 1) terminated with signal 6
*** Example ../examples/poiseuille-mpi.py failed.

cc @MTCam

@inducer
Contributor

inducer commented Sep 7, 2022

  • Does this happen if the pocl kernel cache is not on a networked file system?
  • Can we perhaps have pocl check whether the file is there, and if not, have it wait a second or two? (There's a presumption here that this is likely a pocl issue... but I don't know that for sure.)
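
The "wait a second or two" idea, purely as an illustrative sketch (not pocl code; the function name and retry policy here are made up):

/* Illustrative only: wait briefly for the cached PTX file to appear before
 * giving up, in case another process is still in the middle of writing it. */
#include <unistd.h>

static int wait_for_file(const char *path, int max_tries)
{
  for (int i = 0; i < max_tries; ++i) {
    if (access(path, R_OK) == 0)
      return 0;                 /* file is there */
    sleep(1);                   /* wait a second and look again */
  }
  return -1;                    /* still missing after max_tries seconds */
}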

@MTCam
Member

MTCam commented Sep 22, 2022

  • Does this happen if the pocl kernel cache is not on a networked file system?

Setting POCL_CACHE_DIR to a host-local, and even a rank-local, directory does not seem to have an effect on the frequency with which this error occurs. I have tried with and without these and other cache directory settings, with no apparent change.

This error has become far more frequent with recent code. Currently we don't seem to be able to run any production-relevant driver beyond 2 ranks due to this error (or related ones). Running combozzle, or even a simple mixture case, on 4 ranks will currently almost certainly produce one (or more) of these messages:

CUDA_ERROR_FILE_NOT_FOUND: file not found
CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid
pocl-cuda: failed to generate PTX

@inducer
Contributor

inducer commented Sep 23, 2022

The pocl kernel compilation code looks pretty racy:

https://github.com/pocl/pocl/blob/25dd411b0b3f3bd3901d255a7d49bb4f362f6a21/lib/CL/devices/cuda/pocl-cuda.c#L967-L975

(E.g. the existence check races with the write.)
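
Roughly, the pattern there is: check whether the cached PTX file exists, generate and write it if not, then cuModuleLoad it, so a second process can see the file before it has been completely written. A sketch (not the actual pocl code) of the usual way to make the write side safe, by publishing the file atomically:

/* Sketch only: build the PTX in a temporary file in the same directory,
 * fsync it, and rename() it into place. rename() is atomic on POSIX file
 * systems, so readers either see no file or a complete one, never a
 * partially written one. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

static int write_ptx_atomically(const char *ptx_path, const char *ptx, size_t len)
{
  char tmp_path[4096];
  snprintf(tmp_path, sizeof tmp_path, "%s.tmp.%ld", ptx_path, (long)getpid());

  int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
  if (fd < 0)
    return -1;

  if (write(fd, ptx, len) != (ssize_t)len || fsync(fd) != 0) {
    close(fd);
    unlink(tmp_path);
    return -1;
  }
  close(fd);

  /* Atomically publish the finished file under its final cache name. */
  if (rename(tmp_path, ptx_path) != 0) {
    unlink(tmp_path);
    return -1;
  }
  return 0;
}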

@matthiasdiener
Member Author

The pocl kernel compilation code looks pretty racy:

https://github.com/pocl/pocl/blob/25dd411b0b3f3bd3901d255a7d49bb4f362f6a21/lib/CL/devices/cuda/pocl-cuda.c#L967-L975

I think we can't really change the filename of the PTX file (e.g., generate a unique random filename), since that would defeat caching, but wrapping the pocl_ptx_gen and cuModuleLoad operations in a flock might help.
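
A minimal sketch of that flock idea (not the actual pocl code; generate_ptx() is a hypothetical stand-in for the pocl_ptx_gen call, error handling is omitted, and flock may not be dependable on networked file systems):

/* Sketch: serialize PTX generation and module loading with flock(2). A lock
 * file next to the cached PTX is held exclusively while one process
 * generates the file; other processes block in flock() and then find the
 * finished PTX already in the cache. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <cuda.h>

void generate_ptx(const char *ptx_path);  /* hypothetical stand-in */

static CUresult load_ptx_locked(const char *ptx_path, CUmodule *module)
{
  char lock_path[4096];
  snprintf(lock_path, sizeof lock_path, "%s.lock", ptx_path);

  int lock_fd = open(lock_path, O_CREAT | O_RDWR, 0644);
  flock(lock_fd, LOCK_EX);            /* serialize generate + load */

  if (access(ptx_path, R_OK) != 0)
    generate_ptx(ptx_path);           /* only the first process generates */

  CUresult res = cuModuleLoad(module, ptx_path);

  flock(lock_fd, LOCK_UN);
  close(lock_fd);
  return res;
}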

However, this race condition does not seem to explain why we still see these errors when running with a rank-local pocl cache.

@MTCam
Member

MTCam commented Sep 28, 2022

I no longer experience this issue since inducer/arraycontext#198. Close?

@matthiasdiener
Member Author

I no longer experience this issue since inducer/arraycontext#198. Close?

I think the intermittent errors from the initial message still happen sometimes, right?

@MTCam
Member

MTCam commented Sep 28, 2022

I think the intermittent errors from the initial message still happen sometimes, right?

I have not encountered this error at all (on Lassen) since using:

export LOOPY_NO_CACHE=1
export CUDA_CACHE_DISABLE=1

@inducer
Contributor

inducer commented Sep 28, 2022

Are both of those essential? I.e., specifically, can you do without the LOOPY_NO_CACHE=1?

@matthiasdiener
Member Author

I think we haven't seen this issue since setting rank-local POCL_CACHE_DIRs. Should we close this?
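
(For reference, "rank-local POCL_CACHE_DIR" means something along the lines of the sketch below: each MPI rank gets its own pocl kernel cache directory, set before pocl/OpenCL is initialized. The directory layout here is made up, and in the actual Python drivers this is done via the environment rather than in C.)

/* Sketch: give each MPI rank its own pocl kernel cache directory so that
 * ranks never share cache files. POCL_CACHE_DIR must be set before any
 * OpenCL/pocl call creates the cache. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char cache_dir[256];
  snprintf(cache_dir, sizeof cache_dir, "/tmp/pocl-cache-rank%d", rank);
  mkdir(cache_dir, 0700);                       /* ignore EEXIST for brevity */
  setenv("POCL_CACHE_DIR", cache_dir, 1);

  /* ... initialize OpenCL and run the application here ... */

  MPI_Finalize();
  return 0;
}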

@MTCam
Copy link
Member

MTCam commented Apr 21, 2023

Are both of those essential? I.e., specifically, can you do without the LOOPY_NO_CACHE=1?

After not using LOOPY_NO_CACHE for a while, this problem has not cropped back up, but I still have CUDA_CACHE_DISABLE=1 set all the time. We'd better hold on to this issue for a little while longer; this error may come back. I am not sure it is affected at all by our rank- or node-local caching strategies.

matthiasdiener self-assigned this May 23, 2024