Intermittent CUDA_ERROR_FILE_NOT_FOUND: file not found errors #679
Comments
|
Setting POCL_CACHE_DIR to a host-local, or even rank-local, path does not seem to affect how often this error occurs. I have tried both with and without these and other cache directory settings, with no apparent change. This error has become far more frequent with recent code. Currently we don't seem to be able to run any production-relevant driver beyond 2 ranks due to this error (or related ones). Running 4 ranks for
|
The pocl kernel compilation code looks pretty racy: (E.g. the existence check races with the write.) |
I think we can't really change the filename of the ptx file (e.g., generate a unique random filename) since that would defeat caching, but maybe wrapping the However, this race condition seems to not explain that we still see these errors when running with a rank-local pocl cache. |
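For illustration, one common way to close an "existence check races with the write" window while keeping a stable cache filename is to write to a temporary file in the same directory and then rename it into place. The sketch below is in Python and is not pocl's code; the function name `write_cache_entry_atomically` and the file names are hypothetical. The `fsync` call is included because a missing flush was suspected in this thread, not because it is known to be the cause.

```python
import os
import tempfile


def write_cache_entry_atomically(final_path, data):
    """Write bytes to final_path so readers never observe a partial file.

    Hypothetical helper for illustration only; not part of pocl.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    os.makedirs(directory, exist_ok=True)

    # Create the temp file in the same directory so that the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the contents to disk before publishing
        # os.replace swaps the name in one step: a reader sees either the
        # old file (or no file) or the complete new file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern two processes may still both compile the same kernel, but whichever finishes last simply replaces an identical file, so the cache key does not need to change. Rename atomicity is a POSIX guarantee on local filesystems; behavior on network or parallel filesystems can differ.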
I no longer experience this issue since inducer/arraycontext#198. Close? |
I think the intermittent errors from the initial message still happen sometimes, right? |
I have not encountered this error at all (on Lassen) since using:
|
Are both of those essential? I.e., specifically, can you do without the |
I think we haven't seen this issue since setting rank-local POCL_CACHE_DIRs. Should we close this? |
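A minimal sketch of what "rank-local POCL_CACHE_DIRs" could look like when set from Python, assuming the rank is already known from the launcher. The helper name `set_rank_local_pocl_cache` and the directory naming scheme are hypothetical; only the POCL_CACHE_DIR variable itself comes from this thread.

```python
import os
import tempfile


def set_rank_local_pocl_cache(rank, base_dir=None, env=os.environ):
    """Point POCL_CACHE_DIR at a per-rank directory and return its path.

    Hypothetical helper for illustration; the naming scheme is an assumption.
    """
    if base_dir is None:
        # Assumption: the system temp directory is node-local storage.
        base_dir = tempfile.gettempdir()
    cache_dir = os.path.join(base_dir, f"pocl-cache-rank{rank}")
    os.makedirs(cache_dir, exist_ok=True)
    env["POCL_CACHE_DIR"] = cache_dir
    return cache_dir
```

This would have to run before the OpenCL runtime is first loaded in the process, since the variable is read by pocl itself rather than by the Python code.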
After not using LOOPY_NO_CACHE for a while, this problem has not cropped back up. But I still have CUDA_CACHE_DISABLE=1 set all the time. We had better keep this issue open for just a little while longer, since this error may come back. I am not sure it is affected at all by our rank-or-node-local caching strategies. |
pocl/pocl#1480 should have addressed this, closing. |
Happens in eager and lazy mode.
Setting CUDA_CACHE_DISABLE=1 or CUDA_CACHE_DIR to a node-local FS does not appear to fix this.
Maybe it is missing an fsync() somewhere in the stack?
See e.g. https://github.com/illinois-ceesd/testing/runs/6662629174?check_suite_focus=true#step:3:3438
cc @MTCam