RuntimeError: Job a6df06ad-b422-4aa4-9b19-9408f0a1a40c failed for more than 3 times #544
-
HELP!
Replies: 3 comments 3 replies
-
I am facing the same problem. I checked the [...] And when I manually submit the [...]
-
I had the same problem. Repeated submissions ended with:

RuntimeError: Job: 9b65aa9c199cd023196be654948f0a7cee94a8ed 76904 failed 3 times.
*** Error in `/home/Haichao/ANACONDA3/envs/deepmd/bin/Python': double free or corruption (out): 0x00002b312c11b940 ***

Looking at the debug file, I found:

RuntimeError: Failed to detect availbe GPUs due to:
2024-10-29 06:27:28.129500: I tensorflow/core/platform/cpu.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-29 06:27:28.616975: W tensorflow/stream/cuda/cuda.cc:374] A non-primary context 0x5ba7720 for device 0 exists before initializing the StreamExecutor. The primary context is now 0x2b792b8d0752. We haven't verified StreamExecutor works with that.
2024-10-29 06:27:28.617183: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: Cuda: out of memory; total memory reported: 12627148800
The error log tells you that the job a6df06ad-xxxx has failed. At this point, go to the path where this task was executed. It should be a folder inside work_path (which you specify in machine.json). Enter that folder and you will see a script file, such as xxxx.sub, and some log files. The error log should show why the job failed. You can also try submitting the script manually to reproduce the failure, and then fix the problem.
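
If the work_path contains many task folders, the lookup described above can be scripted. The following is only a rough sketch under assumptions: it assumes the submission script ends in .sub and that log files match patterns like *log*, *err* or *.out, which may not hold for your setup. The helper names find_job_dirs, tail and show_logs are made up for this example and are not part of DP-GEN or dpdispatcher.

```python
import sys
from pathlib import Path


def find_job_dirs(work_path, job_prefix=""):
    """Return folders under work_path that hold a *.sub script whose
    file name starts with job_prefix (assumed naming, adjust as needed)."""
    root = Path(work_path)
    return sorted({p.parent for p in root.rglob(f"{job_prefix}*.sub")})


def tail(path, n=20):
    """Return the last n lines of a text file, tolerating odd bytes."""
    lines = Path(path).read_text(errors="replace").splitlines()
    return lines[-n:]


def show_logs(job_dir, patterns=("*log*", "*err*", "*.out"), n=20):
    """Print the tail of every file in job_dir matching the assumed log patterns."""
    seen = set()
    for pattern in patterns:
        for f in sorted(Path(job_dir).glob(pattern)):
            if f.is_file() and f not in seen:
                seen.add(f)
                print(f"==> {f} <==")
                print("\n".join(tail(f, n)))


# Only runs when a work_path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    prefix = sys.argv[2] if len(sys.argv) > 2 else ""
    for job_dir in find_job_dirs(sys.argv[1], prefix):
        print(f"# job folder: {job_dir}")
        show_logs(job_dir)
```

Saved under a name of your choice (for example find_job.py), it could be run as "python find_job.py /path/to/work_path a6df06ad", where the second argument is the beginning of the job ID from the error message. Once the folder is found, submitting the .sub script by hand with your scheduler's usual command should reproduce the failure.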