Not running with GPU #25

YunjieChang · 2022-08-31T05:26:28Z

Hi Tim,

I just installed cryoCARE on our HPC following the "For CUDA 10" installation procedure and did not encounter any errors during the installation.

However, I got the following message when I tried to run the training process (cryoCARE_train.py --conf train_config.json):

================================
2022-08-31 11:33:43.111390: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
0 1
1 72
2 72
3 72
4 1
2022-08-31 11:33:43.730687: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-08-31 11:33:43.731272: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
=================================

From this output, it looks like cryoCARE is training on the CPU instead of the GPU, which is why it is so slow.
My tomogram size is 672 × 672 × 200.

Any idea about this issue?
Thanks!
Yunjie
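For reference, the GPU-related lines that TensorFlow 2.x prints can be pulled out of a saved log with a small script. This is a minimal sketch, not part of cryoCARE: the function name `summarize_tf_log` is hypothetical, and the patterns are assumptions based only on the log lines quoted in this thread. In the log above there are no "Found device" lines at all, which is consistent with (but does not prove) TensorFlow not seeing a GPU.

```python
import re

# Patterns taken from the TensorFlow 2.x log lines quoted in this thread (assumption:
# other TensorFlow versions may word these messages differently).
FOUND_DEVICE = re.compile(r"Found device (\d+) with properties")
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
SKIPPED = "Skipping registering GPU devices"


def summarize_tf_log(text: str) -> dict:
    """Heuristically summarize the GPU-related lines of a TensorFlow log."""
    # Indices of GPUs that TensorFlow reported while probing the hardware.
    devices = sorted({int(idx) for idx in FOUND_DEVICE.findall(text)})
    # Shared libraries TensorFlow tried and failed to dlopen.
    missing = MISSING_LIB.findall(text)
    return {
        "gpus_found": devices,
        "missing_libraries": missing,
        "gpu_registration_skipped": SKIPPED in text,
    }
```

Usage: `summarize_tf_log(open("train.log").read())`, where `train.log` is whatever file the training output was redirected to.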


tibuch · 2022-09-12T08:20:40Z

Hi Yunjie,

Does TensorFlow see the GPU on the cluster node where you are running the training? I would recommend starting an interactive cluster session and checking whether the GPU is available with nvidia-smi. Then check whether the installed CUDA version is compatible with your TensorFlow installation, and finally run the installation verification code from the TensorFlow install instructions (https://www.tensorflow.org/install/pip):

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
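The three checks above can also be collected into one report. This is a hedged sketch, not a cryoCARE or TensorFlow tool: the function `gpu_report` is hypothetical, it assumes TensorFlow 2.x (`tf.sysconfig.get_build_info` only exists from 2.3 on, so it is looked up defensively), and it simply returns an incomplete report if `nvidia-smi` or TensorFlow is missing.

```python
import shutil
import subprocess


def gpu_report() -> dict:
    """Hypothetical helper: run the nvidia-smi / CUDA / TensorFlow checks in one go."""
    report = {"nvidia_smi": None, "tensorflow_installed": False}

    # Step 1: does the driver see any GPUs? `nvidia-smi -L` prints one line per GPU.
    if shutil.which("nvidia-smi"):
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
        report["nvidia_smi"] = out.stdout.strip().splitlines()

    # Steps 2 and 3: which CUDA/cuDNN was TensorFlow built against, and does it see a GPU?
    try:
        import tensorflow as tf
    except ImportError:
        return report
    report["tensorflow_installed"] = True
    report["tf_version"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)  # TF >= 2.3 only
    if get_build_info is not None:
        info = get_build_info()
        # These keys are absent on CPU-only builds, hence .get().
        report["cuda_version"] = info.get("cuda_version")
        report["cudnn_version"] = info.get("cudnn_version")
    report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    print(gpu_report())
```

If `nvidia_smi` lists GPUs but `gpus` is empty, the driver is fine and the problem is more likely on the TensorFlow/CUDA library side.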

Cheers!

tailinhua16 · 2023-08-22T16:31:06Z

Hi Tim,
I've encountered a similar issue where cryoCARE doesn't use the GPU. I'm using a workstation instead of a cluster. When I run the verification code you mentioned, the output is:

2023-08-23 00:26:26.044988: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-08-23 00:26:27.342214: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-08-23 00:26:27.343332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-08-23 00:26:27.374010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.374707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:1b:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.375389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:3d:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.376012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:3e:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.376638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 4 with properties:
pciBusID: 0000:88:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.377264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 5 with properties:
pciBusID: 0000:89:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.377868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 6 with properties:
pciBusID: 0000:b1:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.378519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 7 with properties:
pciBusID: 0000:b2:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.378561: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-08-23 00:26:27.382425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/linhua/Programs/anaconda3/envs/cryocare_11/bin/../lib/libcublas.so.11: symbol free_gemm_select, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/local/cuda-11.8/lib64:/opt/OpenMPI/lib:/opt/OpenMPI/lib::/usr/local/cuda-10.0/lib64
2023-08-23 00:26:27.384977: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-08-23 00:26:27.386268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-08-23 00:26:27.386513: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-08-23 00:26:27.389644: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-08-23 00:26:27.390326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-08-23 00:26:27.390447: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-08-23 00:26:27.390471: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

I'm using cryocare_11. Any idea how to solve this problem?
Thank you very much in advance!
Yours,
Linhua Tai
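The warning in the log above reports a symbol error in `libcublasLt.so.11`, and the printed LD_LIBRARY_PATH contains several system CUDA directories (for example `/usr/local/cuda/lib64`, `/usr/local/cuda-10.0/lib64` and `/usr/local/cuda-11.8/lib64`) alongside the conda environment. One way to see which copies of a library could be picked up is to list every entry on that path that contains it. This is a hedged sketch: `find_library` is a hypothetical helper, not part of cryoCARE or TensorFlow, and whether a mismatched copy is actually the cause here is not established in this thread.

```python
from pathlib import Path


def find_library(name: str, search_path: str) -> list:
    """Return the full path of every copy of `name` on a colon-separated search path."""
    hits = []
    for entry in search_path.split(":"):
        if not entry:
            # Empty entries (like the '::' in the LD_LIBRARY_PATH above) are skipped in this sketch.
            continue
        candidate = Path(entry) / name
        # Keep the first occurrence only; the path above lists some directories twice.
        if candidate.exists() and str(candidate) not in hits:
            hits.append(str(candidate))
    return hits
```

Usage: `find_library("libcublasLt.so.11", os.environ.get("LD_LIBRARY_PATH", ""))`, and compare the result with the copy in the environment's own `lib` directory (`/home/linhua/Programs/anaconda3/envs/cryocare_11/lib` according to the log).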
