Not running with GPU #25

Open
YunjieChang opened this issue Aug 31, 2022 · 2 comments

Comments

@YunjieChang

YunjieChang commented Aug 31, 2022

Hi Tim,

I just installed cryoCARE on our HPC cluster following the "For CUDA 10" installation procedure and did not encounter any errors during installation.

However, I got the following output when I ran the training step (cryoCARE_train.py --conf train_config.json):

================================
2022-08-31 11:33:43.111390: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
0 1
1 72
2 72
3 72
4 1
2022-08-31 11:33:43.730687: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-08-31 11:33:43.731272: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2200000000 Hz
=================================

This output suggests that cryoCARE is running the training on the CPU rather than the GPU, which makes it quite slow.
My tomogram size is 672 × 672 × 200.

Any idea about this issue?
Thanks!
Yunjie

@tibuch
Collaborator

tibuch commented Sep 12, 2022

Hi Yunjie,

Does TensorFlow see the GPU on the cluster node where you are running the training? I would recommend starting an interactive cluster session and checking whether the GPU is available with nvidia-smi. Then check that the installed CUDA version is compatible with your TensorFlow installation. Finally, run the TensorFlow installation verification code from their install instructions (https://www.tensorflow.org/install/pip):

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Cheers!
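The three checks suggested above (nvidia-smi, CUDA/TensorFlow compatibility, and the TensorFlow device listing) can be collected in one small script. This is only a rough sketch, not part of cryoCARE: the function name gpu_diagnostics and the report keys are made up for illustration, and it assumes a TensorFlow 2.x installation. If TensorFlow cannot be imported, the TensorFlow fields simply stay None.

```python
import shutil
import subprocess


def gpu_diagnostics():
    """Collect basic facts about GPU visibility (hypothetical helper)."""
    report = {
        "nvidia_smi_found": False,   # is the nvidia-smi tool on PATH?
        "nvidia_smi_ok": False,      # did `nvidia-smi -L` exit cleanly?
        "tensorflow_version": None,  # stays None if TensorFlow is not importable
        "built_with_cuda": None,     # was this TensorFlow build compiled with CUDA?
        "tf_gpus": None,             # GPU devices TensorFlow can actually register
    }

    # Step 1: check that the driver tool exists and can list the GPUs.
    exe = shutil.which("nvidia-smi")
    if exe is not None:
        report["nvidia_smi_found"] = True
        try:
            result = subprocess.run(
                [exe, "-L"], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass

    # Steps 2 and 3: ask TensorFlow how it was built and what it can see.
    try:
        import tensorflow as tf
    except ImportError:
        return report

    report["tensorflow_version"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    report["tf_gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If nvidia-smi works but tf_gpus comes back empty, the problem is probably on the CUDA library side rather than with the hardware or driver. Run it inside the same environment and on the same node as the training job.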

@tailinhua16

Hi Tim,
I've encountered a similar issue where cryoCARE doesn't use the GPU. I'm using a workstation instead of a cluster. When I ran the verification code you mentioned, the output was:

2023-08-23 00:26:26.044988: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-08-23 00:26:27.342214: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-08-23 00:26:27.343332: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-08-23 00:26:27.374010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:1a:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.374707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:1b:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.375389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:3d:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.376012: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:3e:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.376638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 4 with properties:
pciBusID: 0000:88:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.377264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 5 with properties:
pciBusID: 0000:89:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.377868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 6 with properties:
pciBusID: 0000:b1:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.378519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 7 with properties:
pciBusID: 0000:b2:00.0 name: Quadro RTX 5000 computeCapability: 7.5
coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 15.74GiB deviceMemoryBandwidth: 417.29GiB/s
2023-08-23 00:26:27.378561: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-08-23 00:26:27.382425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/linhua/Programs/anaconda3/envs/cryocare_11/bin/../lib/libcublas.so.11: symbol free_gemm_select, version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-8.0/lib64:/usr/local/cuda/lib64:/usr/local/cuda-11.8/lib64:/opt/OpenMPI/lib:/opt/OpenMPI/lib::/usr/local/cuda-10.0/lib64
2023-08-23 00:26:27.384977: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-08-23 00:26:27.386268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-08-23 00:26:27.386513: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-08-23 00:26:27.389644: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-08-23 00:26:27.390326: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-08-23 00:26:27.390447: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-08-23 00:26:27.390471: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]

I'm using the cryocare_11 environment. Any idea how to solve this problem?
Thank you very much in advance!
Yours,
Linhua Tai
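Long TensorFlow logs like the one above are easier to read once the failed library loads are pulled out. The sketch below is a hypothetical helper (summarize_tf_log is not a TensorFlow or cryoCARE function) that scans a saved log for "Could not load dynamic library" warnings and lists the versioned CUDA directories found in the LD_LIBRARY_PATH printed with them. It assumes the log format shown in this thread and uses only the Python standard library.

```python
import re

# Patterns based on the log lines quoted in this thread (an assumption about format).
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
LIB_PATH = re.compile(r"LD_LIBRARY_PATH: (\S+)")
CUDA_DIR = re.compile(r"/cuda-(\d+(?:\.\d+)*)/")


def summarize_tf_log(log_text):
    """Return failed libraries and CUDA versions seen on LD_LIBRARY_PATH."""
    failed = []
    cuda_versions = []
    for line in log_text.splitlines():
        missing = MISSING_LIB.search(line)
        if not missing:
            continue
        failed.append(missing.group(1))

        # The warning line also prints the search path in effect at that moment.
        path = LIB_PATH.search(line)
        if path:
            for entry in path.group(1).split(":"):
                # Append "/" so a path ending in ".../cuda-10.0" still matches.
                version = CUDA_DIR.search(entry + "/")
                if version and version.group(1) not in cuda_versions:
                    cuda_versions.append(version.group(1))

    return {
        "failed_libraries": failed,
        "cuda_versions_on_path": cuda_versions,
        "gpu_skipped": "Skipping registering GPU devices" in log_text,
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < tensorflow_output.log
    print(summarize_tf_log(sys.stdin.read()))
```

Applied to the log quoted above, this should report libcublas.so.11 as the only failed library, with CUDA 10.0, 9.1, 8.0 and 11.8 directories on the search path. One possible reading, not confirmed in this thread, is that a system-wide CUDA directory on LD_LIBRARY_PATH shadows the libcublasLt.so.11 shipped in the conda environment, which would explain the undefined-symbol error. Re-running the verification command with LD_LIBRARY_PATH unset in that shell would be one way to test this.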
