Add docker image for icefall #1189

Merged: 32 commits into k2-fsa:master on Jul 28, 2023

Conversation

@csukuangfj (Collaborator) commented Jul 27, 2023

Usage

docker run --gpus all --rm -it  k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash

It would be great if someone could test it.
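
For anyone testing with local data, a minimal sketch of mounting a host directory into the container; the host path and mount point below are placeholders, not something defined by this image:

# /path/to/data and /workspace/data are hypothetical; adjust to your setup
docker run --gpus all --rm -it \
  -v /path/to/data:/workspace/data \
  k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash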

@teowenshen (Contributor)

Somehow the CUDA version inside the container shows an error for me.

I am not sure about the difference between --runtime=nvidia and --gpus=all, but I tried all three combinations.

teo@s64:~$ nvidia-smi
Thu Jul 27 20:43:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   49C    P8    30W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
teo@s64:~$ docker run --rm -it  k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
root@ae8acaf91ea0:/workspace/icefall# nvidia-smi
bash: nvidia-smi: command not found
root@ae8acaf91ea0:/workspace/icefall# exit
teo@s64:~$ docker run --rm -it --gpus=all k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
root@c5f8c3bc2883:/workspace/icefall# nvidia-smi
Thu Jul 27 11:43:39 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: ERR!     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   50C    P8    42W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+
root@c5f8c3bc2883:/workspace/icefall# exit
teo@s64:~$ docker run --rm -it --runtime=nvidia k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
root@1cb84e2f0eaa:/workspace/icefall# nvidia-smi
Thu Jul 27 11:43:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: ERR!     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   49C    P8    30W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+
root@1cb84e2f0eaa:/workspace/icefall# exit
teo@s64:~$ docker run --rm -it --runtime=nvidia --gpus=all k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
root@bcc29879157e:/workspace/icefall# nvidia-smi
Thu Jul 27 11:44:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: ERR!     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   49C    P8    32W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+
root@bcc29879157e:/workspace/icefall# exit

With this error, I tried running decode.py. The warning below came up and the GPU was not utilized.

/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /opt/conda/conda-bld/pytorch_1666643016022/work/c10/cuda/CUDAFunctions.cpp:109.)
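
A hedged diagnostic sketch for this kind of warning (not commands actually run in this thread): list every libcuda.so.1 visible inside the container and ask PyTorch whether it can see the GPU.

find / -name "libcuda.so.1" 2>/dev/null   # which libcuda.so.1 files are visible in the container?
python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"   # can PyTorch see the GPU?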

@danpovey (Collaborator)

Maybe you need to run with nvidia-docker instead of docker if you're planning to use the GPU?

@teowenshen (Contributor)

I tried with nvidia-docker, but it's the same.
In the docker image ls output below, icefall is the image that I built locally using the existing docker build file, while k2fsa/icefall is the newly downloaded image.
My local image could detect the GPU, while the downloaded image could not.

There is a size difference between my image and the new one. That is partly because I installed sherpa into my image, but I am not sure whether the entire 4.2 GB gap comes from sherpa.
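
For the size question, docker history is a generic way to see which layers account for the difference; a sketch, not something run in this thread:

docker history k2fsa/icefall:torch1.13.0-cuda11.6   # per-layer sizes of the downloaded image
docker history icefall:latest                       # per-layer sizes of the locally built image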

teo@s64:~$ nvidia-docker run --rm -it  k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
root@84b25aedbe2d:/workspace/icefall# nvidia-smi
Thu Jul 27 13:17:12 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: ERR!     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   51C    P8    31W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+
root@84b25aedbe2d:/workspace/icefall# exit
teo@s64:~$ docker image ls
REPOSITORY                    TAG                               IMAGE ID       CREATED         SIZE
k2fsa/icefall                 torch1.13.0-cuda11.6              3b0e967ec78a   2 hours ago     12.1GB
icefall                       latest                            3be62493eab5   5 days ago      16.3GB
...
[omitted]
...
teo@s64:~$ nvidia-docker run -it --rm icefall bash
root@2e812284fec7:/workspace/icefall# nvidia-smi
Thu Jul 27 13:17:37 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   50C    P8    30W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+

@csukuangfj (Collaborator, Author) commented Jul 27, 2023

@teowenshen

Thanks for testing it.

The icefall docker image is based on

FROM pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime

Could you test pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime and see if it works with nvidia-smi?

If not, could you try

pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel 

Also, are there any warnings from

python3 -m torch.utils.collect_env
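
Put together, a sketch of the requested checks, assuming GPU access is passed through with --gpus all (or nvidia-docker):

docker run --rm --gpus all pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime nvidia-smi
docker run --rm --gpus all pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel nvidia-smi
docker run --rm --gpus all k2fsa/icefall:torch1.13.0-cuda11.6 python3 -m torch.utils.collect_env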

@teowenshen (Contributor)

@csukuangfj

Yeah, in my environment nvidia-smi runs normally in the base images.

teo@s64:~$ nvidia-docker run -it --rm pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
Unable to find image 'pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime' locally
1.13.0-cuda11.6-cudnn8-runtime: Pulling from pytorch/pytorch
a404e5416296: Already exists 
d70bbcbd9fa5: Already exists 
2f8d87f6e9b5: Already exists 
f0869fc58250: Already exists 
Digest: sha256:8711d55e2b5c42f3c070e1f2bacc2d1988c9b3b5b99694abc6691a852536efbe
Status: Downloaded newer image for pytorch/pytorch:1.13.0-cuda11.6-cudnn8-runtime
root@9f5ea5b4a8de:/workspace# nvidia-smi
Thu Jul 27 15:40:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   50C    P8    30W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+
root@9f5ea5b4a8de:/workspace# exit
teo@s64:~$ nvidia-docker run -it --rm pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel
Unable to find image 'pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel' locally
1.13.0-cuda11.6-cudnn8-devel: Pulling from pytorch/pytorch
a404e5416296: Already exists 
c58c079e9b17: Pull complete 
e5b80b8bbe91: Pull complete 
888240790290: Pull complete 
515fe5e34eb4: Pull complete 
4e4521f12f5a: Pull complete 
f6e1a56cb32d: Pull complete 
c29b96e36bd0: Pull complete 
304d3c6c28d0: Pull complete 
20f82224b265: Pull complete 
031e73b7201f: Pull complete 
80568f2c07b0: Pull complete 
2ae0d162c09b: Pull complete 
Digest: sha256:d98a1b1f61166875882e5a3ffa63bdef89c3349ceca1954dda415c5cd67e06a0
Status: Downloaded newer image for pytorch/pytorch:1.13.0-cuda11.6-cudnn8-devel
root@9a7a885b4cc9:/workspace# nvidia-smi
Thu Jul 27 15:45:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:18:00.0 Off |                  N/A |
|  0%   51C    P8    42W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1758      G                                       4MiB |
+-----------------------------------------------------------------------------+
root@9a7a885b4cc9:/workspace# exit

This is what python3 -m torch.utils.collect_env returned in the newly downloaded icefall container. The error is similar to the warning from decode.py.

teo@s64:~$ nvidia-docker run --rm -it  k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
root@18070e2baeb4:/workspace/icefall# python -m torch.utils.collect_env
Collecting environment information...
/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py:88: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /opt/conda/conda-bld/pytorch_1666643016022/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.13.0
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
Is CUDA available: False
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.47.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] k2==1.24.3.dev20230725+cuda11.6.torch1.13.0
[pip3] kaldifeat==1.25.0.dev20230726+cuda11.6.torch1.13.0
[pip3] numpy==1.22.3
[pip3] torch==1.13.0
[pip3] torchaudio==0.13.0+cu116
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0
[conda] blas                      1.0                         mkl  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] k2                        1.24.3.dev20230725+cuda11.6.torch1.13.0          pypi_0    pypi
[conda] kaldifeat                 1.25.0.dev20230726+cuda11.6.torch1.13.0          pypi_0    pypi
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0            py39h7f8727e_0  
[conda] mkl_fft                   1.3.1            py39hd3c417c_0  
[conda] mkl_random                1.2.2            py39h51133e4_0  
[conda] numpy                     1.22.3           py39he7a7128_0  
[conda] numpy-base                1.22.3           py39hf524024_0  
[conda] pytorch                   1.13.0          py3.9_cuda11.6_cudnn8.3.2_0    pytorch
[conda] pytorch-cuda              11.6                 h867d48c_0    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.13.0+cu116             pypi_0    pypi
[conda] torchtext                 0.14.0                     py39    pytorch
[conda] torchvision               0.14.0               py39_cu116    pytorch

@csukuangfj (Collaborator, Author)

Could you remove the following line from the icefall docker image
https://github.com/csukuangfj/icefall/blob/130ac92d05e11a502dd1d7dca9ea7c76e5a855d8/docker/torch1.13.0-cuda11.6-cudnn8.dockerfile#L47
and retry?

RUN cd /opt/conda/lib/stubs && ln -s libcuda.so libcuda.so.1

That line was added to fix a GitHub Actions test error saying that libcuda.so.1 cannot be found when running python3 -m k2.version.

I think libcuda.so.1 is provided by the host.


Note: You don't need to rebuild the icefall docker image. Just start the container, delete that file inside the container, and re-run nvidia-smi.
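
In concrete terms, a sketch of that check:

docker run --gpus all --rm -it k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
# inside the container: remove the stub symlink, then retest
rm /opt/conda/lib/stubs/libcuda.so.1
nvidia-smi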

@teowenshen (Contributor)

You are right! I deleted /opt/conda/lib/stubs/libcuda.so.1 and nvidia-smi worked normally. Below is the output from python -m k2.version.

root@d8cf0e860304:~# python -m k2.version
Collecting environment information...

k2 version: 1.24.3
Build type: Release
Git SHA1: 4c05309499a08454997adf500b56dcc629e35ae5
Git date: Tue Jul 25 16:23:36 2023
Cuda used to build k2: 11.6
cuDNN used to build k2: 8.3.2
Python version used to build k2: 3.9
OS used to build k2: CentOS Linux release 7.9.2009 (Core)
CMake version: 3.27.0
GCC version: 9.3.1
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_35,code=sm_35  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 1.13.0+cu116
PyTorch is using Cuda: 11.6
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /opt/conda/lib/python3.9/site-packages/k2/version/version.py
_k2.__file__: /opt/conda/lib/python3.9/site-packages/_k2.cpython-39-x86_64-linux-gnu.so

I also ran decode.py and can confirm that the GPU is being utilized.

@csukuangfj (Collaborator, Author)

@teowenshen

Thank you for testing it.

I have fixed the Dockerfile, and everything should work as expected.

@csukuangfj merged commit 751bb6f into k2-fsa:master on Jul 28, 2023
3 checks passed
@csukuangfj deleted the docker branch on July 28, 2023 02:34
@teowenshen (Contributor)

@csukuangfj

Sorry to reuse this PR, but just a quick note that there is a typo in the link.

[screenshot showing the typo in the link]

@csukuangfj (Collaborator, Author)

> Sorry to reuse this PR, but just a quick note that there is a typo in the link.

Thanks! Just fixed it.
