A collection of notes for AI engineering & deployment services
-
- 1. NVIDIA
- 2. Deepstream
- 3. Triton Inference Server
- 4. TAO Toolkit (Transfer-Learning-Toolkit)
-
-
Build OpenCV from source
-
Install Math Kernel Library (MKL/BLAS/LAPACK/OPENBLAS)
It is recommended to install all the math kernel libraries first, then compile frameworks (e.g. PyTorch, MXNet) from source with a custom config for optimization.
Install all LAPACK+BLAS:
sudo apt install libjpeg-dev libpng-dev libblas-dev libopenblas-dev libatlas-base-dev liblapack-dev liblapacke-dev gfortran
Install MKL:
# Get the key
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
# Install the key
sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
# Remove the downloaded key file
rm GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
# Add the repositories to apt
sudo wget https://apt.repos.intel.com/setup/intelproducts.list -O /etc/apt/sources.list.d/intelproducts.list
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
# Install
sudo apt-get update
sudo apt-get install intel-mkl-2020.4-912
-
Fresh install NVIDIA driver (PC/Laptop/Workstation)
# Remove old packages
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get install ubuntu-desktop
sudo apt-get --purge remove "*cublas*" "cuda*"
sudo apt-get --purge remove "*nvidia*"
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
sudo rm /etc/X11/xorg.conf
sudo apt autoremove
sudo reboot
# After restart
sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot
-
Install CuDNN
Install keyring: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#network-repo-installation-for-ubuntu
Install CuDNN 9 with CUDA 11:
sudo apt-get update
sudo apt-get -y install cudnn9-cuda-11
-
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
First, make sure you have done a fresh NVIDIA driver install (see "Fresh install NVIDIA driver" above). If that does not work, try the steps below.
- Make sure the package nvidia-prime is installed:
sudo apt install nvidia-prime
Afterwards, run
sudo prime-select nvidia
- Make sure that NVIDIA is not in blacklist
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
to find a file containing
blacklist nvidia
and remove it, then run
sudo update-initramfs -u
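As a sketch of what that search does, run here against a temporary directory rather than the real modprobe paths so it is safe to try anywhere:

```shell
# Demo: find config files that blacklist the nvidia module.
# A temp dir stands in for /etc/modprobe.d and /lib/modprobe.d.
dir=$(mktemp -d)
echo "blacklist nvidia" > "$dir/blacklist-custom.conf"
grep -Rl "blacklist nvidia" "$dir"   # prints the path of the offending file
rm -r "$dir"
```

On a real system, any file printed by the `grep` over `/etc/modprobe.d/*` and `/lib/modprobe.d/*` is the one to remove before regenerating the initramfs.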
Get boot log
journalctl -b | grep NVIDIA
- If you get the error
This PCI I/O region assigned to your NVIDIA device is invalid
open the GRUB config
sudo nano /etc/default/grub
and edit the line to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc=off"
sudo update-grub
sudo reboot
-
Check current CUDA version
nvcc --version
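If only the version number is needed, e.g. inside a script, it can be parsed out of the nvcc output; the sample below is a captured example so the parsing can be shown without a CUDA machine, and the "release X.Y," line format is the assumption:

```shell
# Extract the CUDA release number from `nvcc --version` output.
cat <<'EOF' | sed -n 's/.*release \([0-9.]*\),.*/\1/p'
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.3, V11.3.109
EOF
# prints: 11.3
```

On a real machine, replace the `cat <<'EOF' ... EOF` with `nvcc --version`.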
-
Check currently installed CUDA versions
ls /usr/local/
-
Select GPU devices
CUDA_VISIBLE_DEVICES=<index-of-devices> <command>
CUDA_VISIBLE_DEVICES=0 python abc.py
CUDA_VISIBLE_DEVICES=0 ./sample.sh
CUDA_VISIBLE_DEVICES=0,1,2,3 python abc.py
CUDA_VISIBLE_DEVICES=0,1,2,3 ./sample.sh
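CUDA_VISIBLE_DEVICES is an ordinary environment variable that the CUDA runtime reads at process start, so prefixing it to a command scopes it to that one process without affecting the rest of the shell; a minimal demonstration:

```shell
# The variable exists only for the one child process it is prefixed to.
CUDA_VISIBLE_DEVICES=0,2 sh -c 'echo "child sees: $CUDA_VISIBLE_DEVICES"'
# prints: child sees: 0,2
echo "parent sees: ${CUDA_VISIBLE_DEVICES:-<unset>}"
```

Inside the child, device indices are also renumbered from 0, so physical GPUs 0 and 2 appear to the framework as devices 0 and 1.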
-
Switch CUDA version
CUDA_VER=11.3
export PATH="/usr/local/cuda-$CUDA_VER/bin:$PATH"
export LD_LIBRARY_PATH=/usr/local/cuda-$CUDA_VER/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
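The two exports can be wrapped in a small function for ~/.bashrc; `use_cuda` below is a made-up convenience name, not a standard tool:

```shell
# Switch the active CUDA toolkit for the current shell session.
use_cuda() {
  ver="$1"
  export PATH="/usr/local/cuda-$ver/bin:$PATH"
  export LD_LIBRARY_PATH="/usr/local/cuda-$ver/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
use_cuda 11.3
echo "$PATH" | grep -o 'cuda-11.3/bin'   # confirms the toolkit dir is on PATH
```

Note this only changes the current shell; run `nvcc --version` afterwards to confirm the switch took effect.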
-
Check NVENC/NVDEC status
nvidia-smi dmon
see the %enc and %dec columns
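Since dmon output is plain whitespace-separated columns, the encoder/decoder load can also be pulled out with awk. The sample below is illustrative captured output, and the column positions (enc = 7th field, dec = 8th) match dmon's default layout but should be checked against your driver version:

```shell
# Print per-GPU encoder/decoder utilization from `nvidia-smi dmon` output.
# A captured sample is piped in; on a real machine pipe dmon in instead.
cat <<'EOF' | awk '!/^#/ {print "gpu " $1 ": enc=" $7 "% dec=" $8 "%"}'
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    55    49     -    12     7     3     0  5000  1500
EOF
# prints: gpu 0: enc=3% dec=0%
```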
-
Distributed training with NCCL freezes
export NCCL_P2P_DISABLE="1"
-
Broken pipe (Distributed training with NCCL)
Run training with the args
NCCL_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO torchrun ...
to find the socket interface name (e.g. eno1) in the log:
NCCL INFO NET/IB : No device found.
rnd3:77634:79720 [0] NCCL INFO NET/Socket : Using [0]eno1:10.9.3.241<0>
rnd3:77634:79720 [0] NCCL INFO Using network Socket
On the other nodes, run with the arg
NCCL_SOCKET_IFNAME=eno1
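To see which interface names exist on a node in the first place (so you know what to put in NCCL_SOCKET_IFNAME), listing the kernel's network device directory is enough:

```shell
# Every network interface (eno1, eth0, lo, ...) appears as an entry here.
ls /sys/class/net
```

Pick the interface that carries the IP address the nodes use to reach each other, not the loopback `lo`.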
-
Install CMake from source
version=3.23
build=2
## don't modify from here
mkdir ~/temp
cd ~/temp
wget https://cmake.org/files/v$version/cmake-$version.$build.tar.gz
tar -xzvf cmake-$version.$build.tar.gz
cd cmake-$version.$build/
./bootstrap
make -j8
sudo make install
-
Install NCCL Backend (Distributed training)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt install libnccl2 libnccl-dev
-
Install MXNet from source
git clone --recursive --branch 1.9.1 https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet
cp config/linux_gpu.cmake config.cmake
rm -rf build
mkdir -p build && cd build
cmake -DUSE_CUDA=ON -DUSE_CUDNN=OFF -DUSE_MKL_IF_AVAILABLE=OFF -DUSE_MKLDNN=OFF -DUSE_OPENMP=OFF -DUSE_OPENCV=ON -DUSE_BLAS=open ..
make -j32
cd ../python
pip install --user -e .
-
Tensorflow could not load dynamic library 'cudart64_101.dll'
In the example above, TensorFlow requires CUDA 10.1: either switch to CUDA 10.1 or change to a TensorFlow version compatible with your installed CUDA version; check the compatibility table here: https://www.tensorflow.org/install/source#gpu
-
Fix Deepstream (6.2+) FFMPEG OpenCV installation
Fixes errors about undefined references to, or missing, libavcodec, libavutil, libvpx, etc.:
apt-get install --reinstall --no-install-recommends -y libavcodec58 libavcodec-dev libavformat58 libavformat-dev libavutil56 libavutil-dev gstreamer1.0-libav
apt install --reinstall gstreamer1.0-plugins-good
apt install --reinstall libvpx6 libx264-155 libx265-179 libmpg123-0 libmpeg2-4 libmpeg2encpp-2.1-0
gst-inspect-1.0 | grep 264
rm ~/.cache/gstreamer-1.0/registry.x86_64.bin
apt install --reinstall libx264-155
apt-get install gstreamer1.0-libav
apt-get install --reinstall gstreamer1.0-plugins-ugly
-
Gstreamer pipeline to convert MP4-MP4 with re-encoding
gst-launch-1.0 filesrc location="<path-to-input>" ! qtdemux ! video/x-h264 ! h264parse ! avdec_h264 ! videoconvert ! x264enc ! h264parse ! qtmux ! filesink location=<path-to-output>
-
Gstreamer pipeline to convert RTSP-RTMP
gst-launch-1.0 rtspsrc location='rtsp://<path-to-rtsp-input>' ! rtph264depay ! h264parse ! flvmux ! rtmpsink location='rtmp://<path-to-rtmp-output>'
-
Gstreamer pipeline to convert RTSP-RTMP with reducing resolution
gst-launch-1.0 rtspsrc location='rtsp://<path-to-rtsp-input>' ! rtpbin ! rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! videoscale ! video/x-raw,width=640,height=640 ! x264enc ! h264parse ! flvmux streamable=true ! rtmpsink location='rtmp://<path-to-rtmp-output>'
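When these pipelines get long, it can help to build the command in a small wrapper and print it before running, to catch quoting mistakes; `rtsp_to_rtmp_cmd` below is a hypothetical helper, not a gstreamer tool:

```shell
# Print the gst-launch command that would be run for an RTSP -> RTMP relay.
rtsp_to_rtmp_cmd() {
  printf "gst-launch-1.0 rtspsrc location='%s' ! rtph264depay ! h264parse ! flvmux ! rtmpsink location='%s'\n" "$1" "$2"
}
rtsp_to_rtmp_cmd "rtsp://<path-to-rtsp-input>" "rtmp://<path-to-rtmp-output>"
```

Once the printed command looks right, run it directly (or replace `printf` with the real `gst-launch-1.0` invocation).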
-