NNDam/AI-Engineer-Note

AI-Engineer-Note

A collection of notes for AI engineering and service deployment

  • Deeplearning

  • Frameworks

  • Deploy

  • Linux & CUDA & APT-Packages

    • Build OpenCV from source
    • Install Math Kernel Library (MKL/BLAS/LAPACK/OpenBLAS)
      It is recommended to install all of the math kernel libraries first, then compile the framework (e.g. PyTorch, MXNet) from source with a custom configuration for optimization.
      Install LAPACK and BLAS:
      sudo apt install libjpeg-dev libpng-dev libblas-dev libopenblas-dev libatlas-base-dev liblapack-dev liblapacke-dev gfortran 
      

      Install MKL:

      # Download the Intel repository signing key
      wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
      # Register the key with apt (requires root)
      sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
      # Remove the downloaded key file
      rm GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
      # Add the Intel repositories to apt
      sudo wget https://apt.repos.intel.com/setup/intelproducts.list -O /etc/apt/sources.list.d/intelproducts.list
      sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
      # Install
      sudo apt-get update
      sudo apt-get install intel-mkl-2020.4-912
      
    • Fresh install NVIDIA driver (PC/Laptop/Workstation)
      # Remove old packages
      sudo apt-get remove --purge '^nvidia-.*'
      sudo apt-get install ubuntu-desktop
      sudo apt-get --purge remove "*cublas*" "cuda*"
      sudo apt-get --purge remove "*nvidia*"
      sudo add-apt-repository --remove ppa:graphics-drivers/ppa
      sudo rm /etc/X11/xorg.conf
      sudo apt autoremove
      sudo reboot
      
      # After restart
      sudo ubuntu-drivers devices
      sudo ubuntu-drivers autoinstall
      sudo reboot
      
    • Install CuDNN

      Install the CUDA keyring first: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#network-repo-installation-for-ubuntu
      Then install cuDNN 9 for CUDA 11:

      sudo apt-get update
      sudo apt-get -y install cudnn9-cuda-11
      
    • NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver

      First, make sure you have followed "Fresh install NVIDIA driver" above. If that does not work, try the steps below.

      • Make sure the package nvidia-prime is installed:
      sudo apt install nvidia-prime
      

      Afterwards, run

      sudo prime-select nvidia
      
      • Make sure that the NVIDIA module is not blacklisted. Run
      grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
      

      to find any file containing "blacklist nvidia". Remove that file, then run

      sudo update-initramfs -u
      

      Check the boot log for NVIDIA messages:

      journalctl -b | grep NVIDIA
      
      • If you get the error "This PCI I/O region assigned to your NVIDIA device is invalid":
      sudo nano /etc/default/grub
      

      Edit the line to read GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=realloc=off", then run

      sudo update-grub
      sudo reboot
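
For the blacklist check in the item above, a plain `grep nvidia` also matches commented-out lines and unrelated options. The sketch below narrows the search to active `blacklist nvidia...` lines. The function name `find_nvidia_blacklist` is hypothetical and not part of any package; the two default directories are the ones used above.

```shell
# Hypothetical helper: list modprobe config files that actively blacklist
# an nvidia module. Pass directories as arguments, or none to use the
# two default locations from the note above.
find_nvidia_blacklist() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/modprobe.d /lib/modprobe.d
    fi
    # Match "blacklist nvidia..." at the start of a line (leading
    # whitespace allowed), so lines commented out with '#' are skipped.
    grep -rlE '^[[:space:]]*blacklist[[:space:]]+nvidia' "$@" 2>/dev/null
}
```

Running `find_nvidia_blacklist` prints one file path per line and returns a non-zero status when nothing matches. Review each file before removing it, then run `sudo update-initramfs -u` as described above.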
      
    • Check current CUDA version
      nvcc --version
      
    • Check installed CUDA versions (each toolkit appears as a cuda-<version> directory)
      ls /usr/local/
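
As a small sketch of the listing above, the function below prints only the version numbers, assuming each toolkit lives in a `cuda-<version>` directory under `/usr/local`. The name `list_cuda_versions` and its optional directory argument are assumptions made for illustration.

```shell
# Hypothetical helper: print the version of every cuda-<version> directory
# under a prefix (default /usr/local), one per line.
list_cuda_versions() {
    root="${1:-/usr/local}"
    for d in "$root"/cuda-*/; do
        # If nothing matches, the glob stays literal and is skipped here.
        [ -d "$d" ] || continue
        v="${d%/}"                         # drop the trailing slash
        printf '%s\n' "${v##*/cuda-}"      # keep only the version part
    done
}
```

The unversioned `/usr/local/cuda` symlink, if present, is not listed because it does not match the `cuda-*` pattern.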
      
    • Select GPU devices
      CUDA_VISIBLE_DEVICES=<index-of-devices> <command>
      CUDA_VISIBLE_DEVICES=0 python abc.py
      CUDA_VISIBLE_DEVICES=0 ./sample.sh
      CUDA_VISIBLE_DEVICES=0,1,2,3 python abc.py
      CUDA_VISIBLE_DEVICES=0,1,2,3 ./sample.sh
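
The prefix form above applies the variable to one command only. A thin wrapper can add a basic check on the device list, which catches mistakes such as a stray space in `0, 1`. This is a minimal sketch: `with_gpus` is a hypothetical name, and it accepts numeric indices only, although CUDA also accepts GPU UUIDs in this variable.

```shell
# Hypothetical wrapper: run a command with CUDA_VISIBLE_DEVICES set to a
# validated, comma-separated list of numeric device indices.
with_gpus() {
    gpus="$1"
    shift
    case "$gpus" in
        ''|*[!0-9,]*)
            echo "with_gpus: expected comma-separated indices, got '$gpus'" >&2
            return 2
            ;;
    esac
    # env limits the setting to the launched command; the calling shell
    # is left unchanged.
    env CUDA_VISIBLE_DEVICES="$gpus" "$@"
}
```

For example, `with_gpus 0,1 python abc.py` behaves like `CUDA_VISIBLE_DEVICES=0,1 python abc.py`.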
      
    • Switch CUDA version
      CUDA_VER=11.3
      export PATH="/usr/local/cuda-$CUDA_VER/bin:$PATH"
      export LD_LIBRARY_PATH=/usr/local/cuda-$CUDA_VER/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
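
The three lines above can be wrapped in a function that refuses to switch when the requested toolkit is not installed. This is a sketch under assumptions: `use_cuda` and the `CUDA_ROOT` override are hypothetical names, and the toolkit is expected at `<prefix>/cuda-<version>` as in the note above.

```shell
# Hypothetical helper: point the current shell at one installed CUDA toolkit.
# CUDA_ROOT is an assumed override for the install prefix (default /usr/local).
use_cuda() {
    ver="$1"
    root="${CUDA_ROOT:-/usr/local}/cuda-$ver"
    if [ ! -d "$root" ]; then
        echo "use_cuda: CUDA $ver not found at $root" >&2
        return 1
    fi
    # Prepend so this toolkit wins over any other CUDA already on the paths.
    export PATH="$root/bin:$PATH"
    export LD_LIBRARY_PATH="$root/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

After defining it in the shell (for example in `~/.bashrc`), `use_cuda 11.3 && nvcc --version` should report the selected release. Each call prepends another entry, so repeated switching in one shell makes the paths grow.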
      
    • Check NVENC/NVDEC status
      nvidia-smi dmon
      

      See the enc and dec columns (utilization in %).

    • NCCL distributed training hangs (freezes)
      export NCCL_P2P_DISABLE="1"
      
    • Broken pipe (distributed training with NCCL)
      Run training with these environment variables
      NCCL_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO torchrun ...
      

      to find the socket interface name (e.g. eno1) in the log:

      NCCL INFO NET/IB : No device found.
      rnd3:77634:79720 [0] NCCL INFO NET/Socket : Using [0]eno1:10.9.3.241<0>
      rnd3:77634:79720 [0] NCCL INFO Using network Socket
      

      On the other nodes, run with the environment variable

      NCCL_SOCKET_IFNAME=eno1    
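
Reading the interface name out of the log by eye works, but it can also be extracted with `sed`. The sketch below assumes the log line has the same shape as the `NET/Socket : Using [0]eno1:...` sample above; other NCCL versions may format it differently. The name `nccl_ifname` is hypothetical, and only the first interface on each line is captured.

```shell
# Hypothetical helper: read an NCCL debug log on stdin and print the
# interface name from each "NET/Socket : Using [N]<ifname>:<ip>" line.
nccl_ifname() {
    sed -n 's|.*NET/Socket : Using \[[0-9]*]\([^:]*\):.*|\1|p' | sort -u
}
```

For example, after saving the training output to a file such as `train.log`, `nccl_ifname < train.log` prints the name to pass as `NCCL_SOCKET_IFNAME` on the other nodes.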
      
    • Install CMake from source
      version=3.23
      build=2 ## don't modify from here
      mkdir ~/temp
      cd ~/temp
      wget https://cmake.org/files/v$version/cmake-$version.$build.tar.gz
      tar -xzvf cmake-$version.$build.tar.gz
      cd cmake-$version.$build/
      ./bootstrap
      make -j8
      sudo make install
      
    • Install NCCL Backend (Distributed training)
      wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
      sudo dpkg -i cuda-keyring_1.0-1_all.deb
      sudo apt-get update
      sudo apt install libnccl2 libnccl-dev
      
    • Install MXNet from source
      git clone --recursive --branch 1.9.1 https://github.com/apache/incubator-mxnet.git mxnet
      cd mxnet
      cp config/linux_gpu.cmake config.cmake
      rm -rf build
      mkdir -p build && cd build
      cmake -DUSE_CUDA=ON -DUSE_CUDNN=OFF -DUSE_MKL_IF_AVAILABLE=OFF -DUSE_MKLDNN=OFF -DUSE_OPENMP=OFF -DUSE_OPENCV=ON -DUSE_BLAS=open ..
      make -j32
      cd ../python
      pip install --user -e .
      
    • TensorFlow could not load dynamic library 'cudart64_101.dll'
      In this example TensorFlow requires CUDA 10.1. Either switch to CUDA 10.1 or install a TensorFlow version that is compatible with your CUDA version; see the compatibility table: https://www.tensorflow.org/install/source#gpu
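
The required CUDA version can be read from the library name in the error message. The sketch below assumes the naming pattern implied by the example above, where the digits after `cudart64_` are the major version followed by a single minor digit (`101` means 10.1). The function name `cudart_version` is hypothetical.

```shell
# Hypothetical helper: turn a name like cudart64_101.dll into "10.1".
# Assumes the suffix is <major digits><one minor digit>.
cudart_version() {
    digits="$(printf '%s\n' "$1" | sed -n 's/^cudart64_\([0-9][0-9][0-9]*\)\.dll$/\1/p')"
    # Reject names that do not follow the assumed pattern.
    [ -n "$digits" ] || return 1
    major="${digits%?}"            # everything except the last digit
    minor="${digits#"$major"}"     # the last digit
    printf '%s.%s\n' "$major" "$minor"
}
```

For example, `cudart_version cudart64_101.dll` prints `10.1`, which can then be compared against the TensorFlow compatibility table linked above.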
    • Fix DeepStream (6.2+) FFmpeg/OpenCV installation
      Fixes undefined-reference and not-found errors for libavcodec, libavutil, libvpx, ...
      apt-get install --reinstall --no-install-recommends -y libavcodec58 libavcodec-dev libavformat58 libavformat-dev libavutil56 libavutil-dev gstreamer1.0-libav
      apt install --reinstall gstreamer1.0-plugins-good
      apt install --reinstall libvpx6 libx264-155 libx265-179 libmpg123-0 libmpeg2-4 libmpeg2encpp-2.1-0
      gst-inspect-1.0 | grep 264
      rm ~/.cache/gstreamer-1.0/registry.x86_64.bin
      apt install --reinstall libx264-155
      apt-get install gstreamer1.0-libav
      apt-get install --reinstall gstreamer1.0-plugins-ugly
      
    • Gstreamer pipeline to convert MP4-MP4 with re-encoding
      gst-launch-1.0 filesrc location="<path-to-input>" ! qtdemux ! video/x-h264 ! h264parse ! avdec_h264 ! videoconvert ! x264enc ! h264parse ! qtmux ! filesink location=<path-to-output>
      
    • Gstreamer pipeline to convert RTSP-RTMP
      gst-launch-1.0 rtspsrc location='rtsp://<path-to-rtsp-input>' ! rtph264depay ! h264parse ! flvmux ! rtmpsink location='rtmp://<path-to-rtmp-output>'
      
    • Gstreamer pipeline to convert RTSP-RTMP with reduced resolution
      gst-launch-1.0 rtspsrc location='rtsp://<path-to-rtsp-input>' ! rtpbin ! rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! videoscale ! video/x-raw,width=640,height=640 ! x264enc ! h264parse ! flvmux streamable=true ! rtmpsink location='rtmp://<path-to-rtmp-output>'