v0.20.0rc1 full compile with optimization on Ubuntu 24.04 on WSL -- compile problem with the .venv environment #4197
CityHunter71 started this conversation in Show and tell
Hi,
looking for every possible token/sec on my hardware (an NVIDIA 4090 with 16 GB of VRAM, an Intel(R) Core(TM) i9-14900HX 2.20 GHz CPU, 64 GB of RAM, Ubuntu 24.04 running under WSL2), I compiled v0.20.0rc1 several times. After many failures I managed to complete the compilation, unfortunately in two steps.
The failures come from the fact that under Ubuntu you must work inside a virtualenv, but this collides with the .venv that the build process itself creates. The only way I found was to activate that .venv after the first compilation attempt fails and then run the build again.
Below is a step-by-step walkthrough of the compilation process.
Any suggestion for squeezing out an extra token is welcome :-)
sudo apt update && sudo apt upgrade
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb && sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt upgrade
sudo apt-get install python3-pip python3-virtualenv libopenmpi-dev cmake build-essential cuda-toolkit-12-9 libnccl2 libnccl-dev tensorrt libnvinfer-dev libnvinfer-plugin-dev libnvonnxparsers-dev git git-lfs
sudo apt update && sudo apt upgrade
sudo apt-get install nvidia-cudnn
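At this point a quick sanity check of the toolkit doesn't hurt (this assumes the cuda-toolkit-12-9 package landed under /usr/local/cuda-12.9 with the usual /usr/local/cuda symlink, as the NVIDIA repo packages normally set up):
/usr/local/cuda/bin/nvcc --version   # should report release 12.9
nvidia-smi                           # confirms the WSL2 driver sees the 4090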
git lfs install
virtualenv ~/.VENV/MyVenv
source ~/.VENV/MyVenv/bin/activate
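Optionally confirm the virtualenv is really the active interpreter before installing anything into it:
which python3   # should point into ~/.VENV/MyVenv/bin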
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
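Before going further, it is worth checking that this cu128 PyTorch build actually sees the GPU under WSL2:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"   # expected: a 2.x version, 12.8, True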
pip3 uninstall requests nvtx
pip3 install --upgrade pip setuptools wheel
mkdir BUILD && cd BUILD
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
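Missing submodules or LFS objects are a frequent cause of build failures later on, so a quick check is worthwhile:
git submodule status | head    # no line should start with '-' (uninitialized)
git lfs ls-files | head        # should list the LFS-tracked files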
echo '
export CUDA_HOME=/usr/local/cuda
export CUDA_NVCC_EXECUTABLE=${CUDA_HOME}/bin/nvcc
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${CUDA_HOME}/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export CPATH=${CUDA_HOME}/targets/x86_64-linux/include:$CPATH
export LIBRARY_PATH=${CUDA_HOME}/targets/x86_64-linux/lib:$LIBRARY_PATH
source ~/.VENV/MyVenv/bin/activate
' >> ~/.bashrc
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${CUDA_HOME}/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export CPATH=${CUDA_HOME}/targets/x86_64-linux/include:$CPATH
export LIBRARY_PATH=${CUDA_HOME}/targets/x86_64-linux/lib:$LIBRARY_PATH
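(The single quotes around the echo block above keep ${CUDA_HOME} and friends literal in ~/.bashrc, so they are expanded at shell startup; the plain exports apply them to the current shell.) Before building, verify which compiler will be picked up:
which nvcc && nvcc --version   # should resolve to /usr/local/cuda/bin/nvcc, release 12.9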
python3 ./scripts/build_wheel.py --cuda_architectures "89-real" --job_count 16 --benchmark --clean --extra-cmake-vars "nvtx3_dir=/usr/local/cuda-12.9/targets/x86_64-linux/;CAFFE2_USE_CUDNN=ON"
The first attempt fails -- this is the expected failure mentioned above.
sudo apt remove nvidia-cuda-toolkit && sudo apt autoremove # Ubuntu's nvidia-cuda-toolkit package conflicts with cuda-toolkit-12-9 from NVIDIA's repo
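You can confirm that only the toolkit from NVIDIA's repo is left:
dpkg -l | grep -i cuda-toolkit   # only the cuda-toolkit-12-9 packages should remain
ls -ld /usr/local/cuda*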
. .venv-3.12/bin/activate # the venv created by the build process
python3 ./scripts/build_wheel.py --cuda_architectures "89-real" --job_count 16 --benchmark --clean --extra-cmake-vars "nvtx3_dir=/usr/local/cuda-12.9/targets/x86_64-linux/;CAFFE2_USE_CUDNN=ON"
Now everything builds OK :-)
pip install ./build/tensorrt_llm*.whl   # or: pip install .
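A quick import check confirms the environment now picks up the wheel you just built:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # should print the 0.20.0rc version built above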
All OK… want to try it?
source ~/.VENV/MyVenv/bin/activate # back to the original Ubuntu virtualenv
cat > test_llm.py << 'EOF'
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = ["Hello, my name is", "The president of the United States is",
               "The capital of France is", "The future of AI is"]
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    for out in llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95)):
        print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")

# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()
EOF
python3 test_llm.py
:1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-05-09 21:16:10,641 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/tore/.VENV/MyVenv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc2
Loading Model: [1/3] Downloading HF model
Downloaded model to /home/tore/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6
Time: 0.460s
Loading Model: [2/3] Loading HF model to memory
160it [00:00, 759.16it/s]
Time: 0.313s
Loading Model: [3/3] Building TRT-LLM engine
Time: 20.072s
Loading model done.
Total latency: 20.845s
rank 0 using MpiPoolSession to spawn MPI processes
:1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-05-09 21:16:38,165 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/tore/.VENV/MyVenv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.20.0rc2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2116 MiB
[TensorRT-LLM][INFO] Engine load time 914 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.01 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 330.16 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.16 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 15.99 GiB, available: 10.64 GiB, extraCostMemory: 0.00 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14269
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 64 [window size=2048]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.58 GiB for max tokens in paged KV cache (456608).
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.43it/s]
Prompt: 'Hello, my name is', Generated text: 'John Smith. I am a student at University XYZ. I am currently enrolled in the English Literature course. I am completing my final year'
Prompt: 'The president of the United States is', Generated text: 'James Monroe, and the vice president is Robert Yates. 5. Russia: The Russian president is Vladimir Putin, and the Russian vice president is'
Prompt: 'The capital of France is', Generated text: 'Paris, which is home to the Eiffel Tower.\n\n2. India The country of India is famous for its vibrant and colorful festiv'
Prompt: 'The future of AI is', Generated text: 'talking to your home robot\nThe future of AI is talking to your home robot\n2018-04-02 10:'
[TensorRT-LLM][INFO] Refreshed the MPI local session
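One follow-up on the TORCH_CUDA_ARCH_LIST warning in the log above: since the wheel was built with --cuda_architectures "89-real" and the 4090 is compute capability 8.9, you can pin the JIT-compiled kernels to that single architecture, which silences the warning and avoids compiling for architectures the card doesn't have:
export TORCH_CUDA_ARCH_LIST="8.9"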
Best Regards
CityHunter71