v0.20.0rc1 full compile with optimization on Ubuntu 24.04 on WSL -- compile problem with the .venv environment #4197
CityHunter71 started this conversation in Show and tell
Hi,
looking for every possible token/sec on my hardware (an NVIDIA 4090 with 16 GB of VRAM, an Intel(R) Core(TM) i9-14900HX 2.20 GHz CPU, 64 GB of RAM, Ubuntu 24.04 running under WSL2), I compiled v0.20.0rc1 several times. After many failures I managed to complete the compilation, unfortunately in two steps.
The failures come from the fact that under Ubuntu you must work inside a virtualenv, but this collides with the .venv that the build process itself creates. The only way I found was to activate that .venv after the first compilation attempt fails and then run the build again.
Below is a step-by-step walkthrough of the compilation process.
Any suggestion for squeezing out an extra token is welcome :-)
sudo apt update && sudo apt upgrade
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb && sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update && sudo apt upgrade
sudo apt-get install python3-pip python3-virtualenv libopenmpi-dev cmake build-essential cuda-toolkit-12-9 libnccl2 libnccl-dev tensorrt libnvinfer-dev libnvinfer-plugin-dev libnvonnxparsers-dev git git-lfs
sudo apt update && sudo apt upgrade
sudo apt-get install nvidia-cudnn
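At this point a quick sanity check of the toolkit doesn't hurt (this assumes the cuda-toolkit-12-9 package landed under /usr/local/cuda-12.9 with the usual /usr/local/cuda symlink, as the NVIDIA repo packages normally set up):
/usr/local/cuda/bin/nvcc --version   # should report release 12.9
nvidia-smi                           # confirms the WSL2 driver sees the 4090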
git lfs install
virtualenv ~/.VENV/MyVenv
source ~/.VENV/MyVenv/bin/activate
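Optionally confirm the virtualenv is really the active interpreter before installing anything into it:
which python3   # should point into ~/.VENV/MyVenv/bin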
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
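Before going further, it is worth checking that this cu128 PyTorch build actually sees the GPU under WSL2:
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"   # expected: a 2.x version, 12.8, True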
pip3 uninstall requests nvtx
pip3 install --upgrade pip setuptools wheel
mkdir BUILD && cd BUILD
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
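Missing submodules or LFS objects are a frequent cause of build failures later on, so a quick check is worthwhile:
git submodule status | head    # no line should start with '-' (uninitialized)
git lfs ls-files | head        # should list the LFS-tracked files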
echo '
export CUDA_HOME=/usr/local/cuda
export CUDA_NVCC_EXECUTABLE=${CUDA_HOME}/bin/nvcc
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${CUDA_HOME}/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export CPATH=${CUDA_HOME}/targets/x86_64-linux/include:$CPATH
export LIBRARY_PATH=${CUDA_HOME}/targets/x86_64-linux/lib:$LIBRARY_PATH
source ~/.VENV/MyVenv/bin/activate
' >> ~/.bashrc
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${CUDA_HOME}/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
export CPATH=${CUDA_HOME}/targets/x86_64-linux/include:$CPATH
export LIBRARY_PATH=${CUDA_HOME}/targets/x86_64-linux/lib:$LIBRARY_PATH
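(The single quotes around the echo block above keep ${CUDA_HOME} and friends literal in ~/.bashrc, so they are expanded at shell startup; the plain exports apply them to the current shell.) Before building, verify which compiler will be picked up:
which nvcc && nvcc --version   # should resolve to /usr/local/cuda/bin/nvcc, release 12.9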
python3 ./scripts/build_wheel.py --cuda_architectures "89-real" --job_count 16 --benchmark --clean --extra-cmake-vars "nvtx3_dir=/usr/local/cuda-12.9/targets/x86_64-linux/;CAFFE2_USE_CUDNN=ON"
The first attempt fails -- this is the expected failure mentioned above.
sudo apt remove nvidia-cuda-toolkit && sudo apt autoremove # Ubuntu's nvidia-cuda-toolkit package conflicts with cuda-toolkit-12-9 from NVIDIA's repo
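You can confirm that only the toolkit from NVIDIA's repo is left:
dpkg -l | grep -i cuda-toolkit   # only the cuda-toolkit-12-9 packages should remain
ls -ld /usr/local/cuda*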
. .venv-3.12/bin/activate # the venv created by the build process
python3 ./scripts/build_wheel.py --cuda_architectures "89-real" --job_count 16 --benchmark --clean --extra-cmake-vars "nvtx3_dir=/usr/local/cuda-12.9/targets/x86_64-linux/;CAFFE2_USE_CUDNN=ON"
Now everything builds OK :-)
pip install ./build/tensorrt_llm*.whl   # or: pip install .
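A quick import check confirms the environment now picks up the wheel you just built:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # should print the 0.20.0rc version built above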
All OK… want to try it?
source ~/.VENV/MyVenv/bin/activate # back to the original Ubuntu virtualenv
cat > test_llm.py << 'EOF'
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = ["Hello, my name is", "The president of the United States is",
               "The capital of France is", "The future of AI is"]
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    for out in llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95)):
        print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")

# The entry point of the program needs to be protected for spawning processes.
if __name__ == '__main__':
    main()
EOF
python3 test_llm.py
:1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-05-09 21:16:10,641 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/tore/.VENV/MyVenv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc2
Loading Model: [1/3] Downloading HF model
Downloaded model to /home/tore/.cache/huggingface/hub/models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/snapshots/fe8a4ea1ffedaf415f4da2f062534de366a451e6
Time: 0.460s
Loading Model: [2/3] Loading HF model to memory
160it [00:00, 759.16it/s]
Time: 0.313s
Loading Model: [3/3] Building TRT-LLM engine
Time: 20.072s
Loading model done.
Total latency: 20.845s
rank 0 using MpiPoolSession to spawn MPI processes
:1297: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-05-09 21:16:38,165 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/tore/.VENV/MyVenv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc2
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.20.0rc2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2048
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2048) * 22
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2047 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2116 MiB
[TensorRT-LLM][INFO] Engine load time 914 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.01 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2098 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 330.16 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.16 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 15.99 GiB, available: 10.64 GiB, extraCostMemory: 0.00 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 14269
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 64 [window size=2048]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 9.58 GiB for max tokens in paged KV cache (456608).
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9.43it/s]
Prompt: 'Hello, my name is', Generated text: 'John Smith. I am a student at University XYZ. I am currently enrolled in the English Literature course. I am completing my final year'
Prompt: 'The president of the United States is', Generated text: 'James Monroe, and the vice president is Robert Yates. 5. Russia: The Russian president is Vladimir Putin, and the Russian vice president is'
Prompt: 'The capital of France is', Generated text: 'Paris, which is home to the Eiffel Tower.\n\n2. India The country of India is famous for its vibrant and colorful festiv'
Prompt: 'The future of AI is', Generated text: 'talking to your home robot\nThe future of AI is talking to your home robot\n2018-04-02 10:'
[TensorRT-LLM][INFO] Refreshed the MPI local session
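One follow-up on the TORCH_CUDA_ARCH_LIST warning in the log above: since the wheel was built with --cuda_architectures "89-real" and the 4090 is compute capability 8.9, you can pin the JIT-compiled kernels to that single architecture, which silences the warning and avoids compiling for architectures the card doesn't have:
export TORCH_CUDA_ARCH_LIST="8.9"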
Best Regards
CityHunter71