Always killed when building TensorRT engine #743

Closed
Burning-XX opened this issue Dec 26, 2023 · 13 comments
Labels: stale, triaged (Issue has been triaged by maintainers)


Burning-XX commented Dec 26, 2023

I am trying to run llama-7b with TensorRT-LLM. I build the TensorRT engine as follows:
python3 build.py --model_dir /opt/llms/llama-7b \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --output_dir /opt/trtModel/llama/1-gpu
but the program is always killed, and I am confused.

(screenshots of the build output before the process is killed)
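A bare "Killed" with no Python traceback is typically the Linux OOM killer terminating the build process once host (CPU) RAM runs out, rather than a TensorRT error. A minimal sketch to confirm this, assuming a Linux host where the kernel log is readable (`dmesg` may require root inside a container):

```python
# Rough check (assumes Linux): how much host RAM is still available, and has the
# kernel's OOM killer terminated a process recently?
import subprocess

def mem_available_gib():
    # /proc/meminfo reports MemAvailable in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2
    return 0.0

print(f"MemAvailable: {mem_available_gib():.1f} GiB")

# Lines such as "Out of memory: Killed process <pid> (python3)" confirm an OOM kill.
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "Out of memory" in line or "oom-kill" in line:
        print(line)
```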

Luis-xu commented Dec 26, 2023

@Burning-XX I think this error is caused by insufficient CPU memory size.

Burning-XX (Author)

> @Burning-XX I think this error is caused by insufficient CPU memory size.

Do you mean GPU memory or CPU memory? And which parameter could I change to make the build succeed if I do not have enough memory?


Luis-xu commented Dec 26, 2023

@Burning-XX I don't know which branch you are using; I solved this problem on version 0.5.0 by following #102 (comment).

byshiue self-assigned this Dec 27, 2023
jdemouth-nvidia (Collaborator)

How much CPU memory do you have in your system?

jdemouth-nvidia added the triaged label Dec 27, 2023
NylaWorker

Hello, I am seeing this too. Here is my nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:05:00.0 Off | Off |
| N/A 42C P0 46W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:06:00.0 Off | Off |
| N/A 38C P0 44W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

and this for top:

top - 14:43:49 up 17:19, 0 users, load average: 0.00, 0.00, 0.02
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32089.1 total, 30334.8 free, 1381.9 used, 372.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30280.9 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                
  1 root      20   0    9136   6272   1792 S   0.0   0.0   0:00.15 bash                                                                                                                                                                   
462 root      20   0    7800   3584   2944 R   0.0   0.0   0:00.01 top                                                                                                                                                                    

I don't think memory is my issue, yet building the engine is always killed:

python ../llama/build.py --model_dir ./Mixtral-8x7B-v0.1 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin \
    --world_size 2 \
    --pp_size 2 \
    --output_dir ./trt_engines/mixtral/PP
[12/27/2023-14:45:54] [TRT-LLM] [I] Using GPT attention plugin for inflight batching mode. Setting to default 'float16'
[12/27/2023-14:45:54] [TRT-LLM] [I] Using remove input padding for inflight batching mode.
[12/27/2023-14:45:54] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[12/27/2023-14:45:54] [TRT-LLM] [I] Serially build TensorRT engines.
[12/27/2023-14:45:55] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 129, GPU 427 (MiB)
[12/27/2023-14:45:58] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2238, GPU 777 (MiB)
[12/27/2023-14:45:58] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/27/2023-14:45:58] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.5429 (GiB) Device 0.7592 (GiB)
Killed

Can you help me understand why? Am I doing something wrong? I am running inside the Docker container and it built fine.
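For what it's worth, a back-of-the-envelope estimate already explains the Mixtral case: with roughly 47B parameters, the fp16 weights alone are around 87 GiB, well above the 32 GiB of host RAM shown in the `top` output above, and the converter needs temporary copies on top of that. A rough sketch (parameter counts are approximate):

```python
# Back-of-the-envelope weight sizes. Peak host RAM during checkpoint conversion /
# engine build is higher still, because the HF loader and the converter hold
# additional copies of the tensors at the same time.
def weights_gib(n_params, bytes_per_param=2):  # 2 bytes per parameter for fp16/bf16
    return n_params * bytes_per_param / 1024**3

for name, n_params in [("llama-7b", 7.0e9), ("Qwen-7B", 7.7e9), ("Mixtral-8x7B", 46.7e9)]:
    print(f"{name}: ~{weights_gib(n_params):.0f} GiB of fp16 weights")
# llama-7b: ~13 GiB, Qwen-7B: ~14 GiB, Mixtral-8x7B: ~87 GiB
```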

NylaWorker

I was doing this with Mixtral, but I tried with Qwen and I am seeing the same issue:
python build.py --hf_model_dir ./tmp/Qwen/7B/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/
[12/27/2023-15:13:09] [TRT-LLM] [I] Serially build TensorRT engines.
[12/27/2023-15:13:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 116, GPU 427 (MiB)
[12/27/2023-15:13:12] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2225, GPU 777 (MiB)
[12/27/2023-15:13:12] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/27/2023-15:13:22] [TRT-LLM] [I] Loading HF QWen ... from ./tmp/Qwen/7B/
/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py:943: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
[12/27/2023-15:13:29] The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
[12/27/2023-15:13:29] Try importing flash-attention for faster inference...
[12/27/2023-15:13:29] Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
[12/27/2023-15:13:29] Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 7.42it/s]
[12/27/2023-15:13:31] [TRT-LLM] [I] HF QWen loaded. Total time: 00:00:08
[12/27/2023-15:13:31] [TRT-LLM] [I] Loading weights from HF QWen...
Converting...: 92%|███████████████████████▉ | 239/259 [00:26<00:01, 10.65it/s]Killed
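The checkpoint shards load fine; the process dies near the end of the weight-conversion loop, which is where peak host RAM is highest (the HF model plus the converted tensors held at once). A hypothetical wrapper, not part of TensorRT-LLM, that reports the peak resident memory of the command it runs, so the host can be sized before the next attempt:

```python
# Hypothetical wrapper (assumed name: peak_rss.py): run a command and report the
# peak resident set size of its process tree. If the child is OOM-killed, the
# reported peak is only a lower bound on what the conversion actually needs.
import resource
import subprocess
import sys

cmd = sys.argv[1:]  # e.g. python3 peak_rss.py python3 build.py --hf_model_dir ./tmp/Qwen/7B/ ...
proc = subprocess.run(cmd)
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss  # reported in kB on Linux
print(f"exit code: {proc.returncode}, peak RSS: {peak_kib / 1024**2:.1f} GiB")
```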

Burning-XX (Author)

> How much CPU memory do you have in your system?

32 GB

Burning-XX (Author)

> 0.5.0

I am also on release 0.5.0.

byshiue (Collaborator) commented Dec 28, 2023

Could you try on a machine with larger RAM? Also, we suggest trying the latest main branch or release branch.
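A hedged stopgap if a larger machine is not available: the `top` output above shows `MiB Swap: 0.0 total`, and adding swap on the host sometimes lets the conversion finish (much more slowly) when the peak only slightly exceeds physical RAM. A small sketch, assuming Linux, that just reports what is currently configured:

```python
# Sketch (assumes Linux): report physical RAM and configured swap.
# Swap can be added on the host with standard tools, e.g.
#   fallocate -l 64G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
# (a container usually cannot do this itself). More RAM is still the real fix.
def meminfo_gib(key):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1]) / 1024**2  # kB -> GiB
    return 0.0

print(f"MemTotal:  {meminfo_gib('MemTotal'):.1f} GiB")
print(f"SwapTotal: {meminfo_gib('SwapTotal'):.1f} GiB")
```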


NylaWorker commented Dec 28, 2023

It would be good to have per-model hardware requirements; the errors are unclear. I experimented hoping for lower memory consumption, which was highly inefficient.

hello-11 added the stale label Nov 18, 2024
hello-11 (Collaborator)

@Burning-XX Do you still have the problem? If not, we will close it soon.

DeekshithaDPrakash

Same issue when creating the checkpoint for Mistral:

python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
0.12.0.dev2024080600
286it [00:50, 5.63it/s]
Total time of reading and converting 52.12702465057373 s
Killed
