Always killed when building TensorRT engine #743

Closed
Burning-XX opened this issue Dec 26, 2023 · 13 comments
Labels: stale, triaged (Issue has been triaged by maintainers)


Burning-XX commented Dec 26, 2023

I am trying to run llama-7b with TensorRT-LLM. I build the TensorRT engine as follows:
python3 build.py --model_dir /opt/llms/llama-7b \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --output_dir /opt/trtModel/llama/1-gpu
but the program is always killed, and I am confused.

(screenshots of the build output before the process is killed)
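A bare "Killed" with no Python traceback is typically the Linux OOM killer terminating the build process once host (CPU) RAM runs out, rather than a TensorRT error. A minimal sketch to confirm this, assuming a Linux host where the kernel log is readable (`dmesg` may require root inside a container):

```python
# Rough check (assumes Linux): how much host RAM is still available, and has the
# kernel's OOM killer terminated a process recently?
import subprocess

def mem_available_gib():
    # /proc/meminfo reports MemAvailable in kB
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024**2
    return 0.0

print(f"MemAvailable: {mem_available_gib():.1f} GiB")

# Lines such as "Out of memory: Killed process <pid> (python3)" confirm an OOM kill.
dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in dmesg.splitlines():
    if "Out of memory" in line or "oom-kill" in line:
        print(line)
```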

Luis-xu commented Dec 26, 2023

@Burning-XX I think this error is caused by insufficient CPU memory size.

Burning-XX (Author)

> @Burning-XX I think this error is caused by insufficient CPU memory size.

Do you mean GPU memory or CPU memory? And which parameter could I change to make the build succeed if I do not have enough memory?


Luis-xu commented Dec 26, 2023

@Burning-XX I don't know which branch you are using; I solved this problem on version 0.5.0 by following #102 (comment).

byshiue self-assigned this Dec 27, 2023
jdemouth-nvidia (Collaborator)

How much CPU memory do you have in your system?

jdemouth-nvidia added the triaged label Dec 27, 2023
NylaWorker

Hello, I am seeing this too. Here is my nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:05:00.0 Off | Off |
| N/A 42C P0 46W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:06:00.0 Off | Off |
| N/A 38C P0 44W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

and this for top:

top - 14:43:49 up 17:19, 0 users, load average: 0.00, 0.00, 0.02
Tasks: 2 total, 1 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 32089.1 total, 30334.8 free, 1381.9 used, 372.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30280.9 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                
  1 root      20   0    9136   6272   1792 S   0.0   0.0   0:00.15 bash                                                                                                                                                                   
462 root      20   0    7800   3584   2944 R   0.0   0.0   0:00.01 top                                                                                                                                                                    

I don't think memory is my issue, yet building the engine is always killed:

python ../llama/build.py --model_dir ./Mixtral-8x7B-v0.1 \
    --use_inflight_batching \
    --enable_context_fmha \
    --use_gemm_plugin \
    --world_size 2 \
    --pp_size 2 \
    --output_dir ./trt_engines/mixtral/PP
[12/27/2023-14:45:54] [TRT-LLM] [I] Using GPT attention plugin for inflight batching mode. Setting to default 'float16'
[12/27/2023-14:45:54] [TRT-LLM] [I] Using remove input padding for inflight batching mode.
[12/27/2023-14:45:54] [TRT-LLM] [I] Using paged KV cache for inflight batching mode.
You are using a model of type mixtral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.
[12/27/2023-14:45:54] [TRT-LLM] [I] Serially build TensorRT engines.
[12/27/2023-14:45:55] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 129, GPU 427 (MiB)
[12/27/2023-14:45:58] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2238, GPU 777 (MiB)
[12/27/2023-14:45:58] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/27/2023-14:45:58] [TRT-LLM] [I] [MemUsage] Rank 0 Engine build starts - Allocated Memory: Host 2.5429 (GiB) Device 0.7592 (GiB)
Killed

Can you help me understand why? Am I doing something wrong? I am running inside the Docker container and it built fine.
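For what it's worth, a back-of-the-envelope estimate already explains the Mixtral case: with roughly 47B parameters, the fp16 weights alone are around 87 GiB, well above the 32 GiB of host RAM shown in the `top` output above, and the converter needs temporary copies on top of that. A rough sketch (parameter counts are approximate):

```python
# Back-of-the-envelope weight sizes. Peak host RAM during checkpoint conversion /
# engine build is higher still, because the HF loader and the converter hold
# additional copies of the tensors at the same time.
def weights_gib(n_params, bytes_per_param=2):  # 2 bytes per parameter for fp16/bf16
    return n_params * bytes_per_param / 1024**3

for name, n_params in [("llama-7b", 7.0e9), ("Qwen-7B", 7.7e9), ("Mixtral-8x7B", 46.7e9)]:
    print(f"{name}: ~{weights_gib(n_params):.0f} GiB of fp16 weights")
# llama-7b: ~13 GiB, Qwen-7B: ~14 GiB, Mixtral-8x7B: ~87 GiB
```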

NylaWorker

I was doing this with Mixtral, but I tried with Qwen and I am seeing the same issue:
python build.py --hf_model_dir ./tmp/Qwen/7B/ --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --output_dir ./tmp/Qwen/7B/trt_engines/fp16/1-gpu/
[12/27/2023-15:13:09] [TRT-LLM] [I] Serially build TensorRT engines.
[12/27/2023-15:13:09] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 116, GPU 427 (MiB)
[12/27/2023-15:13:12] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1973, GPU +350, now: CPU 2225, GPU 777 (MiB)
[12/27/2023-15:13:12] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/27/2023-15:13:22] [TRT-LLM] [I] Loading HF QWen ... from ./tmp/Qwen/7B/
/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py:943: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
[12/27/2023-15:13:29] The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
[12/27/2023-15:13:29] Try importing flash-attention for faster inference...
[12/27/2023-15:13:29] Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
[12/27/2023-15:13:29] Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 7.42it/s]
[12/27/2023-15:13:31] [TRT-LLM] [I] HF QWen loaded. Total time: 00:00:08
[12/27/2023-15:13:31] [TRT-LLM] [I] Loading weights from HF QWen...
Converting...: 92%|███████████████████████▉ | 239/259 [00:26<00:01, 10.65it/s]Killed
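The checkpoint shards load fine; the process dies near the end of the weight-conversion loop, which is where peak host RAM is highest (the HF model plus the converted tensors held at once). A hypothetical wrapper, not part of TensorRT-LLM, that reports the peak resident memory of the command it runs, so the host can be sized before the next attempt:

```python
# Hypothetical wrapper (assumed name: peak_rss.py): run a command and report the
# peak resident set size of its process tree. If the child is OOM-killed, the
# reported peak is only a lower bound on what the conversion actually needs.
import resource
import subprocess
import sys

cmd = sys.argv[1:]  # e.g. python3 peak_rss.py python3 build.py --hf_model_dir ./tmp/Qwen/7B/ ...
proc = subprocess.run(cmd)
peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss  # reported in kB on Linux
print(f"exit code: {proc.returncode}, peak RSS: {peak_kib / 1024**2:.1f} GiB")
```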

Burning-XX (Author)

> How much CPU memory do you have in your system?

32 GB

Burning-XX (Author)

> 0.5.0

I am also on release 0.5.0.

byshiue (Collaborator) commented Dec 28, 2023

Could you try on a machine with larger RAM? Also, we suggest trying the latest main branch or release branch.
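A hedged stopgap if a larger machine is not available: the `top` output above shows `MiB Swap: 0.0 total`, and adding swap on the host sometimes lets the conversion finish (much more slowly) when the peak only slightly exceeds physical RAM. A small sketch, assuming Linux, that just reports what is currently configured:

```python
# Sketch (assumes Linux): report physical RAM and configured swap.
# Swap can be added on the host with standard tools, e.g.
#   fallocate -l 64G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
# (a container usually cannot do this itself). More RAM is still the real fix.
def meminfo_gib(key):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1]) / 1024**2  # kB -> GiB
    return 0.0

print(f"MemTotal:  {meminfo_gib('MemTotal'):.1f} GiB")
print(f"SwapTotal: {meminfo_gib('SwapTotal'):.1f} GiB")
```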


NylaWorker commented Dec 28, 2023

It would be good to have per-model hardware requirements; the errors are unclear. I experimented hoping for lower memory consumption, which was highly inefficient.

hello-11 added the stale label Nov 18, 2024
hello-11 (Collaborator)

@Burning-XX Do you still have the problem? If not, we will close it soon.

DeekshithaDPrakash

Same issue when creating the checkpoint for Mistral:

python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${LLAMA_MODEL} --output_dir ${UNIFIED_CKPT_PATH} --dtype float16

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024080600
0.12.0.dev2024080600
286it [00:50, 5.63it/s]
Total time of reading and converting 52.12702465057373 s
Killed
