diff --git a/Popular_Models_Guide/Llama2/trtllm_guide.md b/Popular_Models_Guide/Llama2/trtllm_guide.md
index 7d91dec7..3936d43b 100644
--- a/Popular_Models_Guide/Llama2/trtllm_guide.md
+++ b/Popular_Models_Guide/Llama2/trtllm_guide.md
@@ -35,21 +35,21 @@ Clone the repo of the model with weights and tokens [here](https://huggingface.c
 
 ## Installation
 
-Launch Triton docker container with TensorRT-LLM backend 
+Launch Triton docker container with TensorRT-LLM backend
 ```docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-py3 bash```
 
-Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with Tensorrt-LLM Backend if you want to build a specialized container. 
+Alternatively, you can follow instructions [here](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md) to build Tritonserver with Tensorrt-LLM Backend if you want to build a specialized container.
 
 Don't forget to allow gpu usage when you launch the container.
 
 ## Create Engines for each model [skip this step if you already have an engine]
-TensorRT-LLM requires each model to be compiled for the configuration you need before running. 
-To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want. 
+TensorRT-LLM requires each model to be compiled for the configuration you need before running.
+To do so, before you run your model for the first time on Tritonserver you will need to create a TensorRT-LLM engine for the model for the configuration you want.
 To do so, you will need to complete the following steps:
 
 1. Install Tensorrt-LLM python package
    ```bash
-   # TensorRT-LLM is required for generating engines. 
+   # TensorRT-LLM is required for generating engines.
    pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
    mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
    cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
@@ -78,7 +78,7 @@ To do so, you will need to complete the following steps:
    --world-size 1
    ```
 
-   > Optional: You can check test the output of the model with `run.py` 
+   > Optional: You can check test the output of the model with `run.py`
    > located in the same llama examples folder.
    >
    >   ```bash
@@ -94,10 +94,10 @@ To run our Llama2-7B model, you will need to:
 
 1. Copy over the inflight batcher models repository
    ```bash
-   cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/. 
+   cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
    ```
 
-2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps 
+2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps
 
    ```bash
    # preprocessing
@@ -105,13 +105,13 @@ To run our Llama2-7B model, you will need to:
    sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
    sed -i 's#${tokenizer_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
    sed -i 's#${tokenizer_type}#auto#' /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
-   
+
    sed -i 's#${decoupled_mode}#false#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
    sed -i 's#${engine_dir}#//1-gpu/#' /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
    ```
    Also, ensure that the `gpt_model_type` parameter is set to `inflight_fused_batching`
 
-3. Launch Tritonserver 
+3. Launch Tritonserver
 
    ```bash
    tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm
diff --git a/README.md b/README.md
index 8d158e6f..bf863f25 100644
--- a/README.md
+++ b/README.md
@@ -15,9 +15,9 @@ The focus of these examples is to demonstrate deployment for models trained with
 | --------------- | ------------ | --------------- | --------------- | --------------- |
 
 #### Supported Model Table
-The table below contains a 
+The table below contains a
 | Model Name | Supported with HuggingFace format | Supported with TensorRT-LLM Backend | Supported with vLLM Backend |
-| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: | 
+| :-------------: | :------------------------------: | :----------------------------------: | :-------------------------: |
 | [Llama2-7B](https://ai.meta.com/llama/) | [Llama-2](https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main) |[tutorial](Popular_Models_Guide/Llama2/trtllm_guide.md) | :grey_question:|
 | [Persimmon-8B](https://www.adept.ai/blog/persimmon-8b) |:white_check_mark: |:grey_question: | :white_check_mark: |
 | [Falcon-180B](https://falconllm.tii.ae/index.html) |:white_check_mark: |:grey_question: | :white_check_mark: |
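
Once Tritonserver is launched as in the guide touched by this diff, a quick smoke test of the deployed pipeline can be run from the host. This is a minimal sketch, not part of the diff above: it assumes the `ensemble` model shipped with `all_models/inflight_batcher_llm` is loaded and that Triton is listening on its default HTTP port 8000; adjust the model name, port, and prompt to your setup.

```bash
# Check that the server reports ready (prints the HTTP status code; 200 means ready).
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready

# Send a prompt through the ensemble (preprocessing -> tensorrt_llm -> postprocessing).
# The "ensemble" model name and these request fields follow the inflight_batcher_llm
# example configs; they may differ if your model repository was customized.
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
  '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```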