This project demonstrates how to use the llguidance library for constrained output with NVIDIA TensorRT-LLM, implementing a server with an OpenAI-compatible REST API.
The server supports regular completions and chat endpoints with JSON schema enforcement ("Structured Output" in the OpenAI docs), as well as full context-free grammars using the Guidance library.
This server is similar in spirit to the TensorRT-LLM OpenAI server example, but it is Python-free and supports constrained output. Like that example, it does not use the NVIDIA Triton Inference Server.
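For illustration, once the server is running (see the steps below), a schema-constrained chat request might look like the following. This is a sketch assuming the server exposes the standard OpenAI `/v1/chat/completions` route and accepts the OpenAI `response_format`/`json_schema` payload shape; the port and model name are placeholders.

```bash
# Sketch of a structured-output request; port, model name and exact accepted
# fields are assumptions -- check the server's --help / docs for specifics.
curl -s http://localhost:3001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {"role": "user", "content": "Extract the name and age from: Alice is 30."}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"}
          },
          "required": ["name", "age"],
          "additionalProperties": false
        }
      }
    }
  }'
```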
You will need a Linux machine with an NVIDIA GPU and Docker set up to use the nvidia-docker runtime.
Overview of steps:
- build the `llgtrt_prod` docker container
- build a trtllm engine (likely using the container)
- create configuration files
- use the container to run the engine
The build script will initialize submodules if missing.
```bash
./docker/build.sh
```
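If you prefer to fetch the submodules yourself before building, the standard git invocation works; this is optional, since the build script initializes them when missing.

```bash
# optional: initialize submodules manually before running ./docker/build.sh
git submodule update --init --recursive
```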
This follows the TensorRT-LLM Quick-start, adjusted for running in the `llgtrt_prod` container.
First, use the `llgtrt_prod` container to run bash:
```bash
./docker/bash.sh --volume /path/to/hf-models:/models
```
The following steps are done inside the container:

```bash
# convert HF model to a checkpoint
python3 /opt/TensorRT-LLM-examples/llama/convert_checkpoint.py \
    --dtype bfloat16 \
    --model_dir /models/Meta-Llama-3.1-8B-Instruct \
    --output_dir /models/model-ckpt \
    --tp_size 1

# then, run trtllm build
trtllm-build --checkpoint_dir /models/model-ckpt \
    --gemm_plugin bfloat16 \
    --output_dir /models/model-engine \
    --use_paged_context_fmha enable

# clean up the checkpoint (optional)
rm -rf /models/model-ckpt

# finally, copy tokenizer.json
cp /models/Meta-Llama-3.1-8B-Instruct/tokenizer.json /models/model-engine

# exit the container
exit
```
Make sure to modify the path to the input model (it needs to contain the HF Transformers `config.json`, as well as the `.safetensors` files and `tokenizer.json`).
If you're running on more than one GPU, modify the `--tp_size` argument.
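As a rough illustration (file names will vary by model), the input model directory is expected to contain at least the following:

```bash
ls /models/Meta-Llama-3.1-8B-Instruct
# config.json
# tokenizer.json
# model-00001-of-00004.safetensors
# ...
```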
If you are running a chat-tuned model, you will need `/models/model-engine/chat.json`.
You may copy one of the chat config files in this repository, or use one as a template to create your own.
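For example (the source path below is hypothetical; pick whichever chat config file in this repository matches your model):

```bash
# hypothetical path -- copy the chat config that matches your model next to the engine
cp chat_configs/llama-3.1.json /models/model-engine/chat.json
```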
You can also modify TensorRT-LLM's runtime configuration with a `runtime.json` file, and the `llguidance_parser` configuration with `llguidance.json`.
TODO add more docs
```bash
PORT=3001 ./docker/run.sh /path/to/hf-models/model-engine
```

The command will print the actual `docker run` invocation on its first line, in case you want to invoke it directly later.
`PORT` defaults to 3000.
You can pass additional arguments after the engine path. Try running `./docker/run.sh /path/to/hf-models/model-engine --help` for more info.
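Once the server is up, a quick smoke test is a plain completion request. This sketch assumes the standard OpenAI completions route and the `PORT` chosen above; the model name is a placeholder.

```bash
# simple check against the running server (port and model name are placeholders)
curl -s http://localhost:3001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model", "prompt": "Hello, my name is", "max_tokens": 16}'
```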
The `--help` output has up-to-date info on the `chat.json` and `runtime.json` files:
the options can be specified either in these files (replace `-` with `_`) or on the command line.
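For instance, an option listed by `--help` as `--some-option` could be set either on the command line or as `some_option` in the corresponding JSON file. The option name below is made up purely to illustrate the dash-to-underscore mapping; use the real names from `--help`.

```bash
# command-line form (hypothetical option name):
./docker/run.sh /path/to/hf-models/model-engine --some-option 42

# equivalent file form: dashes become underscores
cat > /path/to/hf-models/model-engine/runtime.json <<'EOF'
{
  "some_option": 42
}
EOF
```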
First, build the Docker container to be used in the dev container. If you have already followed the steps above, you can skip this.
Otherwise, run `./docker/build.sh llgtrt_dev`.
Next, re-open the folder in the container in VSCode.
The basic structure of the server borrows inspiration from npuichigo/openai_trtllm, which has similar aims but uses the NVIDIA Triton Inference Server wrapping TensorRT-LLM.
- constrained output currently requires n=1
- stop sequence support
- don't use the GUARANTEED_NO_EVICT capacity scheduler policy by default
- add script for mpirun auto-detecting engine size
- multi-LoRA?
- text template for JSON schema (Please follow this schema: ...)
- test with TP=4
- test phi-3.5
- multi-modal input
- when streaming, and stop is set, we need to buffer the output so as not to return the stop sequence itself
- unwind panic for mask computation etc
- logprobs
- logprobs with argmax sampling and constraints
- expose the 'surprise' measure somehow