diff --git a/README.md b/README.md
index 42ae31e..2c6bb62 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,13 @@
 # llgtrt (llguidance + TensorRT-LLM)
 
-This project demonstrates how to use the [llguidance library](https://github.com/microsoft/llguidance) for constrained output with [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), implementing a REST server compatible with [OpenAI APIs](https://platform.openai.com/docs/api-reference/introduction).
+This project implements a REST HTTP server with
+an [OpenAI-compatible API](https://platform.openai.com/docs/api-reference/introduction),
+based on [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
+and the [llguidance library](https://github.com/microsoft/llguidance) for constrained output.
 
 The server supports regular completions and chat endpoints with JSON schema enforcement ("Structured Output"), as well as full context-free grammars using the [Guidance library](https://github.com/guidance-ai/guidance).
 
-This server is similar in spirit to the [TensorRT-LLM OpenAI server example](./TensorRT-LLM/examples/apps/openai_server.py), but it is Python-free (implemented in Rust) and includes support for constrained output. Like the example above, it **does not** use the NVIDIA Triton Inference Server.
+This server is similar in spirit to the [TensorRT-LLM OpenAI server example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/apps/openai_server.py), but it is Python-free (implemented in Rust) and includes support for constrained output. Like the example above, it **does not** use the NVIDIA Triton Inference Server.
 
 ## Structured Output
 
@@ -18,6 +21,8 @@ This approach differs from [Outlines](https://github.com/dottxt-ai/outlines) (wh
 
 You will need a Linux machine with an NVIDIA GPU and Docker set up to use the `nvidia-docker` runtime.
 
+So far, we have only tested it on 4xA100 (and a single A100).
+
 ## Running
 
 Overview of steps:
@@ -45,7 +50,10 @@ The build script will initialize submodules if they are missing. It takes about
 
 ### Building the TensorRT-LLM Engine
 
-Follow the [TensorRT-LLM Quick-start](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html), adjusted for running in the `llgtrt/llgtrt` container. First, use the `llgtrt/llgtrt` container to run bash.
+This is based on the [TensorRT-LLM Quick-start](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html).
+Follow the steps below, and refer to that guide if needed.
+
+First, use the `llgtrt/llgtrt` container to run bash.
 
 ```bash
 ./docker/bash.sh --volume /path/to/hf-models:/models
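
The patch above describes chat endpoints with JSON schema enforcement ("Structured Output"). As a rough illustration of what such a request looks like against an OpenAI-compatible server, here is a minimal sketch using the OpenAI `response_format`/`json_schema` convention; the endpoint path, the `$PORT` placeholder, and the exact request layout are assumptions based on the OpenAI API, not details taken from this diff.

```bash
# Sketch only: a chat-completions request with a JSON schema attached.
# Assumes the server follows the OpenAI "response_format" convention;
# $PORT is a placeholder for whatever port the llgtrt server listens on.
curl "http://localhost:$PORT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [
      {"role": "user", "content": "Name a city and give its population."}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "city_info",
        "schema": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"}
          },
          "required": ["city", "population"]
        }
      }
    }
  }'
```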