LLProxy was designed to manage rate limits and schedule workloads across multiple LLM-based applications. The rate limits for these services are complex, beyond what can easily be configured with the simplest of reverse proxies. LLProxy addresses this by providing a scheduler that deeply understands the rate limiting behavior of the core LLM providers.
- The following providers are currently supported: [openai]
- The following scheduling is currently supported: [FIFO]
- Set up your configuration file:
cp config-example.json config.json
Each provider can be defined as a specific route.
config.json
{
  "routes": {
    "openai": {
      "forward": "https://api.openai.com",
      "provider": "openai",
      "models": {
        "gpt-4": {
          "maxQueueSize": 10,
          "maxQueueWait": 30,
          "rpm": 200,
          "tpm": 40000
        },
        ...
      }
    }
    ...
  }
}
The above creates a route http://proxyhost:8080/openai/... that routes all traffic sent to that route to https://api.openai.com/...
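The route mapping described above can be sketched in a few lines of Python. This is only an illustration of the prefix-to-upstream idea, not LLProxy's actual code: `resolve_upstream` is a hypothetical helper, and the assumption that the first path segment selects the route is inferred from the example URLs.

```python
import json

# Same shape as the config.json example, trimmed to a single model.
CONFIG = json.loads("""
{
  "routes": {
    "openai": {
      "forward": "https://api.openai.com",
      "provider": "openai",
      "models": {
        "gpt-4": {"maxQueueSize": 10, "maxQueueWait": 30, "rpm": 200, "tpm": 40000}
      }
    }
  }
}
""")


def resolve_upstream(config, request_path):
    """Hypothetical helper: map a proxy request path to its upstream URL.

    Returns None when the first path segment is not a configured route.
    """
    # "/openai/v1/chat/completions" -> route "openai", rest "v1/chat/completions"
    route_name, _, rest = request_path.lstrip("/").partition("/")
    route = config["routes"].get(route_name)
    if route is None:
        return None
    return route["forward"].rstrip("/") + "/" + rest


print(resolve_upstream(CONFIG, "/openai/v1/chat/completions"))
# prints https://api.openai.com/v1/chat/completions
```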
It further defines a scheduler for the gpt-4 model that sets:
- maxQueueSize defines how many requests are allowed to sit in the queue prior to being scheduled
- maxQueueWait defines how long, in seconds, it will allow a request to wait before it starts rejecting additional requests with RateLimit errors
- rpm the maximum requests per minute
- tpm the maximum tokens per minute
Requests and tokens per minute are consumed as requests come in and recover over time. If a request cannot be immediately processed then it will sit in the queue for up to maxQueueWait seconds, and up to maxQueueSize items can be outstanding in the queue.
Set a config for every model you want to support.
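To make "consumed as requests come in and recover over time" concrete, here is a minimal token-bucket style sketch in Python. It is an assumption about how such budgets could behave, not a description of LLProxy's internals: the class names `MinuteBudget` and `ModelLimiter`, the continuous (linear) refill, and the idea that the caller already knows a request's token count are all hypothetical. The clock is injectable so the recovery can be shown without sleeping.

```python
import time


class MinuteBudget:
    """Hypothetical per-minute budget that refills continuously."""

    def __init__(self, per_minute, clock=time.monotonic):
        self.capacity = float(per_minute)
        self.available = float(per_minute)
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        # Recover capacity in proportion to elapsed time, capped at the limit.
        recovered = (now - self.last) * self.capacity / 60.0
        self.available = min(self.capacity, self.available + recovered)
        self.last = now

    def can_take(self, amount):
        self._refill()
        return self.available >= amount

    def take(self, amount):
        self.available -= amount


class ModelLimiter:
    """Admits a request only when both the rpm and tpm budgets can cover it."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.requests = MinuteBudget(rpm, clock)
        self.tokens = MinuteBudget(tpm, clock)

    def try_admit(self, token_count):
        if self.requests.can_take(1) and self.tokens.can_take(token_count):
            self.requests.take(1)
            self.tokens.take(token_count)
            return True
        return False  # a real proxy would queue the request at this point


# Usage with a fake clock, using the gpt-4 limits from the example config.
now = [0.0]
limiter = ModelLimiter(rpm=200, tpm=40000, clock=lambda: now[0])
print(limiter.try_admit(30000))  # True: 10000 tokens remain
print(limiter.try_admit(30000))  # False: not enough tokens left
now[0] += 30.0                   # 30 s recovers 20000 tokens
print(limiter.try_admit(30000))  # True
```

In this sketch a request that is not admitted would be the one that waits in the queue, subject to maxQueueWait and maxQueueSize.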
- [Optional] Run tests
./test.sh
- [Optional] Look at code coverage
go tool cover -html=coverage.out -o coverage.html
- Build the application
./build.sh
- Run the application
./llproxy
- Direct traffic to your proxy server
import openai
openai.api_base = 'http://<your-proxy-address>:8080/openai/v1'
...