text-embeddings-inference updated example trussless #386
Open

michaelfeil wants to merge 9 commits into main from mf-tei-updated-example (+217 −1,429)

Commits (9):
- ba73906 add custom_server example to TEI (michaelfeil)
- 23a0d7b push: updated config (michaelfeil)
- 15b22c8 update: docker files (michaelfeil)
- ddb4852 dockerfile: rm typo (michaelfeil)
- e01c056 Update config.yaml (michaelfeil)
- 5c96709 tei refactor done (michaelfeil)
- ac27419 readme (michaelfeil)
- 1f8d831 update readme (michaelfeil)
- 67803bd update readme and config (michaelfeil)
@@ -0,0 +1,50 @@
```yaml
model_metadata:
  tags:
  - openai-compatible
model_name: briton-spec-dec
python_version: py310
requirements: []
resources:
  accelerator: A10G
  cpu: '1'
  memory: 24Gi
  use_gpu: true
runtime:
  predict_concurrency: 1000
secrets:
  hf_access_token: None
trt_llm:
  draft:
    build:
      base_model: deepseek
      checkpoint_repository:
        repo: deepseek-ai/deepseek-coder-1.3b-instruct
        source: HF
      max_seq_len: 10000
      plugin_configuration:
        use_paged_context_fmha: true
      tensor_parallel_count: 1
    runtime:
      batch_scheduler_policy: max_utilization
      enable_chunked_context: true
      kv_cache_free_gpu_mem_fraction: 0.6
      num_draft_tokens: 4
  target:
    build:
      base_model: deepseek
      checkpoint_repository:
        repo: deepseek-ai/deepseek-coder-1.3b-instruct
        source: HF
      max_draft_len: 10
      max_seq_len: 10000
      plugin_configuration:
        use_paged_context_fmha: true
      speculative_decoding_mode: DRAFT_TOKENS_EXTERNAL
      tensor_parallel_count: 1
    runtime:
      batch_scheduler_policy: max_utilization
      enable_chunked_context: true
      kv_cache_free_gpu_mem_fraction: 0.65
      request_default_max_tokens: 1000
      total_token_limit: 500000
```
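
Since the config tags the deployment as `openai-compatible`, it can be called with the standard OpenAI client. A minimal sketch, not part of this diff: the `model-xxx` URL and the exact base path are placeholder assumptions, not confirmed by the PR.

```python
import os

from openai import OpenAI  # assumes the `openai` package is installed

# Placeholder base URL: substitute your deployment's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-xxx.api.baseten.co/environments/production/sync/v1",
)

# Draft/target speculative decoding happens server-side and is transparent
# to callers: this is an ordinary chat completion request.
resp = client.chat.completions.create(
    model="briton-spec-dec",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```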
@@ -0,0 +1,9 @@
```dockerfile
ARG TAG=1.6
# This image builds a truss-compatible image with the text-embeddings-inference image as base.
# It mainly requires python3.
# Optionally, git and git-lfs are installed to allow easy cloning of Hugging Face model repos.
FROM ghcr.io/huggingface/text-embeddings-inference:${TAG}
RUN apt-get update && apt-get install -y python3 python3-pip git git-lfs
RUN git lfs install
ENTRYPOINT ["text-embeddings-router"]
CMD ["--json-output"]
```
@@ -0,0 +1,28 @@
```bash
#!/bin/bash
set -e

# Map architectures to prefixes
declare -A ARCHES=(
  ["cpu"]="cpu-"
  ["turing"]="turing-"
  ["ampere80"]=""
  ["ampere86"]="86-"
  ["adalovelace"]="89-"
  ["hopper"]="hopper-"
)

# Define version and target
VERSION="1.6"
TARGET="baseten/text-embeddings-inference-mirror"

# Build and push images
for ARCH in "${!ARCHES[@]}"; do
  ARCH_PREFIX=${ARCHES[$ARCH]}
  TAG="${TARGET}:${ARCH_PREFIX}${VERSION}"

  echo "Building and pushing image for $ARCH: $TAG"

  docker buildx build -t "$TAG" --build-arg TAG="${ARCH_PREFIX}${VERSION}" --push .
done

echo "All images have been built and pushed."
```
@@ -1,32 +1,112 @@

# Text Embeddings Inference Truss

This is a Trussless Custom Server example that deploys [text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference), a high-performance server for text-embedding, reranking, and classification models.

## Deployment

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`
3. [Required for gated models] Retrieve your Hugging Face token from the [settings](https://huggingface.co/settings/tokens). Set your Hugging Face token as a Baseten secret [here](https://app.baseten.co/settings/secrets) with the key `hf_access_key`.

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd truss-examples/text-embeddings-inference
```

With `text-embeddings-inference` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
```

## Performance Optimization

The `config.yaml` contains a few variables that can be tuned, depending on:
- which GPU is used
- which model is deployed
- how many concurrent requests users send

The deployment example is for BERT-large on an NVIDIA L4. BERT-large has a maximum sequence length of 512 tokens per sentence. For the BERT-large architecture on the L4, there are only marginal gains above a batch size of 16,000 tokens.

### Concurrent requests
```
--max-concurrent-requests 40
# and
runtime:
  predict_concurrency: 40
```
These settings control the number of parallel `POST` requests. In this case we allow 40 parallel requests to be handled per replica, which lets the server batch requests from multiple users together and reach high token counts. Even 40 parallel requests with a single sequence each could fully utilize the GPU: `1*40*512 = 20480` tokens.
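
As an illustration (not part of the example), here is a client-side sketch that sends 40 single-sentence requests in parallel so the server can batch them. The URL is the same placeholder used in the `curl` example below, and the response parsing assumes the OpenAI-style `/v1/embeddings` response shape.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://model-xxx.api.baseten.co/environments/production/predict"  # placeholder
HEADERS = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

def embed(text: str) -> list[float]:
    # One single-sequence request; the server batches concurrent requests
    # up to --max-concurrent-requests / predict_concurrency.
    resp = requests.post(URL, headers=HEADERS, json={"input": text})
    resp.raise_for_status()
    # Assumes the OpenAI-style response shape of /v1/embeddings.
    return resp.json()["data"][0]["embedding"]

texts = [f"sentence number {i}" for i in range(40)]
with ThreadPoolExecutor(max_workers=40) as pool:
    embeddings = list(pool.map(embed, texts))
print(f"embedded {len(embeddings)} sentences")
```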

### Tokens per batch
```
--max-batch-tokens 32768
```

This sets the total number of tokens in a batch. For embedding models, this determines the VRAM usage. As most of TEI's models use a `nested` attention implementation, `32768` tokens could mean `64 sentences of 512 tokens` or `512 sentences of 64 tokens`. While the first takes slightly longer to compute, the peak VRAM usage stays roughly the same. For `llama`- or `mistral`-based `7b` embedding models, we recommend a lower setting, e.g.
```
--max-batch-tokens 8192
```
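
As a back-of-the-envelope sketch of that trade-off (the sequence lengths are illustrative assumptions, not measurements):

```python
# Sentences that fit in one server-side batch for a given token budget.
def sentences_per_batch(max_batch_tokens: int, tokens_per_sentence: int) -> int:
    return max_batch_tokens // tokens_per_sentence

print(sentences_per_batch(32768, 512))  # 64 long sentences per batch
print(sentences_per_batch(32768, 64))   # 512 short sentences per batch
print(sentences_per_batch(8192, 512))   # 16 sentences under the 7b-model budget
```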

### Client batch size
```
--max-client-batch-size 32
```
Client batch size determines the number of sentences in a single request. Increase it if clients cannot send multiple concurrent requests, or if clients need larger request sizes.
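
For example, one request can carry up to 32 sentences under this setting. A minimal sketch, using the same placeholder URL as above:

```python
import os

import requests

texts = [f"document {i}" for i in range(32)]  # up to --max-client-batch-size
resp = requests.post(
    "https://model-xxx.api.baseten.co/environments/production/predict",  # placeholder
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"input": texts},
)
resp.raise_for_status()
print(len(resp.json()["data"]))  # one embedding per input sentence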

### Endpoint, Model Selection, and OpenAPI
Change `predict_endpoint` to `/rerank` or `/predict` if you want to use the rerank or predict endpoint.

Embedding models. Example supported models: https://huggingface.co/models?pipeline_tag=feature-extraction&other=text-embeddings-inference&sort=trending
```yaml
predict_endpoint: /v1/embeddings
```
Rerank models. Example models: https://huggingface.co/models?pipeline_tag=text-classification&other=text-embeddings-inference&sort=trending
```yaml
predict_endpoint: /rerank
```
Classification models. Example classification model: https://huggingface.co/SamLowe/roberta-base-go_emotions
```yaml
predict_endpoint: /predict
```
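
The request body differs per endpoint. Here is a sketch of the rerank and classification payloads, following TEI's documented API (field names per TEI's OpenAPI spec; the URL is a placeholder, and each payload only applies when `predict_endpoint` is set accordingly):

```python
import os

import requests

url = "https://model-xxx.api.baseten.co/environments/production/predict"  # placeholder
headers = {"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"}

# With predict_endpoint: /rerank -- score candidate texts against a query.
rerank = requests.post(
    url,
    headers=headers,
    json={"query": "What is deep learning?",
          "texts": ["Deep learning is...", "Cheese is..."]},
)
print(rerank.json())  # e.g. [{"index": 0, "score": ...}, ...]

# With predict_endpoint: /predict -- classify an input.
classify = requests.post(url, headers=headers, json={"inputs": "I love this!"})
print(classify.json())  # e.g. [{"label": ..., "score": ...}, ...]
```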

## Call your model

### curl

To generate embeddings, you can use the following command:
```bash
curl -X POST https://model-xxx.api.baseten.co/development/predict \
  -H "Authorization: Api-Key YOUR_API_KEY" \
  -d '{"input": "text string"}'
```

### Python `requests` library

```python
import os
import requests

resp = requests.post(
    "https://model-xxx.api.baseten.co/environments/production/predict",
    headers={"Authorization": f"Api-Key {os.environ['BASETEN_API_KEY']}"},
    json={"input": ["text string", "second string"]},
)

print(resp.json())
```

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.
@@ -1,25 +1,30 @@

```yaml
base_image:
  # select an image: L4
  # CPU                              baseten/text-embeddings-inference-mirror:cpu-1.6
  # Turing (T4, ...)                 baseten/text-embeddings-inference-mirror:turing-1.6
  # Ampere 80 (A100, A30)            baseten/text-embeddings-inference-mirror:1.6
  # Ampere 86 (A10, A10G, A40, ...)  baseten/text-embeddings-inference-mirror:86-1.6
  # Ada Lovelace (L4, ...)           baseten/text-embeddings-inference-mirror:89-1.6
  # Hopper (H100/H100 40GB)          baseten/text-embeddings-inference-mirror:hopper-1.6
  image: baseten/text-embeddings-inference-mirror:89-1.6
model_metadata:
  repo_id: BAAI/bge-base-en-v1.5
docker_server:
  start_command: sh -c "text-embeddings-router --port 7997 --model-id /data/local-model --max-client-batch-size 32 --max-concurrent-requests 40 --max-batch-tokens 32768"
  readiness_endpoint: /health
  liveness_endpoint: /health
  # change to /rerank or /predict if you want to use the rerank or predict endpoint
  # https://huggingface.github.io/text-embeddings-inference/
  predict_endpoint: /v1/embeddings
  server_port: 7997
resources:
  accelerator: L4
  use_gpu: true
model_name: text-embeddings-inference trussless
build_commands: # optional step to download the weights of the model into the image
  - git clone https://huggingface.co/BAAI/bge-base-en-v1.5 /data/local-model
runtime:
  predict_concurrency: 40
environment_variables:
  VLLM_LOGGING_LEVEL: WARNING
  hf_access_token: null
```
(This PR also adds one empty file and deletes one file; contents not shown.)
Review comment: Bert => BERT