Refactor & improve stability #4

Merged: 32 commits, Jan 28, 2025
Commits
98dca3c
refactor: use non-quantized default model by default & rename Embeddi…
t83714 Jan 7, 2025
eb4e5f1
update default config value
t83714 Jan 7, 2025
3686785
fixed serverOptions weren't passed through properly
t83714 Jan 7, 2025
c1deeae
- upgrade to @huggingface/transformers 3.2.4
t83714 Jan 7, 2025
5570ed8
fixed build issue
t83714 Jan 7, 2025
8a2449b
v1.1.0-alpha.0
t83714 Jan 7, 2025
db80b70
fixed docker build
t83714 Jan 7, 2025
a833509
Merge branch 'refactor' into release/v1.1.0-alpha.0
t83714 Jan 7, 2025
72fecee
adjust docker build logic to fix sharp installation issue
t83714 Jan 7, 2025
0ea1e49
fix sharp installation in docker
t83714 Jan 7, 2025
bb2bb26
- avoid including unused models in docker images
t83714 Jan 13, 2025
8a58ab0
fixed: .cache folder doesn't have write permission
t83714 Jan 14, 2025
211a160
limit request processing concurrency to 1
t83714 Jan 14, 2025
0c416a0
change default model to q8 quantized version
t83714 Jan 14, 2025
19e3a0f
skip checking embeddingEncoder ready status in readiness probe
t83714 Jan 14, 2025
eb73161
test cases adjustment
t83714 Jan 14, 2025
1d2d818
use worker pool
t83714 Jan 15, 2025
aef657a
fixed broken startup probe
t83714 Jan 15, 2025
ae46442
make sure encode is only called when worker ready
t83714 Jan 15, 2025
ebfc925
set minWorker
t83714 Jan 15, 2025
0099ebe
print debug info
t83714 Jan 15, 2025
c7e6077
print debug info when DEBUG env var set to "true"
t83714 Jan 15, 2025
a5f35b3
use separate session to process string array items
t83714 Jan 15, 2025
e62de4f
clean up code
t83714 Jan 15, 2025
6a19b22
make maxWorkers, minWorkers & workerTaskTimeout configurable
t83714 Jan 15, 2025
73fc76d
move waitTillReady call out of encode function in worker to avoid fut…
t83714 Jan 16, 2025
bdc759f
- set default `workerTaskTimeout` to 60 seconds
t83714 Jan 21, 2025
6f461d0
increase the default memory limit to 2G
t83714 Jan 24, 2025
b5ed686
- set max_length of default model to 1024 due to excessive memory usa…
t83714 Jan 24, 2025
c9d1b8a
tokenizer: only use padding for multiple inputs are received
t83714 Jan 24, 2025
1f70046
set workerTaskTimeout in package.json start script
t83714 Jan 24, 2025
2226460
update changes.md
t83714 Jan 28, 2025
CHANGES.md (20 changes: 19 additions & 1 deletion)
@@ -1,5 +1,23 @@
# CHANGELOG

## v1.1.0

- Rename EmbeddingGenerator to EmbeddingEncoder
- Fixed serverOptions not being passed through properly in test cases
- Upgrade to @huggingface/transformers v3.2.4
- Upgrade to onnxruntime-node v1.20.1
- Avoid including unused models in docker images (smaller image size)
- Increase probe timeout seconds
- Use worker pool
- Process sentence list with separate model runs
- Set default `workerTaskTimeout` to `60` seconds
- Use the quantized (q8) version of the default model
- Set default `limits.memory` to `850M`
- Set default replicas number to `2`
- Add max_length config to model config (configurable via helm config)
- Set max_length of the default model to 1024 due to excessive memory usage when working on text longer than 2048 tokens (the default model supports up to 8192)
- Only apply padding when multiple inputs are received for encoding

## v1.0.0

- #1: Initial implementation
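
For illustration, the model-loading entries above (switching to @huggingface/transformers v3, using the q8-quantized default model, and applying pooling/normalization during extraction) correspond roughly to the following API usage; this is an editorial sketch with example values, not code from this PR.

```ts
import { pipeline } from "@huggingface/transformers";

// Sketch: load the default model's q8 (8-bit quantized) ONNX weights.
// transformers.js v3 replaces the old boolean `quantized` flag with `dtype`.
const extractor = await pipeline(
  "feature-extraction",
  "Alibaba-NLP/gte-base-en-v1.5",
  { dtype: "q8" }
);

// Encode one input; padding only matters when a batch of strings is passed.
const embedding = await extractor("some text to embed", {
  pooling: "mean",   // example extraction_config values
  normalize: true
});
```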
Dockerfile (10 changes: 7 additions & 3 deletions)
@@ -8,12 +8,16 @@ RUN mkdir -p /usr/src/app /etc/config && \
chmod -R g=u /usr/src/app /etc/config

COPY . /usr/src/app
# make local cache folder writable by 1000 user
RUN chown -R 1000 /usr/src/app/component/node_modules/@xenova/transformers/.cache
# Reinstall onnxruntime-node based on current building platform architecture
RUN cd /usr/src/app/component/node_modules/onnxruntime-node && npm run postinstall
# Reinstall sharp based on current building platform architecture
RUN cd /usr/src/app/component/node_modules/sharp && npm run clean && node install/libvips && node install/dll-copy && node ../prebuild-install/bin.js
RUN cd /usr/src/app/component/node_modules && rm -Rf @img && cd sharp && npm install
# remove downloaded model cache
RUN cd /usr/src/app/component/node_modules/@huggingface/transformers && rm -Rf .cache && mkdir .cache
# download default model
RUN cd /usr/src/app/component && npm run download-default-model
# make local cache folder writable by 1000 user
RUN chown -R 1000 /usr/src/app/component/node_modules/@huggingface/transformers/.cache

USER 1000
WORKDIR /usr/src/app/component
README.md (13 changes: 8 additions & 5 deletions)
Expand Up @@ -49,7 +49,7 @@ Kubernetes: `>= 1.21.0`
| appConfig | object | `{}` | Application configuration of the service. You can supply a list of key-value pairs to be used as the application configuration. Currently, the only supported config field is `modelList`. Via the `modelList` field, you can specify a list of LLM models that the service supports. Although you can specify multiple models, only one model will be used at this moment. Each model item has the following fields: <ul> <li> `name` (string): The huggingface registered model name. We only support ONNX models at this moment. This field is required. </li> <li> `default` (bool): Optional; Whether this model is the default model. If not specified, the first model in the list will be the default model. Only the default model will be loaded. </li> <li> `quantized` (bool): Optional; Whether the quantized version of the model will be used. If not specified, the quantized version of the model will be loaded. </li> <li> `config` (object): Optional; The configuration object that will be passed to the model. </li> <li> `cache_dir` (string): Optional; The cache directory of the downloaded models. If not specified, the default cache directory will be used. </li> <li> `local_files_only` (bool): Optional; Whether to only load the model from local files. If not specified, the model will be downloaded from the huggingface model hub. </li> <li> `revision` (string): Optional, Default to 'main'; The specific model version to use. It can be a branch name, a tag name, or a commit id. Since we use a git-based system for storing models and other artifacts on huggingface.co, `revision` can be any identifier allowed by git. NOTE: This setting is ignored for local requests. </li> <li> `model_file_name` (string): Optional; </li> <li> `extraction_config` (object): Optional; The configuration object that will be passed to the model extraction function for embedding generation. <br/> <ul> <li> `pooling`: ('none' or 'mean' or 'cls') Default to 'none'. The pooling method to use. </li> <li> `normalize`: (bool) Default to true. Whether or not to normalize the embeddings in the last dimension. </li> <li> `quantize`: (bool) Default to `false`. Whether or not to quantize the embeddings. </li> <li> `precision`: ("binary" or "ubinary") Default to "binary". The precision to use for quantization. Only used when `quantize` is true. </li> </ul> </li> </ul> Please note: The released docker image only contains the "Alibaba-NLP/gte-base-en-v1.5" model. If you specify other models, the server will download the model from the huggingface model hub at startup. You might want to adjust the `startupProbe` settings to accommodate the model downloading time. Depending on the model size, you might also want to adjust the `resources.limits.memory` & `resources.requests.memory` values. An illustrative `appConfig` example is shown after this table. |
| autoscaling.hpa.enabled | bool | `false` | |
| autoscaling.hpa.maxReplicas | int | `3` | |
| autoscaling.hpa.minReplicas | int | `1` | |
| autoscaling.hpa.minReplicas | int | `2` | |
| autoscaling.hpa.targetCPU | int | `90` | |
| autoscaling.hpa.targetMemory | string | `""` | |
| bodyLimit | int | Default to 10485760 (10MB). | Defines the maximum payload, in bytes, that the server is allowed to accept |
@@ -79,9 +79,11 @@ Kubernetes: `>= 1.21.0`
| livenessProbe.successThreshold | int | `1` | |
| livenessProbe.timeoutSeconds | int | `5` | |
| logLevel | string | `"warn"` | The log level of the application. one of 'fatal', 'error', 'warn', 'info', 'debug', 'trace'; also 'silent' is supported to disable logging. Any other value defines a custom level and requires supplying a level value via levelVal. |
| maxWorkers | int | Default to 1. | The maximum number of workers that run the model to serve the request. |
| minWorkers | int | Default to 1. | The minimum number of workers that run the model to serve the request. |
| nameOverride | string | `""` | |
| nodeSelector | object | `{}` | |
| pluginTimeout | int | Default to 10000 (10 seconds). | The maximum amount of time in milliseconds in which a fastify plugin can load. If not, ready will complete with an Error with code 'ERR_AVVIO_PLUGIN_TIMEOUT'. |
| pluginTimeout | int | Default to 180000 (180 seconds). | The maximum amount of time in milliseconds in which a fastify plugin can load. If a plugin does not load within this time, ready will complete with an Error with code 'ERR_AVVIO_PLUGIN_TIMEOUT'. |
| podAnnotations | object | `{}` | |
| podSecurityContext.runAsNonRoot | bool | `true` | |
| podSecurityContext.runAsUser | int | `1000` | |
@@ -97,10 +99,10 @@ Kubernetes: `>= 1.21.0`
| readinessProbe.periodSeconds | int | `20` | |
| readinessProbe.successThreshold | int | `1` | |
| readinessProbe.timeoutSeconds | int | `5` | |
| replicas | int | `1` | |
| resources.limits.memory | string | `"1100M"` | the memory limit of the container Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size. When change the default model, be sure to test the peak memory usage of the service before setting the memory limit. quantized model will be used by default, the memory limit is set to 1100M to accommodate the default model size. |
| replicas | int | `2` | |
| resources.limits.memory | string | `"850M"` | the memory limit of the container. Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size. When changing the default model, be sure to test the peak memory usage of the service before setting the memory limit. When testing your model's memory requirement, please note that the memory usage of the model often goes much higher with long context lengths. E.g. the default model supports up to 8192 tokens (default max_length set to 1024), but when the content goes beyond 512 tokens, the memory usage will be much higher (requires around 2G). |
| resources.requests.cpu | string | `"100m"` | |
| resources.requests.memory | string | `"650M"` | the memory request of the container Once the model is loaded, the memory usage of the service for serving request would be much lower. Set to 650M for default model. |
| resources.requests.memory | string | `"650M"` | the memory request of the container. Once the model is loaded, the memory usage of the service for serving requests would be much lower. Set to 850M for default model. |
| service.annotations | object | `{}` | |
| service.httpPortName | string | `"http"` | |
| service.labels | object | `{}` | |
@@ -120,6 +122,7 @@ Kubernetes: `>= 1.21.0`
| startupProbe.timeoutSeconds | int | `5` | |
| tolerations | list | `[]` | |
| topologySpreadConstraints | list | `[]` | This is the pod topology spread constraints https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ |
| workerTaskTimeout | int | Default to 60000 (60 seconds). | The maximum time in milliseconds that a worker can run before being killed. |
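
For illustration, an `appConfig` following the table above might look like the sketch below; the model name and option values are examples only, and the `dtype` / `max_length` fields follow the pattern used in `deploy/test-deploy.yaml` further down.

```yaml
# Illustrative Helm values snippet (example only, not part of this PR's diff):
appConfig:
  modelList:
    - name: Alibaba-NLP/gte-base-en-v1.5
      default: true
      dtype: "q8"          # load the 8-bit quantized ONNX weights
      max_length: 1024     # cap input length to limit peak memory usage
      extraction_config:
        pooling: "mean"    # illustrative; pick the pooling your model expects
        normalize: true
```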

### Build & Run for Local Development

deploy/magda-embedding-api/Chart.yaml (2 changes: 1 addition & 1 deletion)
@@ -1,7 +1,7 @@
apiVersion: v2
name: magda-embedding-api
description: An OpenAI embeddings API compatible microservice for Magda.
version: "1.0.0"
version: "1.1.0-alpha.0"
kubeVersion: ">= 1.21.0"
home: "https://github.com/magda-io/magda-embedding-api"
sources: [ "https://github.com/magda-io/magda-embedding-api" ]
deploy/magda-embedding-api/templates/deployment.yaml (6 changes: 6 additions & 0 deletions)
@@ -105,6 +105,12 @@ spec:
- "start"
- "dist/app.js"
- "--"
- "--minWorkers"
- {{ .Values.minWorkers | quote }}
- "--maxWorkers"
- {{ .Values.maxWorkers | quote }}
- "--workerTaskTimeout"
- {{ .Values.workerTaskTimeout | quote }}
- "--appConfigFile"
- "/etc/config/appConfig.json"
{{- if .Values.readinessProbe }}
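
For reference, with the chart's default values the templated arguments above should render roughly like this (illustrative rendering only):

```yaml
# Approximate rendered container args with the default values:
args:
  - "start"
  - "dist/app.js"
  - "--"
  - "--minWorkers"
  - "1"
  - "--maxWorkers"
  - "1"
  - "--workerTaskTimeout"
  - "60000"
  - "--appConfigFile"
  - "/etc/config/appConfig.json"
```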
deploy/magda-embedding-api/values.yaml (27 changes: 20 additions & 7 deletions)
@@ -14,8 +14,8 @@ logLevel: "warn"

# -- (int) The maximum amount of time in milliseconds in which a fastify plugin can load.
# If a plugin does not load within this time, ready will complete with an Error with code 'ERR_AVVIO_PLUGIN_TIMEOUT'.
# @default -- Default to 10000 (10 seconds).
pluginTimeout:
# @default -- Default to 180000 (180 seconds).
pluginTimeout: 180000

# -- (int) Defines the maximum payload, in bytes, that the server is allowed to accept
# @default -- Default to 10485760 (10MB).
@@ -57,6 +57,18 @@ closeGraceDelay: 25000
# Depends on the model size, you might also want to adjust the `resources.limits.memory` & `resources.requests.memory`value.
appConfig: {}

# -- (int) The maximum number of workers that run the model to serve the request.
# @default -- Default to 1.
maxWorkers: 1

# -- (int) The minimum number of workers that run the model to serve the request.
# @default -- Default to 1.
minWorkers: 1

# -- (int) The maximum time in milliseconds that a worker can run before being killed.
# @default -- Default to 60000 (60 seconds).
workerTaskTimeout: 60000

# image setting loading order: (from higher priority to lower priority)
# - Values.image.x
# - Values.defaultImage.x
@@ -76,12 +88,12 @@ defaultImage:
nameOverride: ""
fullnameOverride: ""

replicas: 1
replicas: 2

autoscaling:
hpa:
enabled: false
minReplicas: 1
minReplicas: 2
maxReplicas: 3
targetCPU: 90
targetMemory: ""
@@ -191,11 +203,12 @@ resources:
requests:
cpu: "100m"
# -- (string) the memory request of the container
# Once the model is loaded, the memory usage of the service for serving request would be much lower. Set to 650M for default model.
# Once the model is loaded, the memory usage of the service for serving requests would be much lower. Set to 850M for default model.
memory: "650M"
limits:
# -- (string) the memory limit of the container
# Due to [this issue of ONNX runtime](https://github.com/microsoft/onnxruntime/issues/15080), the peak memory usage of the service is much higher than the model file size.
# When changing the default model, be sure to test the peak memory usage of the service before setting the memory limit.
# quantized model will be used by default, the memory limit is set to 1100M to accommodate the default model size.
memory: "1100M"
# When testing your model's memory requirement, please note that the memory usage of the model often goes much higher with long context lengths.
# E.g. the default model supports up to 8192 tokens (default max_length set to 1024), but when the content goes beyond 512 tokens, the memory usage will be much higher (requires around 2G).
memory: "850M"
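
If the defaults above need tuning for a particular deployment, a minimal Helm values override might look like the following sketch; the numbers are illustrative, apart from the roughly 2G figure quoted from the comment above.

```yaml
# Example override file (illustrative values):
minWorkers: 1
maxWorkers: 2               # allow an extra worker for concurrent requests
workerTaskTimeout: 120000   # give long inputs more time before a task is killed
resources:
  requests:
    memory: "650M"
  limits:
    memory: "2G"            # roughly what inputs beyond 512 tokens require, per the note above
```

Such a file would typically be applied with `helm upgrade --install <release> <chart> -f overrides.yaml`.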
deploy/test-deploy.yaml (6 changes: 5 additions & 1 deletion)
@@ -19,7 +19,11 @@ appConfig:
- name: Xenova/bge-small-en-v1.5
# set dtype to choose which model weight variant is loaded
# e.g. "q8" loads the 8-bit quantized version of the model
quantized: true
dtype: "q8"
# optionally set the max length of the input text
# if not set, the value in model config will be used
# if model config does not have max_length, the default value (512) will be used
max_length: 512
extraction_config:
pooling: "mean"
normalize: true
package.json (15 changes: 9 additions & 6 deletions)
@@ -1,7 +1,7 @@
{
"name": "magda-embedding-api",
"type": "module",
"version": "1.0.0",
"version": "1.1.0-alpha.0",
"description": "An OpenAI embeddings API compatible microservice for Magda.",
"main": "app.ts",
"directories": {
@@ -22,7 +22,8 @@
"prebuild": "rimraf dist",
"build": "tsc",
"watch": "tsc -w",
"start": "npm run build && fastify start -l info dist/app.js",
"start": "npm run build && fastify start -T 60000 -l info dist/app.js -- --workerTaskTimeout 10000",
"download-default-model": "node ./dist/loadDefaultModel.js",
"dev": "npm run build && concurrently -k -p \"[{name}]\" -n \"TypeScript,App\" -c \"yellow.bold,cyan.bold\" \"npm:watch\" \"npm:dev:start\"",
"dev:start": "fastify start --ignore-watch=.ts$ -w -l info -P dist/app.js",
"docker-build-local": "create-docker-context-for-node-component --build --push --tag auto --repository example.com",
@@ -41,20 +42,22 @@
"@fastify/autoload": "^5.10.0",
"@fastify/sensible": "^5.6.0",
"@fastify/type-provider-typebox": "^4.0.0",
"@huggingface/transformers": "^3.2.4",
"@sinclair/typebox": "^0.32.34",
"@xenova/transformers": "^2.17.2",
"fastify": "^4.28.1",
"fastify-cli": "^6.2.1",
"fastify-plugin": "^4.5.1",
"fs-extra": "^11.2.0",
"onnxruntime-node": "^1.14.0"
"onnxruntime-node": "^1.20.1",
"workerpool": "^9.2.0"
},
"devDependencies": {
"@langchain/openai": "^0.2.1",
"@langchain/openai": "^0.2.8",
"@magda/ci-utils": "^1.0.5",
"@magda/docker-utils": "^4.2.1",
"@types/fs-extra": "^11.0.4",
"@types/node": "^18.19.3",
"@types/workerpool": "^6.4.7",
"concurrently": "^8.2.2",
"eslint": "^9.6.0",
"fastify-tsconfig": "^2.0.0",
@@ -75,7 +78,7 @@
"exclude": [
"test/helper.ts"
],
"timeout": 120,
"timeout": 12000,
"jobs": 1
},
"config": {
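
The new `workerpool` dependency and the `--workerTaskTimeout` flag wired into the start script suggest a pattern along the following lines; the worker file path and task name here are assumptions for illustration, not the actual module layout.

```ts
import workerpool from "workerpool";

// Sketch: a pool of model workers bounded by minWorkers/maxWorkers.
const pool = workerpool.pool("./dist/worker.js", {
  minWorkers: 1,
  maxWorkers: 1
});

// Each request becomes a pool task; `.timeout()` enforces workerTaskTimeout,
// cancelling the task (and recycling its worker) if encoding takes too long.
async function encode(texts: string[], workerTaskTimeout = 60000): Promise<number[][]> {
  return await pool.exec("encode", [texts]).timeout(workerTaskTimeout);
}
```

On the worker side, the script would register its handler with `workerpool.worker({ encode: ... })`.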
src/app.ts (6 changes: 5 additions & 1 deletion)
@@ -9,12 +9,16 @@ const __dirname = path.dirname(__filename);

export type AppOptions = {
appConfigFile: string;
maxWorkers: number;
minWorkers: number;
// Place your custom options for app below here.
} & Partial<AutoloadPluginOptions>;

// Pass --options via CLI arguments in command to enable these options.
const options: AppOptions = {
appConfigFile: ""
appConfigFile: "",
maxWorkers: 1,
minWorkers: 1
};

const app: FastifyPluginAsync<AppOptions> = async (