server: init functional tests (ggerganov#5566)
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault ggerganov#5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario

* server: CI GitHub workflow

---------

Co-authored-by: Georgi Gerganov <[email protected]>
1 parent 5617803 · commit 52140fb

Showing 14 changed files with 1,243 additions and 18 deletions.
@@ -0,0 +1,127 @@
# Server build and tests
name: Server

on:
  workflow_dispatch: # allows manual triggering
  push:
    branches:
      - master
      - test/server-add-ci-test # FIXME remove
    paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
  pull_request:
    types: [opened, synchronize, reopened]
    paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']

jobs:
  server:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        build: [noavx, avx2, avx, avx512, cublas, clblast, openblas, kompute, vulkan]
        sanitizer: [ADDRESS, THREAD, UNDEFINED]
        build_type: [Debug, Release]
        include:
          - build: 'noavx'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF'
            image: ubuntu:latest
          - build: 'avx2'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
            image: ubuntu:latest
          - build: 'avx'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX2=OFF'
            image: ubuntu:latest
          - build: 'avx512'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX512=ON'
            image: ubuntu:latest
            experimental: true
          - build: 'cublas'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CUBLAS=ON'
            image: nvidia/cuda:12.3.1-devel-ubuntu22.04
            arch_not_available: true # require nvidia docker engine
          - build: 'clblast'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON'
            image: ubuntu:latest
            arch_not_available: true
          - build: 'openblas'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS'
            image: ubuntu:latest
          - build: 'kompute'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON'
            image: ubuntu:latest
            arch_not_available: true
          - build: 'vulkan'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_VULKAN=ON'
            image: ubuntu:latest
            arch_not_available: true

    container:
      image: ${{ matrix.image }}
      ports:
        - 8888
      options: --cpus 4

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        run: |
          apt-get update
          apt-get -y install \
            build-essential \
            pkg-config \
            git \
            cmake \
            python3-pip \
            wget \
            psmisc
      - name: Download CLBlast
        id: get_clblast
        if: ${{ matrix.build == 'clblast' }}
        run: |
          apt install -y libclblast-dev
      - name: Download OpenBLAS
        id: get_openblas
        if: ${{ matrix.build == 'openblas' }}
        run: |
          apt-get -y install libopenblas-dev
      - name: Install Vulkan SDK
        id: get_vulkan
        if: ${{ matrix.build == 'kompute' || matrix.build == 'vulkan' }}
        run: |
          wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | tee /etc/apt/trusted.gpg.d/lunarg.asc
          wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list http://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
          apt-get update
          apt-get -y install vulkan-sdk
      - name: Build
        id: cmake_build
        run: |
          mkdir build
          cd build
          cmake .. -DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} ${{ matrix.defines }}
          cmake --build . --config ${{ matrix.build_type }} -j $(nproc) --target server
      - name: Tests dependencies
        id: test_dependencies
        run: |
          pip install -r examples/server/tests/requirements.txt
      - name: Download models
        id: download_models
        run: |
          cd examples/server/tests
          ../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf
      - name: Tests
        id: server_integration_test
        continue-on-error: ${{ matrix.experimental || matrix.arch_not_available }}
        run: |
          cd examples/server/tests
          PORT=8888 ./tests.sh
@@ -0,0 +1,46 @@
# Server tests

Python-based server test scenarios using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
* [issues.feature](./features/issues.feature) Pending issues scenarios
* [parallel.feature](./features/parallel.feature) Scenarios involving multiple slots and concurrent requests
* [security.feature](./features/security.feature) Security, CORS and API key
* [server.feature](./features/server.feature) Server base scenarios: completion, embedding, tokenization, etc.

Tests target GitHub workflow job runners with 4 vCPUs.

Requests are made with [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), an [asyncio](https://docs.python.org/fr/3/library/asyncio.html)-based HTTP client.
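For illustration, here is a minimal sketch of the kind of asynchronous request the step definitions issue. It is not taken from `steps.py`; the `request_completion` helper name is made up, while the `/completion` endpoint and the `prompt`/`n_predict` fields follow the llama.cpp server API:

```python
import asyncio

import aiohttp


async def request_completion(base_url, prompt, n_predict=32):
    # Send a completion request to the server under test and return the parsed JSON body.
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{base_url}/completion",
                                json={"prompt": prompt, "n_predict": n_predict}) as response:
            assert response.status == 200
            return await response.json()


print(asyncio.run(request_completion("http://localhost:8080", "Write a joke")))
```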
Note: if the host machine's inference speed is faster than that of the GitHub runners, parallel scenarios may fail randomly. To mitigate this, you can increase the `n_predict` and `kv_size` values used by the scenarios.

### Install dependencies
`pip install -r requirements.txt`

### Run tests
1. Build the server
```shell
cd ../../..
mkdir build
cd build
cmake ../
cmake --build . --target server
```
2. Download the required models:
   1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the tests: `./tests.sh`

It's possible to override some scenario step values with environment variables:
- `PORT` -> `context.server_port`, sets the listening port of the server during the scenario, default: `8080`
- `LLAMA_SERVER_BIN_PATH` -> changes the server binary path, default: `../../../build/bin/server`
- `DEBUG` -> "ON" to enable step and server verbose mode (`--verbose`)
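For example, to run the suite on port 8888 with verbose output: `PORT=8888 DEBUG=ON ./tests.sh`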
### Run @bug, @wip or @wrong_usage annotated scenario

A Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope.
- `@bug` links a scenario to a GitHub issue.
- `@wrong_usage` marks scenarios that demonstrate user issues which are actually expected behavior.
- `@wip` marks a scenario as a work in progress.

To run the scenarios annotated with `@bug`, start:
`DEBUG=ON ./tests.sh --no-skipped --tags bug`

After changing logic in `steps.py`, ensure that the `@bug` and `@wrong_usage` scenarios are still up to date.
@@ -0,0 +1,67 @@
import os
import socket
import subprocess
import time
from contextlib import closing
from signal import SIGKILL


def before_scenario(context, scenario):
    print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m")
    port = 8080
    if 'PORT' in os.environ:
        port = int(os.environ['PORT'])
    # Refuse to run if something is already listening on the target port
    if is_server_listening("localhost", port):
        assert False, "Server already started"


def after_scenario(context, scenario):
    if scenario.status == "failed":
        if 'GITHUB_ACTIONS' in os.environ:
            print(f"\x1b[33;101mSCENARIO FAILED: {scenario.name} server logs:\x1b[0m\n\n")
            if os.path.isfile('llama.log'):
                with closing(open('llama.log', 'r')) as f:
                    for line in f:
                        print(line)
        if not is_server_listening(context.server_fqdn, context.server_port):
            print("\x1b[33;101mERROR: Server stopped listening\x1b[0m")

    if not pid_exists(context.server_process.pid):
        assert False, f"Server not running pid={context.server_process.pid} ..."

    print(f"stopping server pid={context.server_process.pid} ...")
    context.server_process.kill()
    # Wait a bit for the socket to free up
    time.sleep(0.05)

    # If the port is still in use, escalate: SIGKILL the process, then killall as a last resort
    attempts = 0
    while is_server_listening(context.server_fqdn, context.server_port):
        print(f"stopping server pid={context.server_process.pid} ...")
        os.kill(context.server_process.pid, SIGKILL)
        time.sleep(0.1)
        attempts += 1
        if attempts > 5:
            print(f"Server dangling exits, killing all {context.server_path} ...")
            process = subprocess.run(['killall', '-9', context.server_path],
                                     stderr=subprocess.PIPE,
                                     universal_newlines=True)
            print(process)


def is_server_listening(server_fqdn, server_port):
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        result = sock.connect_ex((server_fqdn, server_port))
        return result == 0


def pid_exists(pid):
    """Check whether pid exists in the current process table."""
    import errno
    if pid < 0:
        return False
    try:
        os.kill(pid, 0)
    except OSError as e:
        # EPERM means the process exists but we are not allowed to signal it
        return e.errno == errno.EPERM
    else:
        return True
@@ -0,0 +1,36 @@
# List of ongoing issues
@bug
Feature: Issues
  # Issue #5655
  Scenario: Multi users embeddings
    Given a server listening on localhost:8080
    And a model file stories260K.gguf
    And a model alias tinyllama-2
    And 42 as server seed
    And 64 KV cache size
    And 2 slots
    And continuous batching
    And embeddings extraction
    Then the server is starting
    Then the server is healthy

    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    Given concurrent embedding requests
    Then the server is busy
    Then the server is idle
    Then all embeddings are generated