server: init functional tests #5566

Merged
merged 100 commits into from
Feb 24, 2024
Changes from 84 commits
Commits
157bcf2
server: init functional test
phymbert Feb 18, 2024
9b63d70
server: tests: reduce number of files, all in one tests shell script
phymbert Feb 19, 2024
6497755
server: tests: fix ci workflow
phymbert Feb 19, 2024
4e5245e
server: tests: fix ci workflow
phymbert Feb 19, 2024
30aa323
server: tests: fix ci workflow
phymbert Feb 19, 2024
fe9866a
server: tests: use ngxson llama_xs_q4.bin
phymbert Feb 19, 2024
1680599
server: tests: build only the server
phymbert Feb 19, 2024
8bb586b
server: tests: add health check and concurrent request example
phymbert Feb 20, 2024
6c95ec6
server: tests: change model to: @karpathy's tinyllamas
phymbert Feb 20, 2024
56583be
server: tests: refactor steps and vocabulary
phymbert Feb 20, 2024
9b7ea97
server: tests: add OAI stream test, fix file end of line, fast fail b…
phymbert Feb 20, 2024
11adf1d
server: tests: add OAI multi user scenario
phymbert Feb 20, 2024
c355f76
server: tests: slots endpoint checks
phymbert Feb 20, 2024
367b59a
server: tests: check for infinite loops
phymbert Feb 20, 2024
b9f8390
server: tests: check for infinite loops
phymbert Feb 20, 2024
0772884
server: tests: add a constant seed in completion request
phymbert Feb 20, 2024
6b9dc4f
server: tests: add infinite loop
phymbert Feb 20, 2024
68574c6
server: tests: add infinite loop scenario
phymbert Feb 20, 2024
b0b6d83
server: tests: add infinite loop scenario
phymbert Feb 20, 2024
1ecda0d
server: tests: disable issue 3969 scenario
phymbert Feb 20, 2024
e6d4820
server: tests: add embeddings scenario
phymbert Feb 20, 2024
1065f6d
server: tests: add tokenize/detokenize scenario
phymbert Feb 20, 2024
19664b9
server: tests: detokenize endpoint issue reference added
phymbert Feb 20, 2024
6dcbcfe
server: tests: simplify completion scenario
phymbert Feb 20, 2024
672d98f
server: tests: CORS and api key checks scenario
phymbert Feb 21, 2024
3322bfa
server: tests: add a small check to be sure all started threads have …
phymbert Feb 21, 2024
469af4b
server: tests: change CI workflow trigger
phymbert Feb 21, 2024
2a37bd6
server: tests: fix the multi users infinite loop test
phymbert Feb 21, 2024
f1d4138
server : fix initialization thread issues
ggerganov Feb 21, 2024
600cbeb
server: test: ci change the GitHub workflow trigger
phymbert Feb 21, 2024
68b8d4e
Merge remote-tracking branch 'origin/master' into test/server-add-ci-…
phymbert Feb 21, 2024
6406208
server: tests:
phymbert Feb 21, 2024
01cca66
server: tests: ci fix model download path
phymbert Feb 21, 2024
534998d
server: tests: ci tests.sh exit code
phymbert Feb 21, 2024
a697cd1
minor : fix missing new line
ggerganov Feb 22, 2024
41676d9
ci : actually no reason to exclude GPU code from triggers
ggerganov Feb 22, 2024
016b221
server: fix health/slots endpoint slot state access available race co…
phymbert Feb 22, 2024
e43406e
server: tests: switch to asyncio for concurrent tests, match result c…
phymbert Feb 22, 2024
597c181
server: tests: ci do not take a model anymore, fix trigger patch
phymbert Feb 22, 2024
8b96bda
Merge remote-tracking branch 'origin/master' into test/server-add-ci-…
phymbert Feb 22, 2024
f820e10
server: tests: ci ensure the server is stopped before scenario, and d…
phymbert Feb 22, 2024
aa591ef
server: tests: add Multi users with total number of tokens to predict…
phymbert Feb 22, 2024
26b66c5
server: tests: Fix some random behavior where the wait for busy statu…
phymbert Feb 22, 2024
51f5274
server: tests: ci triggered on any changes on server example path
phymbert Feb 22, 2024
cba6d4e
server: tests: minor fix missing param.
phymbert Feb 22, 2024
1bd07e5
server: tests: assert embeddings are actually computed, make the embe…
phymbert Feb 23, 2024
14b6ede
server: tests: minor color change
phymbert Feb 23, 2024
b38b9e6
server: tests: minor fix server --alias param passed twice
phymbert Feb 23, 2024
70e9055
server: tests: add log in server start to identify why the server doe…
phymbert Feb 23, 2024
2f756f8
server: tests: allow to override the server port before launching tests
phymbert Feb 23, 2024
6a215e5
server: tests: ci adding container to specify server port and allow t…
phymbert Feb 23, 2024
2bb4732
server: tests: ci adding cmake as it is not present by default in ubu…
phymbert Feb 23, 2024
d0e0050
server: tests: ci adding python3-pip as it is not present by default …
phymbert Feb 23, 2024
6e71126
server: tests: ci adding curl as it is not present by default in ubun…
phymbert Feb 23, 2024
6bba3be
server: tests: ci adding psmisc as it is not present by default in ub…
phymbert Feb 23, 2024
5110de0
server: tests: fix coloring console
phymbert Feb 23, 2024
bedf37c
server: tests: reducing n_ctx and n_predict for // prompts as it is t…
phymbert Feb 23, 2024
530d3ae
server: tests: reducing sleep time during scenario
phymbert Feb 23, 2024
36ddb96
server: tests: parallel fix server is started twice, add colors to he…
phymbert Feb 23, 2024
0b0f056
server: tests: ci : build and run tests for all matrix defines, sanit…
phymbert Feb 23, 2024
29f8833
server: tests: ci : fix wget missing
phymbert Feb 23, 2024
12bb797
server: tests: ci : add git
phymbert Feb 23, 2024
68cd1a4
server: tests: ci : matrix cuda
phymbert Feb 23, 2024
86896aa
server: tests: ci : continue on error
phymbert Feb 23, 2024
334902b
server: tests: ci : fix step id duplicated
phymbert Feb 23, 2024
fce2e00
server: tests: ci : fix cuda install
phymbert Feb 23, 2024
e4fb790
server: test: ci fix cuda build
phymbert Feb 23, 2024
2edd995
server: test: ci fix cublas build
phymbert Feb 23, 2024
fa51bac
server: test: ci fix matrix
phymbert Feb 23, 2024
606738e
server: test: ci fix clblast
phymbert Feb 23, 2024
d159e29
server: test: ci fix openblas build
phymbert Feb 23, 2024
13863ef
server: test: ci matrix
phymbert Feb 23, 2024
4d3791a
server: test: ci matrix, experimental on matrix avx512 entry which fa…
phymbert Feb 23, 2024
b94809b
server: test: ci cmake remove all warning as it is done by the classi…
phymbert Feb 23, 2024
5a621e7
server: test: ci make arch not available pass the test
phymbert Feb 23, 2024
54ea4d4
server: test: ax512 experimental
phymbert Feb 23, 2024
5b2ce45
server: test: display server logs in case of failure
phymbert Feb 23, 2024
6dc3af5
server: test: fix CUDA LD PATH
phymbert Feb 23, 2024
83c386f
server: test: ci debug LD path
phymbert Feb 23, 2024
0d380ae
server: test: ci debug CI LD path
phymbert Feb 23, 2024
c75e0e1
server: test: ci switch to nvidia based docker image for cuda
phymbert Feb 23, 2024
2c8bf24
server: test: ci give up with nvidia as it requires the nvidia docker…
phymbert Feb 23, 2024
777bdcf
server: test: ci rename step name to Test, change matrix order for be…
phymbert Feb 23, 2024
e10b83a
server: test: ci rename job name to Server
phymbert Feb 23, 2024
4d27466
server: tests: move all requests call to asyncio
phymbert Feb 23, 2024
1c1fd40
server: tests: allow to pass argument to the test file
phymbert Feb 23, 2024
2109743
server: tests: print server logs only on github action
phymbert Feb 23, 2024
30f802d
server: tests: check if the server has not crashed after a scenario
phymbert Feb 23, 2024
6c0e6f4
server: tests: adding concurrent embedding in issue #5655
phymbert Feb 23, 2024
77b8589
server: tests: linter
phymbert Feb 23, 2024
7183149
server: tests: fix concurrent OAI streaming request
phymbert Feb 23, 2024
2d107ba
server: tests: add a note regarding inference speed.
phymbert Feb 23, 2024
124ca77
server: tests: removing debug print
phymbert Feb 24, 2024
5957a2d
server: tests - allow print on debug
phymbert Feb 24, 2024
482eb30
server: tests - README.md add build instruction and notice on @bug an…
phymbert Feb 24, 2024
60781f0
server: tests - add explanation about KV Cache.
phymbert Feb 24, 2024
a779a4b
server: tests - print only in case of DEBUG
phymbert Feb 24, 2024
a2a928c
server: add link to tests in the README.md
phymbert Feb 24, 2024
5ed4452
server: tests: improved README.md
phymbert Feb 24, 2024
99163c8
github issue template: add link to the tests server framework
phymbert Feb 24, 2024
127 changes: 127 additions & 0 deletions .github/workflows/server-test.yml
@@ -0,0 +1,127 @@
# Server build and tests
name: Server

on:
  workflow_dispatch: # allows manual triggering
  push:
    branches:
      - master
      - test/server-add-ci-test # FIXME remove
    paths: ['.github/workflows/**', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']
  pull_request:
    types: [opened, synchronize, reopened]
    paths: ['**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/**.*']

jobs:
  server:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        build: [noavx, avx2, avx, avx512, cublas, clblast, openblas, kompute, vulkan]
        sanitizer: [ADDRESS, THREAD, UNDEFINED]
        build_type: [Debug, Release]
        include:
          - build: 'noavx'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF'
            image: ubuntu:latest
          - build: 'avx2'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON'
            image: ubuntu:latest
          - build: 'avx'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX2=OFF'
            image: ubuntu:latest
          - build: 'avx512'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_AVX512=ON'
            image: ubuntu:latest
            experimental: true
          - build: 'cublas'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CUBLAS=ON'
            image: nvidia/cuda:12.3.1-devel-ubuntu22.04
            arch_not_available: true # require nvidia docker engine
          - build: 'clblast'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON'
            image: ubuntu:latest
            arch_not_available: true
          - build: 'openblas'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS'
            image: ubuntu:latest
          - build: 'kompute'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_KOMPUTE=ON -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON'
            image: ubuntu:latest
            arch_not_available: true
          - build: 'vulkan'
            defines: '-DLLAMA_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DLLAMA_VULKAN=ON'
            image: ubuntu:latest
            arch_not_available: true

    container:
      image: ${{ matrix.image }}
      ports:
        - 8888
      options: --cpus 4

    steps:
      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Dependencies
        id: depends
        run: |
          apt-get update
          apt-get -y install \
            build-essential \
            pkg-config \
            git \
            cmake \
            python3-pip \
            wget \
            psmisc

      - name: Download CLBlast
        id: get_clblast
        if: ${{ matrix.build == 'clblast' }}
        run: |
          apt install -y libclblast-dev

      - name: Download OpenBLAS
        id: get_openblas
        if: ${{ matrix.build == 'openblas' }}
        run: |
          apt-get -y install libopenblas-dev

      - name: Install Vulkan SDK
        id: get_vulkan
        if: ${{ matrix.build == 'kompute' || matrix.build == 'vulkan' }}
        run: |
          wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | tee /etc/apt/trusted.gpg.d/lunarg.asc
          wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list http://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
          apt-get update
          apt-get -y install vulkan-sdk

      - name: Build
        id: cmake_build
        run: |
          mkdir build
          cd build
          cmake .. -DLLAMA_SANITIZE_${{ matrix.sanitizer }}=ON -DCMAKE_BUILD_TYPE=${{ matrix.build_type }} ${{ matrix.defines }}
          cmake --build . --config ${{ matrix.build_type }} -j $(nproc) --target server

      - name: Tests dependencies
        id: test_dependencies
        run: |
          pip install -r examples/server/tests/requirements.txt

      - name: Download models
        id: download_models
        run: |
          cd examples/server/tests
          ../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf

      - name: Tests
        id: server_integration_test
        continue-on-error: ${{ matrix.experimental || matrix.arch_not_available }}
        run: |
          cd examples/server/tests
          PORT=8888 ./tests.sh
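
To get a sense of how many jobs this workflow schedules, the matrix can be enumerated offline. The snippet below is a minimal sketch and not part of the PR. It assumes the usual GitHub Actions behavior where an `include` entry whose `build` value matches an existing combination only adds keys to it instead of creating a new job. The `expand_matrix` helper and the `continue_on_error` key are hypothetical names used for illustration.

```python
from itertools import product

# Axis values copied from the workflow's `strategy.matrix`.
builds = ["noavx", "avx2", "avx", "avx512", "cublas",
          "clblast", "openblas", "kompute", "vulkan"]
sanitizers = ["ADDRESS", "THREAD", "UNDEFINED"]
build_types = ["Debug", "Release"]

# Flags copied from the `include` entries.
experimental = {"avx512"}
arch_not_available = {"cublas", "clblast", "kompute", "vulkan"}


def expand_matrix():
    """Hypothetical helper: one dict per job the matrix would schedule."""
    jobs = []
    for build, sanitizer, build_type in product(builds, sanitizers, build_types):
        jobs.append({
            "build": build,
            "sanitizer": sanitizer,
            "build_type": build_type,
            # Mirrors `matrix.experimental || matrix.arch_not_available`.
            "continue_on_error": build in experimental or build in arch_not_available,
        })
    return jobs


jobs = expand_matrix()
tolerated = [job for job in jobs if job["continue_on_error"]]
print(f"{len(jobs)} jobs, {len(tolerated)} allowed to fail")
```

Under these assumptions the matrix yields 9 x 3 x 2 = 54 jobs, and the five flagged builds account for 30 of them, so only 24 jobs can actually fail the Tests step.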
36 changes: 18 additions & 18 deletions examples/server/server.cpp
@@ -1410,11 +1410,6 @@ struct llama_server_context
                 int n_processing_slots = 0;

                 for (llama_client_slot &slot: slots) {
-                    if (slot.available()) {
-                        n_idle_slots++;
-                    } else {
-                        n_processing_slots++;
-                    }
                     json slot_data = get_formated_generation(slot);
                     slot_data["id"] = slot.id;
                     slot_data["task_id"] = slot.task_id;
@@ -1429,6 +1424,11 @@ struct llama_server_context
                         {"stopped_limit", slot.stopped_limit},
                         {"stopping_word", slot.stopping_word},
                     };
+                    if (slot_data["state"] == IDLE) {
+                        n_idle_slots++;
+                    } else {
+                        n_processing_slots++;
+                    }
                     slots_data.push_back(slot_data);
                 }
                 LOG_TEE("task %i - slots data: idle=%i processing=%i\n", task.id, n_idle_slots, n_processing_slots);
@@ -2738,19 +2738,6 @@ int main(int argc, char **argv)
         log_data["api_key"] = "api_key: " + std::to_string(sparams.api_keys.size()) + " keys loaded";
     }

-    LOG_INFO("HTTP server listening", log_data);
-    // run the HTTP server in a thread - see comment below
-    std::thread t([&]()
-            {
-                if (!svr.listen_after_bind())
-                {
-                    state.store(SERVER_STATE_ERROR);
-                    return 1;
-                }
-
-                return 0;
-            });
-
     // load the model
     if (!llama.load_model(params))
     {
@@ -3218,6 +3205,19 @@ int main(int argc, char **argv)
     }*/
     //);

+    LOG_INFO("HTTP server listening", log_data);
+    // run the HTTP server in a thread - see comment below
+    std::thread t([&]()
+            {
+                if (!svr.listen_after_bind())
+                {
+                    state.store(SERVER_STATE_ERROR);
+                    return 1;
+                }
+
+                return 0;
+            });
+
     llama.queue_tasks.on_new_task(std::bind(
         &llama_server_context::process_single_task, &llama, std::placeholders::_1));
     llama.queue_tasks.on_finish_multitask(std::bind(
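
The slot-counting hunks above move the idle/processing tally so that it is derived from the `state` value already written into `slot_data`, rather than from a separate `slot.available()` call. The following Python sketch is an illustration of that idea only, not a translation of the server code: the `summarize_slots` function, the dict-based slots, and the numeric `IDLE`/`PROCESSING` constants are assumptions made for the example.

```python
IDLE, PROCESSING = 0, 1  # assumed stand-ins for the server's slot states


def summarize_slots(slots):
    """Hypothetical helper: report slots and counters from a single read of each state."""
    slots_data = []
    n_idle_slots = 0
    n_processing_slots = 0
    for slot in slots:
        # Read the state once; the counters below use this same value,
        # so the reported data and the counts cannot disagree.
        slot_data = {"id": slot["id"], "state": slot["state"]}
        if slot_data["state"] == IDLE:
            n_idle_slots += 1
        else:
            n_processing_slots += 1
        slots_data.append(slot_data)
    return slots_data, n_idle_slots, n_processing_slots


data, idle, processing = summarize_slots([
    {"id": 0, "state": IDLE},
    {"id": 1, "state": PROCESSING},
])
print(data, idle, processing)
```

Counting from the snapshot means a client that reads both the per-slot states and the totals sees one consistent view, which is what the busy/idle checks in the test scenarios rely on.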
23 changes: 23 additions & 0 deletions examples/server/tests/README.md
@@ -0,0 +1,23 @@
# Server Integration Test

Server test scenarios using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) with [behave](https://behave.readthedocs.io/en/latest/).

### Install dependencies
`pip install -r requirements.txt`

### Run tests
1. Build the server
2. Download the required models:
   1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
3. Start the tests: `./tests.sh`

Some scenario step values can be overridden with environment variables:
 - `$PORT` -> `context.server_port`: the port the server listens on during a scenario, default: `8080`
 - `$LLAMA_SERVER_BIN_PATH` -> the path to the server binary, default: `../../../build/bin/server`

To change the server path, use the `LLAMA_SERVER_BIN_PATH` environment variable.

### Skipped scenario

A Feature or Scenario must be annotated with `@llama.cpp` to be included in the scope.
The `@bug` annotation links a scenario to a GitHub issue.
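
The override rules in the README can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code the test steps actually use: `resolve_settings` is a hypothetical helper, and only the two variable names and their defaults come from the README.

```python
import os

DEFAULT_PORT = 8080                               # README default for $PORT
DEFAULT_SERVER_BIN = "../../../build/bin/server"  # README default for $LLAMA_SERVER_BIN_PATH


def resolve_settings(environ=os.environ):
    """Hypothetical helper: return (port, server_path) with env overrides applied."""
    port = int(environ.get("PORT", DEFAULT_PORT))
    server_path = environ.get("LLAMA_SERVER_BIN_PATH", DEFAULT_SERVER_BIN)
    return port, server_path


print(resolve_settings({}))                # defaults
print(resolve_settings({"PORT": "8888"}))  # port overridden, as in the CI workflow
```

Passing the mapping as a parameter keeps the helper easy to exercise without touching the real process environment.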
64 changes: 64 additions & 0 deletions examples/server/tests/features/environment.py
@@ -0,0 +1,64 @@
import os
import socket
import subprocess
import time
from contextlib import closing
from signal import SIGKILL


def before_scenario(context, scenario):
    print(f"\x1b[33;42mStarting new scenario: {scenario.name}!\x1b[0m")
    port = 8080
    if 'PORT' in os.environ:
        port = int(os.environ['PORT'])
    if is_server_listening("localhost", port):
        assert False, "Server already started"


def after_scenario(context, scenario):
    if scenario.status == "failed":
        print(f"\x1b[33;101mSCENARIO FAILED: {scenario.name} server logs:\x1b[0m\n\n")
        if os.path.isfile('llama.log'):
            with closing(open('llama.log', 'r')) as f:
                for line in f:
                    print(line)

    if not pid_exists(context.server_process.pid):
        assert False, f"Server not running pid={context.server_process.pid} ..."

    print(f"stopping server pid={context.server_process.pid} ...")
    context.server_process.kill()
    # Wait briefly for the socket to free up
    time.sleep(0.05)

    attempts = 0
    while is_server_listening(context.server_fqdn, context.server_port):
        print(f"stopping server pid={context.server_process.pid} ...")
        os.kill(context.server_process.pid, SIGKILL)
        time.sleep(0.1)
        attempts += 1
        if attempts > 5:
            print(f"Dangling server still exists, killing all {context.server_path} ...")
            process = subprocess.run(['killall', '-9', context.server_path],
                                     stderr=subprocess.PIPE,
                                     universal_newlines=True)
            print(process)


def is_server_listening(server_fqdn, server_port):
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        result = sock.connect_ex((server_fqdn, server_port))
        return result == 0


def pid_exists(pid):
    """Check whether pid exists in the current process table."""
    import errno
    if pid < 0:
        return False
    try:
        os.kill(pid, 0)
    except OSError as e:
        return e.errno == errno.EPERM
    else:
        return True
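
The teardown above polls the port until the server stops listening. The same pattern can be written with an explicit deadline so the caller learns whether the port was released in time. The sketch below is an alternative illustration, not the PR's implementation: `is_listening` mirrors `is_server_listening` with an added connect timeout, while `wait_until_free` and its `timeout` and `interval` parameters are hypothetical.

```python
import socket
import time
from contextlib import closing


def is_listening(host, port):
    """Return True if a TCP connection to host:port succeeds."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.settimeout(0.5)  # assumption: avoid hanging on an unresponsive host
        return sock.connect_ex((host, port)) == 0


def wait_until_free(host, port, timeout=2.0, interval=0.05):
    """Hypothetical helper: poll until nothing listens on host:port.

    Returns True once the port is free, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while is_listening(host, port):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A teardown hook could call `wait_until_free(context.server_fqdn, context.server_port)` after killing the process and escalate only when it returns False.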
77 changes: 77 additions & 0 deletions examples/server/tests/features/parallel.feature
@@ -0,0 +1,77 @@
@llama.cpp
Feature: Parallel

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file stories260K.gguf
    And   a model alias tinyllama-2
    And   42 as server seed
    And   64 KV cache size
    And   2 slots
    And   continuous batching
    Then  the server is starting
    Then  the server is healthy

  Scenario Outline: Multi users completion
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And <n_predict> max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    And  all slots are idle
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | n_predict |
      | 128       |

  Scenario Outline: Multi users OAI completions compatibility
    Given a system prompt You are a writer.
    And   a model tinyllama-2
    Given a prompt:
      """
      Write a very long book.
      """
    And a prompt:
      """
      Write another a poem.
      """
    And <n_predict> max tokens to predict
    And streaming is <streaming>
    Given concurrent OAI completions requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted with <n_predict> tokens
    Examples:
      | streaming | n_predict |
      | disabled  | 128       |
      #| enabled   | 64        | FIXME: phymbert: need to investigate why in aiohttp with streaming only one token is generated

  Scenario: Multi users with total number of tokens to predict exceeds the KV Cache size #3969
    Given a prompt:
      """
      Write a very long story about AI.
      """
    And a prompt:
      """
      Write another very long music lyrics.
      """
    And a prompt:
      """
      Write a very long poem.
      """
    And a prompt:
      """
      Write a very long joke.
      """
    And 128 max tokens to predict
    Given concurrent completion requests
    Then the server is busy
    Then the server is idle
    Then all prompts are predicted
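
The step definitions behind "concurrent completion requests" are not shown in this part of the diff, but the commit log mentions a switch to asyncio for concurrent tests. The sketch below illustrates that general pattern only. `fake_completion` is a hypothetical stand-in for a real HTTP request to the server, and the response shape (`tokens_predicted`) is an assumption made for the example.

```python
import asyncio


async def fake_completion(prompt, n_predict):
    """Hypothetical stand-in for an HTTP completion request."""
    await asyncio.sleep(0.01)  # simulate network and inference latency
    return {"prompt": prompt, "tokens_predicted": n_predict}


async def run_concurrent(prompts, n_predict):
    """Issue one request per prompt and wait for all of them together."""
    tasks = [fake_completion(prompt, n_predict) for prompt in prompts]
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*tasks)


prompts = [
    "Write a very long story about AI.",
    "Write another very long music lyrics.",
]
results = asyncio.run(run_concurrent(prompts, n_predict=128))

# Mirrors "Then all prompts are predicted with <n_predict> tokens".
assert all(r["tokens_predicted"] == 128 for r in results)
```

With two slots configured in the Background, two such requests can be served in parallel, which is what the busy-then-idle steps observe.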