server: continuous performance monitoring and PR comment #6283
Changes from 12 commits
4146960
799317b
5c0b2a2
93434fd
5c2f8e6
225f63b
fb3b2f5
bff4644
337c13b
1c1f876
30195d7
fce86c3
4a6bfa9
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,279 @@ | ||
# Benchmark | ||
name: Benchmark | ||
|
||
on: | ||
workflow_dispatch: | ||
inputs: | ||
gpu-series: | ||
description: 'Azure GPU series to run with' | ||
required: true | ||
type: choice | ||
options: | ||
- Standard_NC4as_T4_v3 | ||
- Standard_NC24ads_A100_v4 | ||
- Standard_NC80adis_H100_v5 | ||
sha: | ||
description: 'Commit SHA1 to build' | ||
required: false | ||
type: string | ||
duration: | ||
description: 'Duration of the bench' | ||
type: string | ||
default: 10m | ||
|
||
push: | ||
branches: | ||
- master | ||
paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*'] | ||
pull_request: | ||
types: [opened, synchronize, reopened] | ||
paths: ['.github/workflows/bench.yml', '**/CMakeLists.txt', '**/Makefile', '**/*.h', '**/*.hpp', '**/*.c', '**/*.cpp', '**/*.cu', '**/*.swift', '**/*.m', 'examples/server/bench/**.*'] | ||
schedule: | ||
- cron: '04 2 * * *' | ||
Comment on lines +31 to +32
Do we need this scheduled run? If so, how will we view the results?
At the moment, it will run the steps not related to the PR: commit status and artefact upload. Later, I will process all commit check statuses to show performance improvements day after day.
That sounds awesome 👍 It would also be cool if we accumulated the daily performance results somewhere and visualized the performance improvement.
Yes, I want to do something like https://home.apache.org/~mikemccand/lucenebench/indexing.html, probably stored on GH pages. But it will require a little time and logic to reprocess previous commits, taking into account that parameters have changed :/
I don't think we should put much effort into reprocessing previous commits. Better to focus just on the new versions from now on.
||
|
||
concurrency: | ||
group: ${{ github.workflow }}-${{ github.ref }} | ||
cancel-in-progress: true | ||
|
||
jobs: | ||
bench-server-baseline: | ||
runs-on: Standard_NC4as_T4_v3 | ||
env: | ||
RUNNER_LABEL: Standard_NC4as_T4_v3 # FIXME: found no way to avoid duplicating the runs-on label | ||
N_USERS: 8 | ||
DURATION: 10m | ||
if: ${{ github.event.inputs.gpu-series == 'Standard_NC4as_T4_v3' || github.event.schedule || github.event.pull_request || github.event.push.ref == 'refs/heads/master' }} | ||
steps: | ||
- name: Clone | ||
id: checkout | ||
uses: actions/checkout@v3 | ||
with: | ||
fetch-depth: 0 | ||
ref: ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha || github.head_ref || github.ref_name }} | ||
|
||
- name: Install python env | ||
id: pipenv | ||
run: | | ||
cd examples/server/bench | ||
python3 -m venv venv | ||
source venv/bin/activate | ||
pip install -r requirements.txt | ||
|
||
- name: Prometheus | ||
id: install_prometheus | ||
run: | | ||
wget --quiet https://github.com/prometheus/prometheus/releases/download/v2.51.0/prometheus-2.51.0.linux-amd64.tar.gz | ||
tar xzf prometheus*.tar.gz --strip-components=1 | ||
./prometheus --config.file=examples/server/bench/prometheus.yml & | ||
while ! nc -z localhost 9090; do | ||
Maybe we should add a timeout here, just in case something goes wrong.
The workflow will be killed after a while. If you don't mind, it can be added later on.
Yeah, it's not very important, but I still prefer not to rely on the CI timeout because it can be long (usually minutes or hours). We should add a timeout here, for example 10 seconds.
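A minimal sketch of the bounded wait suggested in this thread, not what the PR actually merged. The helper name `wait_until` is hypothetical; it polls any command until it succeeds or the given number of seconds has elapsed, so a Prometheus process that never starts fails the step quickly instead of hanging until the CI job timeout.

```shell
# Poll a command until it succeeds or a deadline passes.
# usage: wait_until <timeout_seconds> <command> [args...]
# `wait_until` is a hypothetical helper, not part of the workflow under review.
wait_until() {
    wu_timeout="$1"
    shift
    wu_deadline=$(( $(date +%s) + wu_timeout ))
    # Re-run the command until it exits with status 0.
    until "$@"; do
        if [ "$(date +%s)" -ge "$wu_deadline" ]; then
            echo "timed out after ${wu_timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep 0.1
    done
}

# In the Prometheus step this could replace the unbounded loop:
#   wait_until 10 nc -z localhost 9090
```

Because the deadline is computed with whole-second resolution from `date +%s`, the effective timeout can overshoot by up to about one second, which seems acceptable for a CI readiness check.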
||
sleep 0.1 | ||
done | ||
|
||
- name: Install k6 | ||
id: k6_installation | ||
run: | | ||
cd examples/server/bench | ||
wget --quiet https://github.com/grafana/k6/releases/download/v0.49.0/k6-v0.49.0-linux-amd64.tar.gz | ||
tar xzf k6*.tar.gz --strip-components=1 | ||
|
||
- name: Build | ||
id: cmake_build | ||
run: | | ||
set -eux | ||
mkdir build | ||
cd build | ||
cmake .. \ | ||
-DLLAMA_NATIVE=OFF \ | ||
-DLLAMA_BUILD_SERVER=ON \ | ||
-DLLAMA_CURL=ON \ | ||
-DLLAMA_CUBLAS=ON \ | ||
-DCUDAToolkit_ROOT=/usr/local/cuda \ | ||
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \ | ||
-DCMAKE_CUDA_ARCHITECTURES=75 \ | ||
-DLLAMA_FATAL_WARNINGS=OFF \ | ||
-DLLAMA_ALL_WARNINGS=OFF \ | ||
-DCMAKE_BUILD_TYPE=Release; | ||
cmake --build . --config Release -j $(nproc) --target server | ||
|
||
- name: Download the dataset | ||
id: download_dataset | ||
run: | | ||
cd examples/server/bench | ||
wget --quiet https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json | ||
|
||
- name: Server bench | ||
id: server_bench | ||
run: | | ||
set -eux | ||
|
||
cd examples/server/bench | ||
source venv/bin/activate | ||
BENCH_K6_BIN_PATH=./k6 python bench.py \ | ||
--runner-label ${{ env.RUNNER_LABEL }} \ | ||
--name ${{ github.job }} \ | ||
--branch ${{ github.head_ref || github.ref_name }} \ | ||
--commit ${{ github.event.inputs.sha || github.event.pull_request.head.sha || github.sha }} \ | ||
--scenario script.js \ | ||
--duration ${{ github.event.inputs.duration || env.DURATION }} \ | ||
|
||
--hf-repo ggml-org/models \ | ||
--hf-file phi-2/ggml-model-q4_0.gguf \ | ||
--model-path-prefix /models \ | ||
--parallel ${{ env.N_USERS }} \ | ||
-ngl 33 \ | ||
--batch-size 2048 \ | ||
--ubatch-size 256 \ | ||
--ctx-size 16384 \ | ||
--n-prompts 1000 \ | ||
--max-prompt-tokens 1024 \ | ||
--max-tokens 2048 | ||
|
||
cat results.github.env >> $GITHUB_ENV | ||
|
||
# Remove dataset as we do not want it in the artefact | ||
rm ShareGPT_V3_unfiltered_cleaned_split.json | ||
|
||
- uses: actions/upload-artifact@v4 | ||
with: | ||
name: benchmark-results | ||
compression-level: 9 | ||
path: | | ||
examples/server/bench/*.jpg | ||
examples/server/bench/*.json | ||
examples/server/bench/*.log | ||
|
||
- name: Commit status | ||
uses: Sibz/github-status-action@v1 | ||
with: | ||
authToken: ${{secrets.GITHUB_TOKEN}} | ||
sha: ${{ inputs.sha || github.event.pull_request.head.sha || github.sha }} | ||
context: bench-server-baseline | ||
description: | | ||
${{ env.BENCH_RESULTS }} | ||
state: 'success' | ||
|
||
- name: Upload benchmark images | ||
uses: devicons/[email protected] | ||
continue-on-error: true # Important as it looks unstable: 503 | ||
id: imgur_step | ||
with: | ||
client_id: ${{secrets.IMGUR_CLIENT_ID}} | ||
path: | | ||
examples/server/bench/prompt_tokens_seconds.jpg | ||
examples/server/bench/predicted_tokens_seconds.jpg | ||
examples/server/bench/kv_cache_usage_ratio.jpg | ||
examples/server/bench/requests_processing.jpg | ||
|
||
- name: Extract mermaid | ||
id: set_mermaid | ||
run: | | ||
set -eux | ||
|
||
cd examples/server/bench | ||
PROMPT_TOKENS_SECONDS=$(cat prompt_tokens_seconds.mermaid) | ||
echo "PROMPT_TOKENS_SECONDS<<EOF" >> $GITHUB_ENV | ||
echo "$PROMPT_TOKENS_SECONDS" >> $GITHUB_ENV | ||
echo "EOF" >> $GITHUB_ENV | ||
|
||
PREDICTED_TOKENS_SECONDS=$(cat predicted_tokens_seconds.mermaid) | ||
echo "PREDICTED_TOKENS_SECONDS<<EOF" >> $GITHUB_ENV | ||
echo "$PREDICTED_TOKENS_SECONDS" >> $GITHUB_ENV | ||
echo "EOF" >> $GITHUB_ENV | ||
|
||
KV_CACHE_USAGE_RATIO=$(cat kv_cache_usage_ratio.mermaid) | ||
echo "KV_CACHE_USAGE_RATIO<<EOF" >> $GITHUB_ENV | ||
echo "$KV_CACHE_USAGE_RATIO" >> $GITHUB_ENV | ||
echo "EOF" >> $GITHUB_ENV | ||
|
||
REQUESTS_PROCESSING=$(cat requests_processing.mermaid) | ||
echo "REQUESTS_PROCESSING<<EOF" >> $GITHUB_ENV | ||
echo "$REQUESTS_PROCESSING" >> $GITHUB_ENV | ||
echo "EOF" >> $GITHUB_ENV | ||
|
||
- name: Extract image url | ||
id: extract_image_url | ||
continue-on-error: true | ||
run: | | ||
set -eux | ||
|
||
echo "IMAGE_O=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[0] }}" >> $GITHUB_ENV | ||
echo "IMAGE_1=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[1] }}" >> $GITHUB_ENV | ||
echo "IMAGE_2=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[2] }}" >> $GITHUB_ENV | ||
echo "IMAGE_3=${{ fromJSON(steps.imgur_step.outputs.imgur_urls)[3] }}" >> $GITHUB_ENV | ||
|
||
- name: Comment PR | ||
uses: mshick/add-pr-comment@v2 | ||
id: comment_pr | ||
if: ${{ github.event.pull_request != '' }} | ||
with: | ||
message-id: bench-${{ github.job }}-${{ env.RUNNER_LABEL }} | ||
message: | | ||
📈 **llama.cpp server** for _${{ github.job }}_ on _${{ env.RUNNER_LABEL }}_: **${{ env.BENCH_ITERATIONS}} iterations** 🚀 | ||
|
||
- Concurrent users: ${{ env.N_USERS }}, duration: ${{ github.event.inputs.duration || env.DURATION }} | ||
- HTTP request : avg=${{ env.HTTP_REQ_DURATION_AVG }}ms p(90)=${{ env.HTTP_REQ_DURATION_P_90_ }}ms passes=${{ env.HTTP_REQ_FAILED_FAILS }}reqs fails=${{ env.HTTP_REQ_FAILED_PASSES }}reqs Finish reason: stop=${{ env.LLAMACPP_COMPLETIONS_STOP_RATE_PASSES }}reqs truncated=${{ env.LLAMACPP_COMPLETIONS_TRUNCATED_RATE_PASSES }}reqs | ||
- Prompt processing (pp): avg=${{ env.LLAMACPP_PROMPT_TOKENS_AVG }}tk/s p(90)=${{ env.LLAMACPP_PROMPT_TOKENS_P_90_ }}tk/s **total=${{ env.LLAMACPP_PROMPT_TOKENS_TOTAL_COUNTER_RATE }}tk/s** | ||
- Token generation (tg): avg=${{ env.LLAMACPP_TOKENS_SECOND_AVG }}tk/s p(90)=${{ env.LLAMACPP_TOKENS_SECOND_P_90_ }}tk/s **total=${{ env.LLAMACPP_COMPLETION_TOKENS_TOTAL_COUNTER_RATE }}tk/s** | ||
- ${{ env.BENCH_GRAPH_XLABEL }} | ||
|
||
<details> | ||
|
||
<summary>Time series</summary> | ||
|
||
<p align="center"> | ||
|
||
<img width="100%" height="100%" src="${{ env.IMAGE_O }}" alt="prompt_tokens_seconds" /> | ||
|
||
<details> | ||
|
||
<summary>More</summary> | ||
|
||
```mermaid | ||
${{ env.PROMPT_TOKENS_SECONDS }} | ||
``` | ||
|
||
</details> | ||
|
||
<img width="100%" height="100%" src="${{ env.IMAGE_1 }}" alt="predicted_tokens_seconds"/> | ||
|
||
<details> | ||
<summary>More</summary> | ||
|
||
```mermaid | ||
${{ env.PREDICTED_TOKENS_SECONDS }} | ||
``` | ||
|
||
</details> | ||
|
||
</p> | ||
|
||
<details> | ||
|
||
<summary>Details</summary> | ||
|
||
<p align="center"> | ||
|
||
<img width="100%" height="100%" src="${{ env.IMAGE_2 }}" alt="kv_cache_usage_ratio" /> | ||
|
||
<details> | ||
<summary>More</summary> | ||
|
||
```mermaid | ||
${{ env.KV_CACHE_USAGE_RATIO }} | ||
``` | ||
|
||
</details> | ||
|
||
<img width="100%" height="100%" src="${{ env.IMAGE_3 }}" alt="requests_processing"/> | ||
|
||
<details> | ||
<summary>More</summary> | ||
|
||
```mermaid | ||
${{ env.REQUESTS_PROCESSING }} | ||
``` | ||
|
||
</details> | ||
|
||
</p> | ||
</details> | ||
</details> |
How about excluding examples/ subdirectories except for examples/server? It could help reduce unneeded runs.
Let's do it in another PR if you don't mind.
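Another possible follow-up, separate from the path filter: the "Extract mermaid" step in the workflow repeats the same three `echo` lines for each of the four charts. The sketch below shows one way to factor that into a helper. The function name `append_multiline_env` and the delimiter scheme are assumptions for illustration, not something from this PR. It writes the same `NAME<<DELIMITER` block that the step writes by hand, but derives the delimiter from the variable name and shell PID so that it is less likely to collide with a literal `EOF` line inside a chart.

```shell
# Append a multi-line value to a GitHub Actions environment file using the
# NAME<<DELIMITER ... DELIMITER syntax.
# usage: append_multiline_env <VAR_NAME> <source_file> [env_file]
# `append_multiline_env` is a hypothetical helper, not part of the workflow.
append_multiline_env() {
    ame_name="$1"
    ame_file="$2"
    ame_out="${3:-$GITHUB_ENV}"
    # Delimiter derived from the variable name and shell PID (an assumption
    # of this sketch; the workflow itself uses a fixed "EOF").
    ame_delim="EOF_${ame_name}_$$"
    {
        echo "${ame_name}<<${ame_delim}"
        # $(cat ...) strips trailing newlines; printf adds exactly one back,
        # so the delimiter always starts on its own line.
        printf '%s\n' "$(cat "$ame_file")"
        echo "${ame_delim}"
    } >> "$ame_out"
}

# The four blocks in the "Extract mermaid" step could then become:
#   cd examples/server/bench
#   for metric in prompt_tokens_seconds predicted_tokens_seconds \
#                 kv_cache_usage_ratio requests_processing; do
#       var=$(echo "$metric" | tr '[:lower:]' '[:upper:]')
#       append_multiline_env "$var" "$metric.mermaid"
#   done
```

The variable names produced by the loop match the ones the PR comment template already reads (`PROMPT_TOKENS_SECONDS`, `PREDICTED_TOKENS_SECONDS`, `KV_CACHE_USAGE_RATIO`, `REQUESTS_PROCESSING`), so the "Comment PR" step would not need to change.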