This repository guides the deployment and testing of multimodal Whisper models on both vLLM and TensorRT-LLM.
Container images are published on quay.io:
- TensorRT-LLM: quay.io/psap/whisper-poc:latest-trt
- vLLM: quay.io/psap/whisper-poc:latest-vllm
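If you only want the prebuilt images, they can be pulled directly instead of building them:
podman pull quay.io/psap/whisper-poc:latest-trt
podman pull quay.io/psap/whisper-poc:latest-vllm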
To build and push the TensorRT-LLM image, from the repository root folder:
cd containers
podman build -f Containerfile.trt -t quay.io/psap/whisper-poc:latest-trt .
podman push quay.io/psap/whisper-poc:latest-trt
To build and push the vLLM image, from the repository root folder:
cd containers
podman build -f Containerfile.vllm -t quay.io/psap/whisper-poc:latest-vllm .
podman push quay.io/psap/whisper-poc:latest-vllm
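Before pushing, you can check that both images were built locally, for example:
podman images | grep whisper-poc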
To deploy the containers in an OCP cluster, run (from the repository root folder):
- Go into the containers folder:
cd containers
- Run the initial steps:
./00_pre.sh
- Deploy a TensorRT-LLM pod with:
./01_pod.trt.sh
- Or deploy a vLLM pod with:
./01_pod.vllm.sh
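Once the scripts finish, the pods should come up in the my-whisper-runtime namespace used below; a quick sanity check (assuming that namespace and the pod names vllm-standalone / trt-standalone):
oc get pods -n my-whisper-runtime
oc wait pod/vllm-standalone -n my-whisper-runtime --for=condition=Ready --timeout=10m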
Connect to the vLLM container and run the evaluation script with python /workspace/scripts/run_vllm.py:
oc exec -n my-whisper-runtime -it vllm-standalone -- /bin/bash
Or run the script directly:
oc exec -n my-whisper-runtime -it vllm-standalone -- /bin/bash -c "python /workspace/scripts/run_vllm.py"
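The script also accepts a few parameters (the same ones used by the Ansible-driven test later in this document); for example, a smaller sweep on the small model would look like:
oc exec -n my-whisper-runtime -it vllm-standalone -- /bin/bash -c "python /workspace/scripts/run_vllm.py --model small --range 100 --batch_sizes 1 2 4 8 16 32"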
The current output should look like:
...
Elapsed time: 789.2372903823853
Total audio seconds processed: 49507.556
Seconds transcribed / sec: 62.72835382121085
Requests per second: 4.217996337181559 for 3329
...
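The throughput figures are simple ratios of the totals above; as a sanity check with the sample values (rounded):
# Seconds transcribed / sec = total audio seconds / elapsed time
echo "49507.556 / 789.237" | bc -l    # ~62.73
# Requests per second = number of requests / elapsed time
echo "3329 / 789.237" | bc -l         # ~4.22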
Connect to the TensorRT-LLM container with:
oc exec -n my-whisper-runtime -it trt-standalone -- /bin/bash
In the container, download and build the model:
bash scripts/trt-download-whisper.sh
bash scripts/trt-build-whisper.sh
And start the Triton inference server:
source ~/scripts/trt-whisper-vars.sh
cd ~/tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=model_repo_whisper/ --tensorrt_llm_model_name tensorrt_llm,whisper_bls --multimodal_gpu0_cuda_mem_pool_bytes 300000000
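Once the Triton server is up, its standard HTTP health endpoint can be used to confirm readiness (assuming Triton's default HTTP port 8000, from inside the pod):
curl -v localhost:8000/v2/health/ready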
Alternatively, to run offline inference (the Triton server does not need to be running for this):
source ~/scripts/trt-whisper-vars.sh
cd ~/tensorrtllm_backend/tensorrt_llm/examples/whisper
python3 run.py --engine_dir $output_dir --dataset hf-internal-testing/librispeech_asr_dummy --enable_warmup --name librispeech_dummy_large_v3 --assets_dir ~/assets
For MLCommons/peoples_speech you may need to add a line to run.py to filter out the longer sequences:
dataset = dataset.filter(lambda example: example['duration_ms'] < 30000 and example['duration_ms'] > 10000)
Then you can run:
python3 run.py --engine_dir $output_dir --dataset MLCommons/peoples_speech --dataset_name microset --enable_warmup --name peoples_speech --dataset_split train --assets_dir ~/assets --batch_size 64
# for bigger dataset
python3 run.py --engine_dir $output_dir --dataset MLCommons/peoples_speech --dataset_name validation --dataset_split validation --enable_warmup --name peoples_speech --assets_dir ~/assets --batch_size 64
From the root of the repository run:
- Install the collection:
cd psap/topsail
ansible-galaxy collection build --force --output-path releases/
VERSION=$(grep '^version: ' ./galaxy.yml | awk '{print $2}')
ansible-galaxy collection install releases/psap-topsail-$VERSION.tar.gz --force
- Run the playbook:
ansible-playbook playbook_whisper.yml
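If the playbook cannot find the collection, verify the install first:
ansible-galaxy collection list psap.topsail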
To publish the collection to Ansible Galaxy:
MY_GALAXY_API_KEY="this_is_a_very_secure_api_key_lol"
ansible-galaxy collection publish \
releases/psap-topsail-$VERSION.tar.gz \
--server https://galaxy.ansible.com \
--ignore-certs \
--verbose \
--api-key $MY_GALAXY_API_KEY
Example of how both CLIs should be aligned to run a specific role:
# Running from the Ansible CLI
ansible-playbook playbook_plotter.yml
# Running from the toolbox CLI
./run_toolbox.py plotter main
Example of how to run a playbook (End to End test scenario):
# Running from the Ansible CLI
ansible-playbook playbook_whisper.yml
# Running from the toolbox CLI
./run_toolbox.py tests whisper
# Create a file with the extra vars
VARS_FILE="./vars.yml"
# Use small or large-v3
# TODO: The variables passed to the CLI should be fetched from the env vars if configured
# TODO: The run_trt.py should be called from a bash script so it can fetch the values from
# the environment variables.
cat <<EOF > $VARS_FILE
whisper_image: quay.io/psap/whisper-poc:latest-trt
whisper_commands_to_run:
- mkdir -p /tmp/output/
- nvidia-smi > /tmp/output/gpu_status.txt
- bash /home/trt/scripts/trt-build-whisper.sh -m small > /tmp/trt-build-whisper.log 2>&1
- python3 /home/trt/scripts/run_trt.py --engine_dir ~/tensorrtllm_backend/tensorrt_llm/examples/whisper/trt_engines/large_v3_max_batch_64 --dataset hf-internal-testing/librispeech_asr_dummy --enable_warmup --name librispeech_dummy_large_v3 --assets_dir ~/assets --num_beams 4 > /tmp/run_trt.log 2>&1
- python3 /home/trt/scripts/run_vllm_plot.py
EOF
# Running from the Ansible CLI
ansible-playbook playbook_whisper.yml -e @$VARS_FILE
# Update the Ansible configuration (ansible.cfg) accordingly; the plugin paths
# below can be printed with: python3 -m ara.setup.ansible
# [defaults]
# callback_plugins=/usr/local/lib/python3.10/dist-packages/ara/plugins/callback
# action_plugins=/usr/local/lib/python3.10/dist-packages/ara/plugins/action
# lookup_plugins=/usr/local/lib/python3.10/dist-packages/ara/plugins/lookup
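Alternatively, ARA documents an environment-variable based setup, so the plugin paths do not have to be hard-coded in ansible.cfg (the paths above are just an example from one Python install):
export ANSIBLE_CALLBACK_PLUGINS=$(python3 -m ara.setup.callback_plugins)
export ANSIBLE_ACTION_PLUGINS=$(python3 -m ara.setup.action_plugins)
export ANSIBLE_LOOKUP_PLUGINS=$(python3 -m ara.setup.lookup_plugins)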
# Let's make sure the local DB is clean
ara-manage prune --confirm
ansible-playbook playbook_plotter.yml
ara-manage generate ./ara-output
The following steps will allow you to test this PoC. The only external requirement is the OpenShift (oc) command line interface.
git clone https://github.com/openshift-psap/whisper-poc
cd whisper-poc/psap/topsail
python3 -m pip install -r requirements.txt
ansible-galaxy install -r requirements.yml
ansible-galaxy collection build --force --output-path releases/
VERSION=$(grep '^version: ' ./galaxy.yml | awk '{print $2}')
ansible-galaxy collection install releases/psap-topsail-$VERSION.tar.gz --force
export KUBECONFIG=<the path to my kubeconfig>
This will run the whisper PoC with fewer resources, on an NVIDIA T4.
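Before running, it can help to confirm that the cluster is reachable and advertises GPU capacity (the nvidia.com/gpu resource is exposed by the NVIDIA GPU operator):
oc get nodes
oc describe nodes | grep "nvidia.com/gpu"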
# Create a file with the extra vars
VARS_FILE="./vars.yml"
# Use small or large-v3
# TODO: the variables passed to the CLI should be fetched from the env vars if configured
cat <<EOF > $VARS_FILE
whisper_image: quay.io/psap/whisper-poc:latest-vllm
whisper_commands_to_run:
- mkdir -p /tmp/output/
- nvidia-smi > /tmp/output/gpu_status.txt
- python /workspace/scripts/run_vllm.py --model small --range 100 --batch_sizes 1 2 4 8 16 32 > /tmp/run_vllm.log 2>&1
- python /workspace/scripts/run_vllm_plot.py
EOF
# Running from the Ansible CLI
ansible-playbook playbook_whisper.yml -e @$VARS_FILE
The results are stored by default in the ./whisper_bench-output folder.
The folder structure and the result files are subject to change.
user@machine:~/dev/whisper-poc/psap/topsail/whisper_bench-output$ tree
.
├── gpu_metrics.csv
├── gpu_status.txt
├── images
│ ├── gpu_utilization_plot.png
│ ├── memory_utilization_plot.png
│ ├── power_draw_plot.png
│ ├── vllm-latency.png
│ ├── vllm-seconds_transcribed_per_sec.png
│ └── vllm-total_time.png
├── output-vllm-001.json
├── output-vllm-002.json
├── output-vllm-004.json
├── output-vllm-008.json
├── output-vllm-016.json
└── output-vllm-032.json
1 directory, 14 files
If you need an HTML report of the execution, ara is installed by default as a callback plugin.
ara-manage generate ./ara-output
Then inspect the output in the ara-output folder (open the index.html file).
For more information about how to configure this callback plugin, read the Logging section.
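Since the generated report is static HTML, one simple way to browse it is to serve the folder locally (assuming Python 3.7+ for the --directory flag):
python3 -m http.server 8000 --directory ./ara-output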