Super-simple Docker container setup for running Qwen3-Omni models (Instruct, Thinking, Captioner) using a custom build of vLLM with Qwen patches and flash attention.

NOTE: This setup includes small optimizations for the H200. Fork this repo, clone it, and modify the ./start.sh script to add, change, or remove vLLM args as needed.
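For orientation, the sketch below shows the kind of vLLM flags you might tune in ./start.sh. The flag names are standard vLLM serve options, but the values are illustrative, not the repo's actual defaults:

```bash
# Illustrative only -- the real invocation lives in ./start.sh; adjust to your GPU.
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --port 8901 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --tensor-parallel-size 1
```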
Model variants:

- instruct (default): `Qwen/Qwen3-Omni-30B-A3B-Instruct`
- thinking: `Qwen/Qwen3-Omni-30B-A3B-Thinking`
- captioner: `Qwen/Qwen3-Omni-30B-A3B-Captioner`
Usage:

1. Make scripts executable:

   ```bash
   make setup
   ```

2. Build the Docker image (defaults to instruct):

   Please note: the base image is substantial due to a full CUDA toolkit installation. Ensure you have sufficient disk space (all layers take ~16 GiB). A fresh build (without cache) may take 15-30 minutes depending on your internet speed and system performance.

   ```bash
   sudo make build
   # or specify variant: make build MODEL_VARIANT=thinking
   ```

3. Download the model (optional; the HF cache can be used instead):

   ```bash
   make download
   # or specify variant: make download MODEL_VARIANT=thinking
   ```

   Example output:
   ```text
   ./download.sh instruct
   Setting up Python virtual environment in: .venv...
   === Hugging Face download configuration ===
   MODEL_VARIANT: instruct
   MODEL_REPO: Qwen/Qwen3-Omni-30B-A3B-Instruct
   HF_HOME: /var/lib/docker/container_volumes/hf_models
   TRANSFORMERS_CACHE: /var/lib/docker/container_volumes/hf_models
   HUGGINGFACE_HUB_CACHE: /var/lib/docker/container_volumes/hf_models
   Effective HF hub cache: /var/lib/docker/container_volumes/hf_models
   Target materialization: HF cache (internal hashed layout)
   ===========================================
   Proceed with download? [y/N]
   ```
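   If you prefer to pre-populate the cache yourself instead of running `make download`, something like the following works (a sketch assuming the `huggingface_hub` CLI is installed; the HF_HOME path is the one from the example output above):

   ```bash
   # Optional alternative to `make download`: pull the model into the same HF cache.
   export HF_HOME=/var/lib/docker/container_volumes/hf_models
   huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct
   ```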
4. Start the container:

   ```bash
   make start
   # or specify variant: make start MODEL_VARIANT=thinking
   ```

5. Check status:

   ```bash
   make status
   # or specify variant: make status MODEL_VARIANT=thinking
   ```

   Example output:
   ```text
   ./status.sh instruct
   === Docker Container Status ===
   Container name: qwen3-omni-30b-a3b-instruct
   Expected image: qwen3-omni-vllm:instruct
   Expected port: 8901
   Model variant: instruct
   Model repo: Qwen/Qwen3-Omni-30B-A3B-Instruct
   ===============================
   Container Information:
   Status: running
   Image: qwen3-omni-vllm:instruct
   Created: 2025-09-24
   ✅ Container is running
   Started: 2025-09-24
   Port map: 8901/tcp -> 0.0.0.0:8901
             8901/tcp -> [::]:8901
   API Access:
   Endpoint: http://localhost:8901
   Health: http://localhost:8901/health
   Quick Actions:
   View logs: docker logs -f qwen3-omni-30b-a3b-instruct
   Stop: ./stop.sh instruct
   Connectivity Test:
   ✅ API is responding on port 8901
   Resource Usage:
   CPU: 0.72%
   Memory: 4.002GiB / 503.4GiB
   ```
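   Model load can take several minutes before the API responds. A simple wait loop (illustrative, not part of the repo's scripts):

   ```bash
   # Poll the health endpoint until vLLM has finished loading the model.
   until curl -sf http://localhost:8901/health >/dev/null; do
     echo "waiting for vLLM..."; sleep 5
   done
   ```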
6. Stop the container:

   ```bash
   make stop
   # or specify variant: make stop MODEL_VARIANT=thinking
   ```

7. Test vLLM and the model (e2e test using cURL):

   ```bash
   make test-api
   # or specify variant: make test-api MODEL_VARIANT=thinking
   ```
The Makefile and the underlying scripts can be mixed freely:

```bash
# Build specific variant
make build thinking

# Make scripts executable
make setup

# Use scripts with arguments
./build.sh thinking

# Set variant for all subsequent commands
export MODEL_VARIANT=thinking
make build
```

Available make targets:

- `make help` - Show all available commands and examples
- `make setup` - Make scripts executable
- `make build [MODEL_VARIANT=x]` - Build Docker image for variant
- `make download [MODEL_VARIANT=x]` - Download model files
- `make start [MODEL_VARIANT=x]` - Start container
- `make stop [MODEL_VARIANT=x]` - Stop container
- `make status [MODEL_VARIANT=x]` - Check container status
- `make clean [MODEL_VARIANT=x]` - Remove container and image for variant
- `make clean-all` - Remove all containers and images
- `make test-api [MODEL_VARIANT=x]` - Test vLLM and model (e2e test using cURL)
- `make logs [MODEL_VARIANT=x]` - View container logs
- `make logs-follow [MODEL_VARIANT=x]` - Follow container logs
All scripts support help with -h or --help:

```bash
./build.sh --help
./start.sh -h
./stop.sh --help
./status.sh -h
./download.sh -h
./test-api.sh --help
./logs.sh -h    # --follow for tailing logs
```

The configuration is automatically managed via config.sh based on the model variant:
- Container names: `qwen3-omni-30b-a3b-{variant}` (e.g., `qwen3-omni-30b-a3b-thinking`)
- Image tags: `qwen3-omni-vllm:{variant}` (e.g., `qwen3-omni-vllm:captioner`)
- Model repositories: `Qwen/Qwen3-Omni-30B-A3B-{Variant}` (e.g., `Qwen/Qwen3-Omni-30B-A3B-Thinking`)
- Network aliases: include the variant for isolation
- Port: all variants use port 8901 (only one can run at a time)
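As a rough sketch, the derivation amounts to the following (config.sh's actual internals are not shown in this README; the names below merely follow the documented scheme):

```bash
# Hypothetical sketch of the variant -> name mapping described above.
MODEL_VARIANT="${1:-instruct}"
CONTAINER_NAME="qwen3-omni-30b-a3b-${MODEL_VARIANT}"
IMAGE_TAG="qwen3-omni-vllm:${MODEL_VARIANT}"
MODEL_REPO="Qwen/Qwen3-Omni-30B-A3B-${MODEL_VARIANT^}"  # ^ capitalizes the first letter (bash 4+)
PORT=8901
```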
Requirements:

- Docker with GPU support
- NVIDIA drivers
- At least 60GB RAM (for the 30B model)
- CUDA-compatible GPU with sufficient VRAM
- Make (for convenience commands)
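A quick way to sanity-check the GPU prerequisites before building (the CUDA image tag below is just an example):

```bash
nvidia-smi                # driver and GPU visible on the host?
# Verify Docker can see the GPU (any CUDA base image works here).
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
free -h                   # enough system RAM?
```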
Once running, the API is available at:

- Base URL: `http://localhost:8901`
- Health check: `http://localhost:8901/health`
- OpenAI-compatible: `http://localhost:8901/v1/chat/completions`
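For example, a minimal text-only request against the OpenAI-compatible endpoint (the `model` field should match the served variant):

```bash
curl -s http://localhost:8901/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```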
Example output of `make test-api`:

```text
./test-api.sh instruct
=== Testing API for instruct variant ===
ℹ️ Container: qwen3-omni-30b-a3b-instruct
ℹ️ API Base: http://localhost:8901
ℹ️ Model: Qwen/Qwen3-Omni-30B-A3B-Instruct
=== Health Check ===
✅ API health check passed
=== Model Information ===
✅ Model info retrieved
ℹ️ Active model: Qwen/Qwen3-Omni-30B-A3B-Instruct
=== Text Completion Test ===
✅ Text completion test passed
ℹ️ Response: Hello! How can I assist you today?
=== Audio Input Test ===
✅ Audio input test passed
ℹ️ Audio response: It looks like you've shared a piece of text that appears to be a mix of lyrics and possibly some formatting or code-like elements. Let's break it down and clarify what might be going on:

### 1. **Lyrics Analysis**
The main part of your message seems to be a set of lyrics, possibly from a song. Here's a cleaned-up version:

I wonder why
Live a lie
Walk along, along, oh baby
One day, night
The moon
=== Image Input Test ===
✅ Image input test passed
ℹ️ Image response: Based on the image provided, here is a detailed description of its content and meaning.

This is a satirical cartoon that uses a "choose your island wisely" metaphor to compare two different approaches to managing application state in modern web development.

### Scene Description
The image is split into two contrasting islands separated by a body of water.

**Left Island (The "State" Island):**
* **Environment:** This island is depicted as a dark, stormy, and miserable place. It is under
=== Multimodal Input Test ===
✅ Multimodal input test passed
ℹ️ Multimodal response: The audio and image together present a humorous and insightful comparison of two different approaches to state management in modern web development.

**Audio Description:**
A male speaker is talking conversationally about a person's career trajectory. He notes that this individual "wasn't even that big" when he first started listening to him. The speaker then contrasts the person's solo work, which "didn't do overly well," with his later success, stating, "he did very well when he started writing for other people."

**Image Description:**
The image is a cartoon titled "CHOOSE YOUR ISLAND WISELY" that visually represents the two contrasting approaches mentioned in the audio.

* **The Left Island (The "React" Island):** This island
=== Test Summary ===
✅ All tests passed!
```
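The multimodal tests above send image and audio content through the OpenAI-compatible endpoint. A hedged sketch of what such a request can look like, using vLLM's multimodal chat content parts (the URLs are placeholders, and the exact audio schema may vary by vLLM version):

```bash
curl -s http://localhost:8901/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe the image and the audio together."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cartoon.png"}},
            {"type": "audio_url", "audio_url": {"url": "https://example.com/clip.wav"}}
          ]
        }],
        "max_tokens": 256
      }'
```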
Environment variables:

- `MODEL_VARIANT`: Model variant to use (instruct, thinking, captioner)
  - Default: `instruct`
  - Example: `export MODEL_VARIANT=thinking`
  - Can be overridden by script arguments
- `HF_HOME`: Main Hugging Face cache directory (recommended)
  - Example: `export HF_HOME=/path/to/your/hf/cache`
  - If not set: uses the HF default (`~/.cache/huggingface`)
- `TRANSFORMERS_CACHE`: Transformers library cache directory
  - If not set: uses the HF default (typically `$HF_HOME/transformers`)
- `HUGGINGFACE_HUB_CACHE`: Hub cache for downloaded models
  - If not set: uses the HF default (typically `$HF_HOME/hub`)
- `HF_TOKEN`: Hugging Face authentication token
  - Required for: private models, gated models, or higher rate limits
  - If not set: the script will prompt you to enter the token interactively
  - Get your token from: https://huggingface.co/settings/tokens
  - Example: `export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`
Example workflow for a single variant:

```bash
# Set up for the thinking model
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Build and run a specific variant
make setup
make build MODEL_VARIANT=thinking
make download MODEL_VARIANT=thinking
make start MODEL_VARIANT=thinking

# Check status
make status MODEL_VARIANT=thinking

# Stop when done
make stop MODEL_VARIANT=thinking
```

Switching between variants:

```bash
# Build all variants
make setup
make build-all

# Run different models as needed
make start MODEL_VARIANT=instruct
# ... use instruct model ...
make stop MODEL_VARIANT=instruct

make start MODEL_VARIANT=thinking
# ... use thinking model ...
make stop MODEL_VARIANT=thinking

make start MODEL_VARIANT=captioner
# ... use captioner model ...
make stop MODEL_VARIANT=captioner
```

Using the scripts directly:

```bash
# Use direct scripts for development
chmod +x *.sh
./build.sh thinking
./start.sh thinking
./status.sh thinking

# Check logs
docker logs -f qwen3-omni-30b-a3b-thinking

# Stop and clean up
./stop.sh thinking
```

Troubleshooting:

- Script permissions: `chmod +x *.sh` or `make setup`
- Makefile issues: use the direct script commands instead
- Container logs: `docker logs -f qwen3-omni-30b-a3b-{variant}`
- Authentication: check your HF_TOKEN permissions for the specific model
- Cache issues: verify HF_HOME directory permissions and disk space
- Port conflicts: only one model variant can run at a time (all use port 8901)
- GPU memory: ensure sufficient VRAM for the 30B model
- Invalid variant: must be one of: instruct, thinking, captioner
Advanced usage:

```bash
# Override the model repo in the environment
export MODEL_REPO="your-org/custom-qwen3-omni-model"
./start.sh

# Use a different cache for each variant
HF_HOME="/cache/instruct" ./start.sh instruct
HF_HOME="/cache/thinking" ./start.sh thinking
```

Each variant gets its own network alias, allowing you to run multiple containers in different networks if needed.
