Adding Deepseek instruct + update manifest #379

Open · wants to merge 1 commit into main

31 changes: 28 additions & 3 deletions docs/dev-docker/README.md
@@ -45,10 +45,36 @@ You can pull the image with `docker pull rocm/vllm-dev:main`

- ROCm 6.3 support
- Potential bug with Tunable Ops not saving due to a PyTorch issue
- [Experimental DeepSeekV3 support](#running-deepseekv3)

GEMMs are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md).
The tuned GEMMs are enabled automatically in the docker image, and all stored GEMM configs are kept in /app/_gemm_csv inside the same image.
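The snippet below is a minimal sketch of how those stored configs can be reused: listing the CSVs is just an inspection step, and the two environment variables are the standard PyTorch Tunable Ops switches described later in this README. Treat it as an illustration rather than a required setup step.

```bash
# Inspect the GEMM configs that ship with the image
ls /app/_gemm_csv

# Use the tuned GEMMs without re-tuning (lookup only)
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=0
```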

### Running DeepseekV3

docker run -it --rm --ipc=host --network=host --group-add render \
    --privileged --security-opt seccomp=unconfined \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
    -e VLLM_FP8_PADDING=0 \
    rocm/vllm-dev:main

# Online serving
vllm serve deepseek-ai/DeepSeek-V3 \
    --disable-log-requests \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 32768

# Offline throughput
python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
    --input-len <> --output-len <> --tensor-parallel-size 8 \
    --quantization fp8 --kv-cache-dtype fp8 --dtype float16 \
    --max-model-len 32768 --trust-remote-code

# Offline latency
python3 /app/vllm/benchmarks/benchmark_latency.py --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 --trust-remote-code --max-model-len 32768 \
    --batch-size <> --input-len <> --output-len <>

### Reproducing benchmark results

### Use pre-quantized models
@@ -158,7 +184,6 @@ Some environment variables enhance the performance of the vLLM kernels and PyTorch

export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_MIN_NCHANNELS=112
export VLLM_FP8_PADDING=1

You can set both PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_TUNING to 1 to perform GEMM tuning for the first benchmark run.
Tuning adds some time to that run; once it completes, it generates several CSV files that act as the performance lookup database. For subsequent benchmark runs, you can keep PYTORCH_TUNABLEOP_ENABLED set to 1 and set PYTORCH_TUNABLEOP_TUNING to 0 so the stored results are reused instead of re-tuned.
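As a concrete illustration of that two-phase flow, the sketch below runs a benchmark once in tuning mode and then again in lookup-only mode. The benchmark command and arguments are placeholders borrowed from earlier in this README, and where the generated CSVs land follows PyTorch's Tunable Ops defaults (the current working directory unless PYTORCH_TUNABLEOP_FILENAME says otherwise).

```bash
# First run: tune GEMMs (slower; writes the CSV lookup files on exit)
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1
python3 /app/vllm/benchmarks/benchmark_latency.py --model <model> --tensor-parallel-size 8

# Later runs: reuse the generated CSVs without re-tuning
export PYTORCH_TUNABLEOP_TUNING=0
python3 /app/vllm/benchmarks/benchmark_latency.py --model <model> --tensor-parallel-size 8
```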
@@ -416,13 +441,13 @@ Please refer to the MLPerf instructions for recreating the MLPerf numbers.

Updated:

vLLM: <https://github.com/ROCm/vllm/commit/2c60adc83981ada77a77b2adda78ef109d2e2e2b>
vLLM: <https://github.com/ROCm/vllm/commit/c5a9406b8915cb531d4696984f6e28a8214f9dc6>

### Docker Manifest

To reproduce the release docker:

git clone https://github.com/ROCm/vllm.git
cd vllm
git checkout 2c60adc83981ada77a77b2adda78ef109d2e2e2b
git checkout c5a9406b8915cb531d4696984f6e28a8214f9dc6
docker build -f Dockerfile.rocm -t <your_tag> --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
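Once the build completes, the resulting image can stand in for rocm/vllm-dev:main in the docker run command shown earlier. The sketch below only verifies the image exists and starts a shell in it, with <your_tag> being whatever tag was passed to docker build.

```bash
# Confirm the image was built, then start a container from it
# (<your_tag> is the tag passed to `docker build` above)
docker images <your_tag>
docker run -it --rm --ipc=host --network=host --group-add render \
    --device=/dev/kfd --device=/dev/dri \
    <your_tag>
```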