Error on running Qwen/Qwen2-VL-72B-Instruct-AWQ #230
Related to #231
Based on the suggestion from aabbccddwasd in #231, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face; to use the new checkpoints, please download them again. You can use the following commands to perform inference on the quantized 72B model with vLLM tensor parallelism.

Server:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
  --served-model-name qwen2vl \
  --model Qwen/Qwen2-VL-72B-Instruct-AWQ \
  --tensor-parallel-size 4 \
  --max_num_seqs 16

Client:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2vl",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
      ]}
    ]
  }'
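The same request can also be sent from Python through the OpenAI-compatible client. This is a minimal sketch, assuming the server above is reachable at localhost:8000; since that server is started without --api-key, the api_key value here is just a placeholder.

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
# Assumes `pip install openai` (v1.x) and the server running on localhost:8000.
from openai import OpenAI

# The server above is launched without --api-key, so any placeholder string works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen2vl",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```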
Please try this. The reason is that vLLM supports distributed tensor-parallel inference and serving; currently it supports Megatron-LM's tensor parallel algorithm and manages the distributed runtime with Ray. To run distributed inference, install Ray first (pip install ray). In your case, --tensor-parallel-size 2 makes vLLM use the Ray library.
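For reference, the same tensor-parallel setting is exposed in vLLM's offline Python API. This is a minimal sketch under the assumption of 2 visible GPUs; it uses a text-only prompt for brevity, and the prompt string is illustrative.

```python
# Minimal sketch: offline inference with vLLM's Python API using 2-way tensor parallelism.
# The CLI flag --tensor-parallel-size maps to tensor_parallel_size here; vLLM sets up the
# distributed runtime (Ray or multiprocessing, depending on the version) as needed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-VL-72B-Instruct-AWQ",
    tensor_parallel_size=2,  # shard the weights across 2 GPUs
    quantization="awq",      # usually auto-detected from the checkpoint config
)

outputs = llm.generate(
    ["Describe the Qwen2-VL model family in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```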
I followed the instructions in the documentation, and after padding it is possible to use vLLM for parallel processing, but the output quality has deteriorated. How should I handle this?
I get the following error when I run vLLM:

docker run \
  --gpus all \
  --ipc=host \
  --network=host \
  --rm \
  -v "/home/user/.cache/huggingface:/root/.cache/huggingface" \
  --name qwen2 \
  -it -p 8000:8000 \
  qwenllm/qwenvl:2-cu121 \
  vllm serve Qwen/Qwen2-VL-72B-Instruct-AWQ --host 0.0.0.0 --api-key=sample_pass --enforce-eager --tensor-parallel-size 2

I have 2x RTX 3090; I can run other LLMs on this configuration (on two cards).
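As a quick sanity check before launching vllm serve, it can help to confirm that both GPUs are visible inside the container and how much memory is free. This is a minimal sketch assuming PyTorch with CUDA is available in the image:

```python
# Minimal sketch: list visible GPUs and their free/total memory inside the container.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, "
          f"{free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
```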