Commit

fix vllm quantization args
merrymercy committed Sep 29, 2023
1 parent faca3a3 commit 8e8a604
Showing 3 changed files with 7 additions and 4 deletions.
2 changes: 1 addition & 1 deletion docs/vllm_integration.md
@@ -19,7 +19,7 @@ See the supported models [here](https://vllm.readthedocs.io/en/latest/models/sup
 python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
 ```
-if you use a awq model, try
+If you use an AWQ quantized model, try
 '''
 python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
 '''
6 changes: 6 additions & 0 deletions fastchat/model/model_registry.py
@@ -306,3 +306,9 @@ def get_model_info(name: str) -> ModelInfo:
     "https://huggingface.co/bofenghuang/vigogne-2-7b-chat",
     "Vigogne-Chat is a French large language model (LLM) optimized for instruction-following and multi-turn dialogues, developed by Bofeng Huang",
 )
+register_model_info(
+    ["mistral-7b-instruct"],
+    "Mistral",
+    "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1",
+    "a large language model by Mistral AI team",
+)
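The hunk above registers the new `mistral-7b-instruct` entry in FastChat's model registry. As a rough, self-contained sketch of how a registry of this shape can work (an illustration under assumed types, not FastChat's exact `model_registry.py`):

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    simple_name: str
    link: str
    description: str

# Hypothetical minimal registry keyed by model name.
model_info: dict[str, ModelInfo] = {}

def register_model_info(full_names, simple_name, link, description):
    # One ModelInfo may be shared by several aliases.
    info = ModelInfo(simple_name, link, description)
    for name in full_names:
        model_info[name] = info

def get_model_info(name: str) -> ModelInfo:
    return model_info[name]

register_model_info(
    ["mistral-7b-instruct"],
    "Mistral",
    "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1",
    "a large language model by Mistral AI team",
)
```

Registering a list of full names lets several checkpoint aliases resolve to the same display metadata.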
3 changes: 0 additions & 3 deletions fastchat/serve/vllm_worker.py
@@ -192,7 +192,6 @@ async def api_model_details(request: Request):
         "--controller-address", type=str, default="http://localhost:21001"
     )
     parser.add_argument("--model-path", type=str, default="lmsys/vicuna-7b-v1.3")
-    parser.add_argument("--quantization", type=str)
     parser.add_argument(
         "--model-names",
         type=lambda s: s.split(","),
@@ -211,8 +210,6 @@ async def api_model_details(request: Request):
     args.model = args.model_path
     if args.num_gpus > 1:
        args.tensor_parallel_size = args.num_gpus
-    if args.quantization:
-        args.quantization = args.quantization
 
    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(engine_args)
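The deletions above make sense if vLLM's `AsyncEngineArgs.add_cli_args` already registers `--quantization` on the same parser: defining the flag a second time in the worker would make `argparse` raise at startup, and the removed `if args.quantization` block assigned a value to itself, a no-op. A stdlib-only sketch of the duplicate-flag conflict (no vLLM required; the second `add_argument` stands in for vLLM's own registration):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--quantization", type=str)  # the worker's own definition

try:
    # Simulates a library (e.g. AsyncEngineArgs.add_cli_args) adding the same flag.
    parser.add_argument("--quantization", type=str)
    conflict = False
except argparse.ArgumentError:
    # Default conflict_handler="error" rejects the duplicate option string.
    conflict = True
```

Dropping the worker-side definition and letting the engine-args helper own the flag avoids the conflict while `AsyncEngineArgs.from_cli_args(args)` still picks the value up.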
