[Model] Support Mistral-Nemo #6548
Conversation
Tested merging your commits into 0.5.2 and it works fine. The model works up to 100k tokens (the max I can fit on my A100 with fp8 weights / fp8 cache).
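For reference, a minimal offline-inference sketch of that fp8-weights / fp8-KV-cache setup using vLLM's Python API (the context length and memory fraction below are illustrative, not the exact values used above):

```python
from vllm import LLM, SamplingParams

# Sketch only: fp8 weight quantization plus an fp8 KV cache to stretch context on one 80GB A100.
llm = LLM(
    model="mistralai/Mistral-Nemo-Instruct-2407",
    quantization="fp8",          # quantize weights to fp8 on load
    kv_cache_dtype="fp8",        # store the KV cache in fp8 as well
    max_model_len=100_000,       # roughly the context reported in the comment above
    gpu_memory_utilization=0.95,
)

outputs = llm.generate(
    ["Summarize the Mistral-Nemo release in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```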
Yes!
env: VLLM_ATTENTION_BACKEND=XFORMERS CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8 --gpu-memory-utilization 0.75 --quantization fp8 --host 0.0.0.0 --port 1237 -tp 2 --max-model-len 17000 --served-model-name gpt --trust-remote-code --enable-prefix-caching
error:
@maxin9966 You would need to apply the patch manually. It hasn't been released yet. See https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source. I'm running mistralai/Mistral-Nemo-Instruct-2407 on an A100 with 100k and no issues. Built via Docker:
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag nemo-vllm
@jasonacox Alright, thank you very much. @mgoin Could you please confirm if the latest code supports the mistral-nemo models running in gptq or awq modes? FP8 is a bit too slow.
Yes, mistral-nemo should have the same quantization support as mistral.
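As a sketch of what that could look like with the offline API (the AWQ repo name below is hypothetical; substitute whatever GPTQ/AWQ export you actually have):

```python
from vllm import LLM

# Hypothetical checkpoint name, for illustration only.
llm = LLM(
    model="some-org/Mistral-Nemo-Instruct-2407-AWQ",
    quantization="awq",      # or "gptq" for a GPTQ export
    max_model_len=16_384,
)
```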
@w013nad How did you test fp8 with an A100? I thought fp8 was only supported on newer hardware. Thanks!
How much GPU memory is needed for the fp16 model with 128K tokens?
Testing a single A100 with 128k max-model-len and dtype=auto: the weights take 23GB, but the full VRAM footprint while running is 57GB. I'm getting an average of 42 TPS per session with an aggregate throughput of 1,422 TPS using 512 concurrent threads (load testing). Docker:
git clone https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-nemo
docker run -d --runtime nvidia --gpus '"device=0"' \
    -v ${PWD}/models:/root/.cache/huggingface \
    -p 8000:8000 \
    -e NVIDIA_DISABLE_REQUIRE=true \
    --env "HF_TOKEN=*******" \
    --ipc=host \
    --name vllm \
    --restart unless-stopped \
    vllm-nemo \
    --model mistralai/Mistral-Nemo-Instruct-2407 \
    --max-model-len 128000 \
    --tensor-parallel-size 1
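Once that container is up, a minimal client check against the OpenAI-compatible endpoint (assuming port 8000 is reachable and no API key is enforced) might look like:

```python
import requests

# The model name matches the --model value passed to the container above.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-Nemo-Instruct-2407",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```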
This is great! Do you know if it will work like this with a LoRA adapter currently?
Tested with FP8 on 2x A100s, getting 86.60 tok/s.
@tensimixt I would love to see what aggregate (concurrent) tok/s you get with that setup. I use this simple load generator: https://github.com/jasonacox/TinyLLM/blob/main/loadtest.py
@simon-mo will the latest docker image include this next week? Thanks!
Need help: why can't I use fp8?
I'm getting
Ensure that your input sequence length doesn’t exceed the model’s maximum limit. Trim or truncate the input to fit within 1024 tokens.
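A rough client-side sketch of that truncation, using the model's own tokenizer (the 1024-token budget simply follows the suggestion above; adjust it to whatever limit your deployment enforces):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

long_prompt = "..."  # your raw input text
ids = tokenizer(long_prompt, truncation=True, max_length=1024)["input_ids"]
trimmed_prompt = tokenizer.decode(ids, skip_special_tokens=True)
```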
Initialization config
From model's config.json:
Input length is irrelevant, because it's
@Isotr0py would you have an idea about GGUF issues with this architecture?
@vladfaust Can you try updating to the latest vLLM?
@Isotr0py yep, it works with the latest vLLM.
FIX #6545
Patch was ported from huggingface/transformers#32050
Essentially there was a new `head_dim` override added to MistralConfig. We will look for that optional argument in the config and default to the previous `self.hidden_size // self.total_num_heads` behavior.
We have also produced and validated an FP8 quantized checkpoint: https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8
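In other words, the lookup is roughly the following (a simplified sketch, not the exact vLLM diff):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

# Prefer the explicit head_dim from config.json; fall back to the old derivation.
head_dim = getattr(config, "head_dim", None)
if head_dim is None:
    head_dim = config.hidden_size // config.num_attention_heads

# For Mistral-Nemo this yields 128, while hidden_size // num_attention_heads
# would give 160, which is exactly why the override is needed.
print(head_dim)
```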
Note that by default it will use a very large model length (128k) and may need `max_model_len` to be specified.
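For example, a minimal sketch of capping the context via the offline API (the 32k value is arbitrary; pick whatever fits your hardware):

```python
from vllm import LLM

# Cap the context so the KV cache isn't sized for the full 128k default.
llm = LLM(
    model="neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    max_model_len=32_768,
)
```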