
[Model] Support Pixtral models in the HF Transformers format #9036

Merged: 15 commits into main on Oct 18, 2024

Conversation

@mgoin (Member) commented Oct 3, 2024

FIX #8566
FIX #8685
FIX #9069

Introduces PixtralHF, a model implementation for the HF Transformers format of Pixtral. Based on https://github.com/huggingface/transformers/blob/main/src/transformers/models/pixtral/modeling_pixtral.py

Tested with:

This model implementation follows the Llava family, meaning image embeddings are placed in place of the [IMG] token placeholders. The model uses PixtralVisionModel for its vision encoder and MistralForCausalLM for its language decoder.
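
For reference, a minimal single-image sketch in the same style as the multi-image script further below; the model name, prompt format, and API calls are taken from the examples in this PR, so treat it as illustrative rather than an official example:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Single-image sketch: one [IMG] placeholder in the prompt corresponds to the
# one image passed via multi_modal_data.
llm = LLM(
    model="mistral-community/pixtral-12b",
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=10000,
)

image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
inputs = {
    "prompt": "<s>[INST]Describe the image.\n[IMG][/INST]",
    "multi_modal_data": {"image": image},
}
outputs = llm.generate(inputs, sampling_params=SamplingParams(temperature=0.0, max_tokens=100))
print(outputs[0].outputs[0].text)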

Example output from python examples/offline_inference_vision_language.py --model pixtral_hf:

The image features a prominent structure in the background, which is the Tokyo Skytree, a broadcasting and observation tower located in Tokyo, Japan. The Tokyo Skytree is the tallest tower in the world and is known for its distinctive lattice structure and spherical observation decks.

In the foreground, there are cherry blossom
The image features a beautiful scene with cherry blossoms in the foreground, framing a tall, modern tower in the background. The cherry blossoms are in full bloom, with delicate pink flowers covering the branches, creating a picturesque and serene atmosphere. The tower in the background appears to be a significant architectural structure, possibly
The image features a beautiful scene with cherry blossoms in full bloom in the foreground, framing a tall, modern tower in the background. The cherry blossoms, with their delicate pink and white flowers, create a picturesque and serene atmosphere. The tower, which appears to be a significant landmark, stands prominently against a
The image depicts a scene with cherry blossoms in full bloom, creating a picturesque and vibrant foreground. The delicate pink flowers are prominently displayed, framing the view and adding a sense of natural beauty and tranquility. In the background, there is a tall, modern tower with a distinctive architectural design, featuring a spherical observation

Offline multi-image example

Script used for simple testing of multi-image:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "mistral-community/pixtral-12b"
llm = LLM(
    model=model_name,
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=10000,
    limit_mm_per_prompt={"image": 2},  # allow up to two images per prompt
)

image1 = ImageAsset("cherry_blossom").pil_image.convert("RGB")
image2 = ImageAsset("stop_sign").pil_image.convert("RGB")
inputs = {
    # One [IMG] placeholder per image, in the same order as the image list.
    "prompt": "<s>[INST]Describe the images.\n[IMG][IMG][/INST]",
    "multi_modal_data": {
        "image": [image1, image2]
    },
}
outputs = llm.generate(inputs, sampling_params=SamplingParams(temperature=0.0, max_tokens=200))

print(outputs[0].outputs[0].text)

Output:

The first image depicts a beautiful scene with cherry blossoms in full bloom, framing a tall, modern tower in the background. The cherry blossoms, with their delicate pink flowers, create a picturesque foreground against a clear blue sky. The tower, likely an observation or communication structure, stands prominently in the center, adding a contrast between natural beauty and modern architecture.

The second image shows an urban street scene with a stop sign in the foreground. The stop sign is positioned in front of a traditional Chinese archway, which is decorated with red and gold colors and Chinese characters. The archway leads into a Chinatown area, as indicated by the signage and the architectural style. There is a black SUV driving past the archway, and the street is lined with various shops and buildings, including a visible Optus store. The scene captures a blend of traditional and modern elements within an urban setting.

Offline chat example

Script used for testing of chat templating:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.multimodal.utils import encode_image_base64

def image_url(asset: str):
    image = ImageAsset(asset)
    base64 = encode_image_base64(image.pil_image)
    return f"data:image/jpeg;base64,{base64}"

model_name = "mistral-community/pixtral-12b"
llm = LLM(
    model=model_name,
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=10000,
)

chat_template = "{%- if messages[0][\"role\"] == \"system\" %}\n    {%- set system_message = messages[0][\"content\"] %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n        {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n    {%- endif %}\n    {%- if message[\"role\"] == \"user\" %}\n        {%- if loop.last and system_message is defined %}\n            {{- \"[INST]\" + system_message + \"\n\n\" }}\n        {%- else %}\n            {{- \"[INST]\" }}\n        {%- endif %}\n        {%- if message[\"content\"] is not string %}\n            {%- for chunk in message[\"content\"] %}\n                {%- if chunk[\"type\"] == \"text\" %}\n                    {{- chunk[\"content\"] }}\n                {%- elif chunk[\"type\"] == \"image\" %}\n                    {{- \"[IMG]\" }}\n                {%- else %}\n                    {{- raise_exception(\"Unrecognized content type!\") }}\n                {%- endif %}\n            {%- endfor %}\n        {%- else %}\n            {{- message[\"content\"] }}\n        {%- endif %}\n        {{- \"[/INST]\" }}\n    {%- elif message[\"role\"] == \"assistant\" %}\n        {{- message[\"content\"] + eos_token}}\n    {%- else %}\n        {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n    {%- endif %}\n{%- endfor %}"  # noqa
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": image_url("stop_sign")}},
        ],
    },
]
outputs = llm.chat(messages,
                   sampling_params=SamplingParams(temperature=0.0, max_tokens=100),
                   chat_template=chat_template)

print(outputs[0].outputs[0].text)

Output:

The image depicts a street scene in what appears to be a Chinatown. Prominently in the foreground is a red "STOP" sign. Behind the sign, there is a traditional Chinese archway with intricate designs and Chinese characters. The archway is painted in vibrant colors, predominantly red and gold. 

To the right of the archway, there is a black SUV driving on the road. The street is lined with various shops and businesses, some of which have signs in both

github-actions bot commented Oct 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@wuxiyiye commented Oct 9, 2024

Hi @mgoin, thanks for your contribution! Will you continue working on this PR?

@mgoin (Member, Author) commented Oct 9, 2024

@wuxiyiye I'm slowly working through the issues, but it is quite a lot of work due to poor reuse of existing Llava features. I would greatly appreciate it if others have the bandwidth to work on this.

@mgoin changed the title from "Support Pixtral models in the HF Transformers format" to "[WIP] Support Pixtral models in the HF Transformers format" on Oct 10, 2024
@mgoin marked this pull request as ready for review on October 16, 2024 at 15:22
@mgoin changed the title from "[WIP] Support Pixtral models in the HF Transformers format" to "[Model] Support Pixtral models in the HF Transformers format" on Oct 16, 2024
@mgoin (Member, Author) commented Oct 16, 2024

Also I have verified that an FP8 checkpoint loads and produces good output:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "nm-testing/pixtral-12b-FP8-dynamic"
llm = LLM(
    model=model_name,
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=10000,
    limit_mm_per_prompt={"image": 2},  # allow up to two images per prompt
)

image1 = ImageAsset("cherry_blossom").pil_image.convert("RGB")
image2 = ImageAsset("stop_sign").pil_image.convert("RGB")
inputs = {
    # One [IMG] placeholder per image, in the same order as the image list.
    "prompt": "<s>[INST]Describe the images.\n[IMG][IMG][/INST]",
    "multi_modal_data": {
        "image": [image1, image2]
    },
}
outputs = llm.generate(inputs, sampling_params=SamplingParams(temperature=0.0, max_tokens=200))

print(outputs[0].outputs[0].text)

Output:
The image on the left depicts a tall, modern tower with a unique architectural design, partially obscured by cherry blossom trees in full bloom. The blossoms are vibrant and abundant, creating a picturesque scene against a clear blue sky. The tower appears to be a significant landmark, possibly a telecommunications or observation tower, given its height and structure.

The image on the right shows a street scene in what appears to be a Chinatown district. Prominent in the foreground is a red stop sign with white lettering. Behind the stop sign, there is an ornate, traditional Chinese archway with red and gold decorations and Chinese characters. The archway frames a street lined with various shops and businesses, including a visible sign for "Optus." A black SUV is driving through the intersection, and there are bollards and a tree in the vicinity. The overall atmosphere suggests a blend of traditional and modern elements in an urban setting.

@mgoin requested a review from DarkLight1337 on October 17, 2024 at 14:01
@DarkLight1337 (Member) left a comment:

Thanks for your hard work! Some initial comments.

Several review threads on vllm/model_executor/models/pixtral.py (resolved; some outdated).
Comment on lines 705 to 719
replace_tokens = [[processor.image_token] * num_width_tokens +
                  [processor.image_break_token]] * num_height_tokens
# Flatten list
replace_tokens = [
    item for sublist in replace_tokens for item in sublist
]
replace_tokens[-1] = processor.image_end_token
replace_str = "".join(replace_tokens)
replace_strings.append(replace_str)
new_prompt = new_prompt.replace(processor.image_token, "<placeholder>",
                                1)

while "<placeholder>" in new_prompt:
    replace_str = replace_strings.pop(0)
    new_prompt = new_prompt.replace("<placeholder>", replace_str, 1)
Member:

Depending on the prompt, this may be quite expensive. I suggest using the more optimized vllm.multimodal.utils.repeat_and_pad_placeholder_tokens function.

@mgoin (Member, Author) commented Oct 17, 2024:

The issue with using repeat_and_pad_placeholder_tokens is that we need to insert image_break_token at the end of every row and image_end_token at the very end of each image, along with supporting multiple differently sized images in a prompt. I think we can optimize this later with a new implementation that supports this.
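
For illustration, a minimal sketch of the per-image placeholder layout described above; the token literals stand in for the processor's image_token, image_break_token, and image_end_token attributes and are placeholders, not guaranteed to match the real token strings:

IMG, IMG_BREAK, IMG_END = "[IMG]", "[IMG_BREAK]", "[IMG_END]"

def image_placeholder(num_width_tokens: int, num_height_tokens: int) -> str:
    # Each row of patches becomes num_width_tokens image tokens followed by a
    # break token; the very last token of the image is the end token instead.
    rows = [IMG * num_width_tokens + IMG_BREAK for _ in range(num_height_tokens)]
    layout = "".join(rows)
    return layout[:-len(IMG_BREAK)] + IMG_END

# e.g. image_placeholder(3, 2) ->
# "[IMG][IMG][IMG][IMG_BREAK][IMG][IMG][IMG][IMG_END]"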

Member:

I see, let's do it in another PR then. We should also TP the model in the future.

@mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Oct 17, 2024
@DarkLight1337 (Member) left a comment:

Overall looks good. Please see my comment above though.

Also, we should add the HF version to our list of supported models.

@mgoin merged commit 3921a2f into main on Oct 18, 2024
60 checks passed
@rebel-jonghewk

@mgoin I'm trying to run LlavaNextForConditionalGeneration on a non-CUDA hardware platform, but when I import llava_next, it pulls in pixtral, which in turn imports xformers. From my understanding, xformers is required only for CUDA support. Is there a way to avoid this dependency or run LlavaNextForConditionalGeneration without xformers on non-CUDA platforms?

@pratyush0599 commented Oct 22, 2024

(quoting @mgoin's FP8 example above)

Great work on this issue, guys! However, I was wondering why "nm-testing/pixtral-12b-FP8-dynamic" is supported by vLLM while "SeanScripts/pixtral-12b-nf4" (which uses bitsandbytes) isn't. I get the same error as mentioned in #9069. Thoughts?
llm = LLM(
    model="SeanScripts/pixtral-12b-nf4",
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=10000,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

Error Details

INFO 10-22 09:09:17 config.py:1700] Downcasting torch.float32 to torch.float16.
WARNING 10-22 09:09:24 config.py:361] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 10-22 09:09:24 config.py:435] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-22 09:09:24 llm_engine.py:238] Initializing an LLM engine (v0.6.3.post2.dev37+g696b01af) with config: model='SeanScripts/pixtral-12b-nf4', speculative_config=None, tokenizer='SeanScripts/pixtral-12b-nf4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=SeanScripts/pixtral-12b-nf4, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-22 09:09:27 model_runner.py:1055] Starting to load model SeanScripts/pixtral-12b-nf4...
/opt/conda/envs/prats/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/prats/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")

AttributeError Traceback (most recent call last)
Cell In[1], line 5
2 from vllm.assets.image import ImageAsset
4 model_name = "SeanScripts/pixtral-12b-nf4"
----> 5 llm = LLM(
6 model=model_name,
7 max_num_seqs=1,
8 enforce_eager=True,
9 max_model_len=10000,
10 quantization="bitsandbytes",
11 load_format="bitsandbytes"
12 )

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/utils.py:1073, in deprecate_args.<locals>.wrapper.<locals>.inner(*args, **kwargs)
1066 msg += f" {additional_message}"
1068 warnings.warn(
1069 DeprecationWarning(msg),
1070 stacklevel=3, # The inner function takes up one level
1071 )
-> 1073 return fn(*args, **kwargs)

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/entrypoints/llm.py:193, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, task, **kwargs)
167 kwargs["disable_log_stats"] = True
169 engine_args = EngineArgs(
170 model=model,
171 task=task,
(...)
191 **kwargs,
192 )
--> 193 self.llm_engine = LLMEngine.from_engine_args(
194 engine_args, usage_context=UsageContext.LLM_CLASS)
195 self.request_counter = Counter()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/engine/llm_engine.py:574, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
572 executor_class = cls._get_executor_cls(engine_config)
573 # Create the LLM engine.
--> 574 engine = cls(
575 **engine_config.to_dict(),
576 executor_class=executor_class,
577 log_stats=not engine_args.disable_log_stats,
578 usage_context=usage_context,
579 stat_loggers=stat_loggers,
580 )
582 return engine

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/engine/llm_engine.py:335, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, use_cached_outputs)
331 self.input_registry = input_registry
332 self.input_processor = input_registry.create_input_processor(
333 model_config)
--> 335 self.model_executor = executor_class(
336 model_config=model_config,
337 cache_config=cache_config,
338 parallel_config=parallel_config,
339 scheduler_config=scheduler_config,
340 device_config=device_config,
341 lora_config=lora_config,
342 speculative_config=speculative_config,
343 load_config=load_config,
344 prompt_adapter_config=prompt_adapter_config,
345 observability_config=self.observability_config,
346 )
348 if self.model_config.task != "embedding":
349 self._initialize_kv_caches()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/executor/executor_base.py:47, in ExecutorBase.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
45 self.prompt_adapter_config = prompt_adapter_config
46 self.observability_config = observability_config
---> 47 self._init_executor()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:40, in GPUExecutor._init_executor(self)
38 self.driver_worker = self._create_worker()
39 self.driver_worker.init_device()
---> 40 self.driver_worker.load_model()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/worker/worker.py:180, in Worker.load_model(self)
179 def load_model(self):
--> 180 self.model_runner.load_model()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/worker/model_runner.py:1057, in GPUModelRunnerBase.load_model(self)
1055 logger.info("Starting to load model %s...", self.model_config.model)
1056 with DeviceMemoryProfiler() as m:
-> 1057 self.model = get_model(model_config=self.model_config,
1058 device_config=self.device_config,
1059 load_config=self.load_config,
1060 lora_config=self.lora_config,
1061 parallel_config=self.parallel_config,
1062 scheduler_config=self.scheduler_config,
1063 cache_config=self.cache_config)
1065 self.model_memory_usage = m.consumed_memory
1066 logger.info("Loading model weights took %.4f GB",
1067 self.model_memory_usage / float(2**30))

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, cache_config)
13 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
14 device_config: DeviceConfig, parallel_config: ParallelConfig,
15 scheduler_config: SchedulerConfig,
16 lora_config: Optional[LoRAConfig],
17 cache_config: CacheConfig) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 parallel_config=parallel_config,
23 scheduler_config=scheduler_config,
24 cache_config=cache_config)

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py:1148, in BitsAndBytesModelLoader.load_model(self, model_config, device_config, lora_config, parallel_config, scheduler_config, cache_config)
1144 with torch.device(device_config.device):
1145 model = _initialize_model(model_config, self.load_config,
1146 lora_config, cache_config)
-> 1148 self._load_weights(model_config, model)
1150 return model.eval()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py:1033, in BitsAndBytesModelLoader._load_weights(self, model_config, model)
1028 raise AttributeError(
1029 "The required method 'load_weights' is not defined in class"
1030 f" {type(model).__name__}.")
1032 if not hasattr(model, 'bitsandbytes_stacked_params_mapping'):
-> 1033 raise AttributeError(
1034 f"Model {type(model).__name__} does not support BitsAndBytes "
1035 "quantization yet.")
1037 if len(self.target_modules) == 0:
1038 if hasattr(model, 'default_bitsandbytes_target_modules'):

AttributeError: Model LlavaForConditionalGeneration does not support BitsAndBytes quantization yet.

@mgoin (Member, Author) commented Oct 22, 2024

@rebel-jonghewk Ah thanks for reporting this issue. I was going to work on making a non-xformers backend for Pixtral, but in the meantime I can at least make the import lazy to solve your issue.
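
As a rough sketch of the lazy-import idea (not necessarily how the eventual fix was implemented), the xformers import can be deferred into the code path that actually needs it, so merely importing the pixtral module on non-CUDA platforms does not pull in xformers:

def _memory_efficient_attention(query, key, value, attn_bias=None):
    # Deferred import: xformers is only required when this CUDA-only
    # attention path actually runs, not at module import time.
    from xformers import ops as xops
    return xops.memory_efficient_attention(query, key, value, attn_bias=attn_bias)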

@pratyush0599 I'll need to look into that model checkpoint, will do. For now you should be able to use the in-flight bnb quant with the "--quantization bitsandbytes" flag

@pratyush0599

@mgoin Hey, thanks for the prompt reply. I tried using vllm serve and in-flight quantization for the original Pixtral model ("mistralai/Pixtral-12B-2409") and got the same error. I tried both models (one uses LlavaForConditionalGeneration and the other uses PixtralForConditionalGeneration), but I receive the same error as above. This was my code:
LLM(#"mistral-community/pixtral-12b", "mistralai/Pixtral-12B-2409", quantization="bitsandbytes", load_format="bitsandbytes", dtype=torch.bfloat16, tokenizer_mode='mistral', trust_remote_code=True)

@DarkLight1337 deleted the support-pixtral-hf-format branch on October 23, 2024 at 12:22
charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
6 participants