[Model] Support Pixtral models in the HF Transformers format #9036

mgoin · 2024-10-03T07:06:59Z

Introduces PixtralHF, which is a model implementing HF's format of Pixtral. Based off https://github.com/huggingface/transformers/blob/main/src/transformers/models/pixtral/modeling_pixtral.py

Tested with:

This model implementation follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders. The model uses [PixtralVisionModel] for its vision encoder, and [MistralForCausalLM] for its language decoder.

Example output from python examples/offline_inference_vision_language.py --model pixtral_hf:

The image features a prominent structure in the background, which is the Tokyo Skytree, a broadcasting and observation tower located in Tokyo, Japan. The Tokyo Skytree is the tallest tower in the world and is known for its distinctive lattice structure and spherical observation decks.

In the foreground, there are cherry blossom
The image features a beautiful scene with cherry blossoms in the foreground, framing a tall, modern tower in the background. The cherry blossoms are in full bloom, with delicate pink flowers covering the branches, creating a picturesque and serene atmosphere. The tower in the background appears to be a significant architectural structure, possibly
The image features a beautiful scene with cherry blossoms in full bloom in the foreground, framing a tall, modern tower in the background. The cherry blossoms, with their delicate pink and white flowers, create a picturesque and serene atmosphere. The tower, which appears to be a significant landmark, stands prominently against a
The image depicts a scene with cherry blossoms in full bloom, creating a picturesque and vibrant foreground. The delicate pink flowers are prominently displayed, framing the view and adding a sense of natural beauty and tranquility. In the background, there is a tall, modern tower with a distinctive architectural design, featuring a spherical observation

Offline multi-image example

Script used for simple testing of multi-image:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "mistral-community/pixtral-12b"
llm = LLM(
    model=model_name, 
    max_num_seqs=1, 
    enforce_eager=True, 
    max_model_len=10000, 
    limit_mm_per_prompt={"image": 2}
)

image1 = ImageAsset("cherry_blossom").pil_image.convert("RGB")
image2 = ImageAsset("stop_sign").pil_image.convert("RGB")
inputs = {
    "prompt": f"<s>[INST]Describe the images.\n[IMG][IMG][/INST]",
    "multi_modal_data": {
        "image": [image1, image2]
    },
}
outputs = llm.generate(inputs, sampling_params=SamplingParams(temperature=0.0, max_tokens=200))

print(outputs[0].outputs[0].text)

Output:

The first image depicts a beautiful scene with cherry blossoms in full bloom, framing a tall, modern tower in the background. The cherry blossoms, with their delicate pink flowers, create a picturesque foreground against a clear blue sky. The tower, likely an observation or communication structure, stands prominently in the center, adding a contrast between natural beauty and modern architecture.

The second image shows an urban street scene with a stop sign in the foreground. The stop sign is positioned in front of a traditional Chinese archway, which is decorated with red and gold colors and Chinese characters. The archway leads into a Chinatown area, as indicated by the signage and the architectural style. There is a black SUV driving past the archway, and the street is lined with various shops and buildings, including a visible Optus store. The scene captures a blend of traditional and modern elements within an urban setting.

Offline chat example

Script used for testing of chat templating:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.multimodal.utils import encode_image_base64

def image_url(asset: str):
    image = ImageAsset(asset)
    base64 = encode_image_base64(image.pil_image)
    return f"data:image/jpeg;base64,{base64}"

model_name = "mistral-community/pixtral-12b"
llm = LLM(
    model=model_name,
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=10000,
)

chat_template = "{%- if messages[0][\"role\"] == \"system\" %}\n    {%- set system_message = messages[0][\"content\"] %}\n    {%- set loop_messages = messages[1:] %}\n{%- else %}\n    {%- set loop_messages = messages %}\n{%- endif %}\n\n{{- bos_token }}\n{%- for message in loop_messages %}\n    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}\n        {{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}\n    {%- endif %}\n    {%- if message[\"role\"] == \"user\" %}\n        {%- if loop.last and system_message is defined %}\n            {{- \"[INST]\" + system_message + \"\n\n\" }}\n        {%- else %}\n            {{- \"[INST]\" }}\n        {%- endif %}\n        {%- if message[\"content\"] is not string %}\n            {%- for chunk in message[\"content\"] %}\n                {%- if chunk[\"type\"] == \"text\" %}\n                    {{- chunk[\"content\"] }}\n                {%- elif chunk[\"type\"] == \"image\" %}\n                    {{- \"[IMG]\" }}\n                {%- else %}\n                    {{- raise_exception(\"Unrecognized content type!\") }}\n                {%- endif %}\n            {%- endfor %}\n        {%- else %}\n            {{- message[\"content\"] }}\n        {%- endif %}\n        {{- \"[/INST]\" }}\n    {%- elif message[\"role\"] == \"assistant\" %}\n        {{- message[\"content\"] + eos_token}}\n    {%- else %}\n        {{- raise_exception(\"Only user and assistant roles are supported, with the exception of an initial optional system message!\") }}\n    {%- endif %}\n{%- endfor %}"  # noqa
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": image_url("stop_sign")}},
        ],
    },
]
outputs = llm.chat(messages,
                   sampling_params=SamplingParams(temperature=0.0, max_tokens=100),
                   chat_template=chat_template)

print(outputs[0].outputs[0].text)

Output:

The image depicts a street scene in what appears to be a Chinatown. Prominently in the foreground is a red "STOP" sign. Behind the sign, there is a traditional Chinese archway with intricate designs and Chinese characters. The archway is painted in vibrant colors, predominantly red and gold. 

To the right of the archway, there is a black SUV driving on the road. The street is lined with various shops and businesses, some of which have signs in both

github-actions · 2024-10-03T07:07:12Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

wuxiyiye · 2024-10-09T12:17:44Z

Hi @mgoin , thanks for your contribution! will you continue to fix the PR?

mgoin · 2024-10-09T17:42:50Z

@wuxiyiye I'm slowly working through the issues but it is quite a lot due to poor reuse of existing Llava features. I would greatly appreciate if others would have bandwidth to work on this

mgoin · 2024-10-16T20:38:27Z

Also I have verified that an FP8 checkpoint loads and produces good output:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "nm-testing/pixtral-12b-FP8-dynamic"
llm = LLM(
    model=model_name, 
    max_num_seqs=1, 
    enforce_eager=True, 
    max_model_len=10000, 
    limit_mm_per_prompt={"image": 2}
)

image1 = ImageAsset("cherry_blossom").pil_image.convert("RGB")
image2 = ImageAsset("stop_sign").pil_image.convert("RGB")
inputs = {
    "prompt": f"<s>[INST]Describe the images.\n[IMG][IMG][/INST]",
    "multi_modal_data": {
        "image": [image1, image2]
    },
}
outputs = llm.generate(inputs, sampling_params=SamplingParams(temperature=0.0, max_tokens=200))

print(outputs[0].outputs[0].text)

The image on the left depicts a tall, modern tower with a unique architectural design, partially obscured by cherry blossom trees in full bloom. The blossoms are vibrant and abundant, creating a picturesque scene against a clear blue sky. The tower appears to be a significant landmark, possibly a telecommunications or observation tower, given its height and structure.

The image on the right shows a street scene in what appears to be a Chinatown district. Prominent in the foreground is a red stop sign with white lettering. Behind the stop sign, there is an ornate, traditional Chinese archway with red and gold decorations and Chinese characters. The archway frames a street lined with various shops and businesses, including a visible sign for "Optus." A black SUV is driving through the intersection, and there are bollards and a tree in the vicinity. The overall atmosphere suggests a blend of traditional and modern elements in an urban setting.

DarkLight1337

Thanks for your hard work! Some initial comments.

vllm/model_executor/models/pixtral.py

DarkLight1337 · 2024-10-17T14:30:29Z

vllm/model_executor/models/pixtral.py

+        replace_tokens = [[processor.image_token] * num_width_tokens +
+                          [processor.image_break_token]] * num_height_tokens
+        # Flatten list
+        replace_tokens = [
+            item for sublist in replace_tokens for item in sublist
+        ]
+        replace_tokens[-1] = processor.image_end_token
+        replace_str = "".join(replace_tokens)
+        replace_strings.append(replace_str)
+        new_prompt = new_prompt.replace(processor.image_token, "<placeholder>",
+                                        1)
+
+    while "<placeholder>" in new_prompt:
+        replace_str = replace_strings.pop(0)
+        new_prompt = new_prompt.replace("<placeholder>", replace_str, 1)


Depending on the prompt, this may be quite expensive. I suggest using the more optimized vllm.multimodal.utils.repeat_and_pad_placeholder_tokens function.

The issue with using repeat_and_pad_placeholder_tokens is that we need to insert image_break_token at the end of every row and image_end_token at the end, along with multiple different sized images in a prompt. I think we can optimize this later with a new implementation that can support this

I see, let's do it in another PR then. We should also TP the model in the future.

vllm/model_executor/models/pixtral.py

DarkLight1337

Overall looks good. Please see my comment above though.

Also, we should add the HF version to our list of supported models.

rebel-jonghewk · 2024-10-22T02:21:55Z

@mgoin I'm trying to run LlavaNextForConditionalGeneration on a non-CUDA hardware platform, but when I import llava_next, it pulls in pixtral, which in turn imports xformers. From my understanding, xformers is required only for CUDA support. Is there a way to avoid this dependency or run LlavaNextForConditionalGeneration without xformers on non-CUDA platforms?

pratyush0599 · 2024-10-22T08:57:18Z

Also I have verified that an FP8 checkpoint loads and produces good output:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

model_name = "nm-testing/pixtral-12b-FP8-dynamic"
llm = LLM(
    model=model_name, 
    max_num_seqs=1, 
    enforce_eager=True, 
    max_model_len=10000, 
    limit_mm_per_prompt={"image": 2}
)

image1 = ImageAsset("cherry_blossom").pil_image.convert("RGB")
image2 = ImageAsset("stop_sign").pil_image.convert("RGB")
inputs = {
    "prompt": f"<s>[INST]Describe the images.\n[IMG][IMG][/INST]",
    "multi_modal_data": {
        "image": [image1, image2]
    },
}
outputs = llm.generate(inputs, sampling_params=SamplingParams(temperature=0.0, max_tokens=200))

print(outputs[0].outputs[0].text)

The image on the left depicts a tall, modern tower with a unique architectural design, partially obscured by cherry blossom trees in full bloom. The blossoms are vibrant and abundant, creating a picturesque scene against a clear blue sky. The tower appears to be a significant landmark, possibly a telecommunications or observation tower, given its height and structure.

The image on the right shows a street scene in what appears to be a Chinatown district. Prominent in the foreground is a red stop sign with white lettering. Behind the stop sign, there is an ornate, traditional Chinese archway with red and gold decorations and Chinese characters. The archway frames a street lined with various shops and businesses, including a visible sign for "Optus." A black SUV is driving through the intersection, and there are bollards and a tree in the vicinity. The overall atmosphere suggests a blend of traditional and modern elements in an urban setting.

Great work on this issue guys! However, I was wondering why "nm-testing/pixtral-12b-FP8-dynamic" is supported by vllm and "SeanScripts/pixtral-12b-nf4" (uses bitsandbytes) isn't. I get the same error as mentioned in FIX #9069 .Thoughts?
llm = LLM( model="SeanScripts/pixtral-12b-nf4", max_num_seqs=1, enforce_eager=True, max_model_len=10000, quantization="bitsandbytes", load_format="bitsandbytes" )

Error Details

INFO 10-22 09:09:17 config.py:1700] Downcasting torch.float32 to torch.float16. WARNING 10-22 09:09:24 config.py:361] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models. WARNING 10-22 09:09:24 config.py:435] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used INFO 10-22 09:09:24 llm_engine.py:238] Initializing an LLM engine (v0.6.3.post2.dev37+g696b01af) with config: model='SeanScripts/pixtral-12b-nf4', speculative_config=None, tokenizer='SeanScripts/pixtral-12b-nf4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=SeanScripts/pixtral-12b-nf4, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None) INFO 10-22 09:09:27 model_runner.py:1055] Starting to load model SeanScripts/pixtral-12b-nf4... /opt/conda/envs/prats/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") /opt/conda/envs/prats/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd")

AttributeError Traceback (most recent call last)
Cell In[1], line 5
2 from vllm.assets.image import ImageAsset
4 model_name = "SeanScripts/pixtral-12b-nf4"
----> 5 llm = LLM(
6 model=model_name,
7 max_num_seqs=1,
8 enforce_eager=True,
9 max_model_len=10000,
10 quantization="bitsandbytes",
11 load_format="bitsandbytes"
12 )

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/utils.py:1073, in deprecate_args..wrapper..inner(*args, **kwargs)
1066 msg += f" {additional_message}"
1068 warnings.warn(
1069 DeprecationWarning(msg),
1070 stacklevel=3, # The inner function takes up one level
1071 )
-> 1073 return fn(*args, **kwargs)

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/entrypoints/llm.py:193, in LLM.init(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, task, **kwargs)
167 kwargs["disable_log_stats"] = True
169 engine_args = EngineArgs(
170 model=model,
171 task=task,
(...)
191 **kwargs,
192 )
--> 193 self.llm_engine = LLMEngine.from_engine_args(
194 engine_args, usage_context=UsageContext.LLM_CLASS)
195 self.request_counter = Counter()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/engine/llm_engine.py:574, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
572 executor_class = cls._get_executor_cls(engine_config)
573 # Create the LLM engine.
--> 574 engine = cls(
575 **engine_config.to_dict(),
576 executor_class=executor_class,
577 log_stats=not engine_args.disable_log_stats,
578 usage_context=usage_context,
579 stat_loggers=stat_loggers,
580 )
582 return engine

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/engine/llm_engine.py:335, in LLMEngine.init(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, decoding_config, observability_config, prompt_adapter_config, executor_class, log_stats, usage_context, stat_loggers, input_registry, use_cached_outputs)
331 self.input_registry = input_registry
332 self.input_processor = input_registry.create_input_processor(
333 model_config)
--> 335 self.model_executor = executor_class(
336 model_config=model_config,
337 cache_config=cache_config,
338 parallel_config=parallel_config,
339 scheduler_config=scheduler_config,
340 device_config=device_config,
341 lora_config=lora_config,
342 speculative_config=speculative_config,
343 load_config=load_config,
344 prompt_adapter_config=prompt_adapter_config,
345 observability_config=self.observability_config,
346 )
348 if self.model_config.task != "embedding":
349 self._initialize_kv_caches()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/executor/executor_base.py:47, in ExecutorBase.init(self, model_config, cache_config, parallel_config, scheduler_config, device_config, load_config, lora_config, speculative_config, prompt_adapter_config, observability_config)
45 self.prompt_adapter_config = prompt_adapter_config
46 self.observability_config = observability_config
---> 47 self._init_executor()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:40, in GPUExecutor._init_executor(self)
38 self.driver_worker = self._create_worker()
39 self.driver_worker.init_device()
---> 40 self.driver_worker.load_model()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/worker/worker.py:180, in Worker.load_model(self)
179 def load_model(self):
--> 180 self.model_runner.load_model()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/worker/model_runner.py:1057, in GPUModelRunnerBase.load_model(self)
1055 logger.info("Starting to load model %s...", self.model_config.model)
1056 with DeviceMemoryProfiler() as m:
-> 1057 self.model = get_model(model_config=self.model_config,
1058 device_config=self.device_config,
1059 load_config=self.load_config,
1060 lora_config=self.lora_config,
1061 parallel_config=self.parallel_config,
1062 scheduler_config=self.scheduler_config,
1063 cache_config=self.cache_config)
1065 self.model_memory_usage = m.consumed_memory
1066 logger.info("Loading model weights took %.4f GB",
1067 self.model_memory_usage / float(2**30))

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/model_executor/model_loader/init.py:19, in get_model(model_config, load_config, device_config, parallel_config, scheduler_config, lora_config, cache_config)
13 def get_model(*, model_config: ModelConfig, load_config: LoadConfig,
14 device_config: DeviceConfig, parallel_config: ParallelConfig,
15 scheduler_config: SchedulerConfig,
16 lora_config: Optional[LoRAConfig],
17 cache_config: CacheConfig) -> nn.Module:
18 loader = get_model_loader(load_config)
---> 19 return loader.load_model(model_config=model_config,
20 device_config=device_config,
21 lora_config=lora_config,
22 parallel_config=parallel_config,
23 scheduler_config=scheduler_config,
24 cache_config=cache_config)

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py:1148, in BitsAndBytesModelLoader.load_model(self, model_config, device_config, lora_config, parallel_config, scheduler_config, cache_config)
1144 with torch.device(device_config.device):
1145 model = _initialize_model(model_config, self.load_config,
1146 lora_config, cache_config)
-> 1148 self._load_weights(model_config, model)
1150 return model.eval()

File /opt/conda/envs/prats/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py:1033, in BitsAndBytesModelLoader._load_weights(self, model_config, model)
1028 raise AttributeError(
1029 "The required method 'load_weights' is not defined in class"
1030 f" {type(model).name}.")
1032 if not hasattr(model, 'bitsandbytes_stacked_params_mapping'):
-> 1033 raise AttributeError(
1034 f"Model {type(model).name} does not support BitsAndBytes "
1035 "quantization yet.")
1037 if len(self.target_modules) == 0:
1038 if hasattr(model, 'default_bitsandbytes_target_modules'):

AttributeError: Model LlavaForConditionalGeneration does not support BitsAndBytes quantization yet.

mgoin · 2024-10-22T10:35:02Z

@rebel-jonghewk Ah thanks for reporting this issue. I was going to work on making a non-xformers backend for Pixtral, but in the meantime I can at least make the import lazy to solve your issue.

@pratyush0599 I'll need to look into that model checkpoint, will do. For now you should be able to use the in-flight bnb quant with the "--quantization bitsandbytes" flag

pratyush0599 · 2024-10-22T11:04:53Z

@mgoin Hey, thanks for the prompt reply I tried using vllm serve and the in-flight quantization for original pixtral model ("mistralai/Pixtral-12B-2409") and got the same error. I tried on both models as one uses LlavafoConditionalGeneration and the other uses PixtralForConditionalGeneration but I am receiving the same error as above.This was my code.:
LLM(#"mistral-community/pixtral-12b", "mistralai/Pixtral-12B-2409", quantization="bitsandbytes", load_format="bitsandbytes", dtype=torch.bfloat16, tokenizer_mode='mistral', trust_remote_code=True)

…oject#9036) Signed-off-by: charlifu <[email protected]>

…oject#9036) Signed-off-by: Vinay Damodaran <[email protected]>

…oject#9036) Signed-off-by: Alvant <[email protected]>

…oject#9036) Signed-off-by: Amit Garg <[email protected]>

…oject#9036) Signed-off-by: qishuai <[email protected]>

…oject#9036) Signed-off-by: Sumit Dubey <[email protected]>

…oject#9036)

…oject#9036) Signed-off-by: Maxime Fournioux <[email protected]>

…oject#9036) Signed-off-by: Tyler Michael Smith <[email protected]>

Support Pixtral models in the HF Transformers format

d88b49f

mgoin mentioned this pull request Oct 3, 2024

[New Model][Format]: Support the HF-version of Pixtral #8685

Closed

1 task

DarkLight1337 mentioned this pull request Oct 4, 2024

[Bug]: Issue with Pixtral Model: Unsupported Vision Configuration in vLLM ( AMD amd 7900 xtx) #9069

Closed

1 task

Merge branch 'main' into support-pixtral-hf-format

c939028

mgoin changed the title ~~Support Pixtral models in the HF Transformers format~~ [WIP] Support Pixtral models in the HF Transformers format Oct 10, 2024

mgoin added 7 commits October 10, 2024 19:35

Working for cherry blossom?

6df111e

Test script

69f47fa

Pixtral is basically working!

c0f815d

Format

cdc075d

Merge branch 'main' into support-pixtral-hf-format

9cc49d4

Merge and format

592b692

Clean up

7279180

mgoin marked this pull request as ready for review October 16, 2024 15:22

mgoin changed the title ~~[WIP] Support Pixtral models in the HF Transformers format~~ [Model] Support Pixtral models in the HF Transformers format Oct 16, 2024

mgoin added 2 commits October 16, 2024 15:57

Remove flatten_bn change

f1cc569

Better comments

466ea3e

mgoin requested a review from DarkLight1337 October 17, 2024 14:01

DarkLight1337 approved these changes Oct 17, 2024

View reviewed changes

Review comments

7f1eec3

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 17, 2024

Merge branch 'main' into support-pixtral-hf-format

bd88321

DarkLight1337 reviewed Oct 18, 2024

View reviewed changes

vllm/model_executor/models/pixtral.py Outdated Show resolved Hide resolved

DarkLight1337 approved these changes Oct 18, 2024

View reviewed changes

mgoin added 2 commits October 18, 2024 16:01

Fix new_token_ids

a8c0f35

Reuse HF PixtralRotaryEmbedding

775ff5e

mgoin merged commit 3921a2f into main Oct 18, 2024
60 checks passed

DarkLight1337 deleted the support-pixtral-hf-format branch October 23, 2024 12:22

DarkLight1337 mentioned this pull request Oct 23, 2024

[Bugfix] Remove xformers requirement for Pixtral #9597

Merged

charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

1ecfb58

…oject#9036) Signed-off-by: charlifu <[email protected]>

vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

e0e8904

…oject#9036) Signed-off-by: Vinay Damodaran <[email protected]>

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

0fa30c6

…oject#9036) Signed-off-by: Alvant <[email protected]>

garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

bd51935

…oject#9036) Signed-off-by: Amit Garg <[email protected]>

FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

00e8353

…oject#9036) Signed-off-by: qishuai <[email protected]>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

9a99e25

…oject#9036) Signed-off-by: Sumit Dubey <[email protected]>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

37d3766

…oject#9036)

mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

002d7af

…oject#9036) Signed-off-by: Maxime Fournioux <[email protected]>

tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024

[Model] Support Pixtral models in the HF Transformers format (vllm-pr…

0033b30

…oject#9036) Signed-off-by: Tyler Michael Smith <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Model] Support Pixtral models in the HF Transformers format #9036

[Model] Support Pixtral models in the HF Transformers format #9036

mgoin commented Oct 3, 2024 •

edited

Loading

github-actions bot commented Oct 3, 2024

wuxiyiye commented Oct 9, 2024

mgoin commented Oct 9, 2024

mgoin commented Oct 16, 2024

DarkLight1337 left a comment

DarkLight1337 Oct 17, 2024

mgoin Oct 17, 2024 •

edited

Loading

DarkLight1337 Oct 18, 2024

DarkLight1337 left a comment

rebel-jonghewk commented Oct 22, 2024

pratyush0599 commented Oct 22, 2024 •

edited

Loading

mgoin commented Oct 22, 2024

pratyush0599 commented Oct 22, 2024

[Model] Support Pixtral models in the HF Transformers format #9036

[Model] Support Pixtral models in the HF Transformers format #9036

Conversation

mgoin commented Oct 3, 2024 • edited Loading

Offline multi-image example

Offline chat example

github-actions bot commented Oct 3, 2024

wuxiyiye commented Oct 9, 2024

mgoin commented Oct 9, 2024

mgoin commented Oct 16, 2024

DarkLight1337 left a comment

Choose a reason for hiding this comment

DarkLight1337 Oct 17, 2024

Choose a reason for hiding this comment

mgoin Oct 17, 2024 • edited Loading

Choose a reason for hiding this comment

DarkLight1337 Oct 18, 2024

Choose a reason for hiding this comment

DarkLight1337 left a comment

Choose a reason for hiding this comment

rebel-jonghewk commented Oct 22, 2024

pratyush0599 commented Oct 22, 2024 • edited Loading

mgoin commented Oct 22, 2024

pratyush0599 commented Oct 22, 2024

mgoin commented Oct 3, 2024 •

edited

Loading

mgoin Oct 17, 2024 •

edited

Loading

pratyush0599 commented Oct 22, 2024 •

edited

Loading