
Support for multi-modal models #662

Closed
rlouf opened this issue Feb 14, 2024 · 5 comments · Fixed by #1052
Labels: enhancement · llama.cpp (Related to the `llama.cpp` integration) · transformers (Linked to the `transformers` integration)

Comments

rlouf (Member) commented Feb 14, 2024

Presentation of the new feature

More and more accessible multi-modal models are appearing, such as LLaVA, and constrained generation applies to every auto-regressive text generation model regardless of its input.

Where does it fit in Outlines?

Perhaps the most reasonable approach would be to let users pass (prompt, image) tuples to the API functions and use multipledispatch to dispatch on both the model and the prompt. Alternatively, we could create a new MultimodalModel class and only dispatch on the model type, as we currently do.
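For concreteness, here is a rough sketch of what the two options could look like from the user's side. All names below, including the multimodal constructor and the images keyword, are illustrative rather than existing Outlines API:

    # Purely illustrative sketch; neither API exists in Outlines yet.
    from PIL import Image

    import outlines

    image = Image.open("photo.jpg")

    # Option 1: pass a (prompt, image) tuple and dispatch on both the model and the prompt.
    model = outlines.models.transformers("llava-hf/llava-1.5-7b-hf")  # hypothetical usage
    generator = outlines.generate.text(model)
    description = generator(("Describe the image.", image))

    # Option 2: a dedicated MultimodalModel class, dispatching on the model type only.
    model = outlines.models.multimodal("llava-hf/llava-1.5-7b-hf")  # hypothetical constructor
    generator = outlines.generate.text(model)
    description = generator("Describe the image.", images=image)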

We need to make sure users can't unknowingly shoot themselves in the foot; the MultimodalModel class would make this easy.

My main concern is that we might need to make the generator more complex, or duplicate part of it.

Are you willing to open a PR?

Yes, although I'd appreciate it if someone else were willing to take the lead. Happy to help with the design.

rlouf added the enhancement, transformers (Linked to the `transformers` integration), and llama.cpp (Related to the `llama.cpp` integration) labels on Feb 14, 2024
lapp0 (Contributor) commented Feb 14, 2024

Here is what the transformers interface looks like

https://github.com/huggingface/transformers/blob/354775bc5755c4a6c47e008d28f27f8ccdcf8f8f/src/transformers/models/llava/modeling_llava.py#L377-L395

    >>> from PIL import Image
    >>> import requests
    >>> from transformers import AutoProcessor, LlavaForConditionalGeneration

    >>> model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
    >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

    >>> prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
    >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(text=prompt, images=image, return_tensors="pt")

    >>> # Generate
    >>> generate_ids = model.generate(**inputs, max_length=30)
    >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    "\nUSER: What's the content of the image?\nASSISTANT: The image features a stop sign on a street corner"

inputs contains input_ids, attention_mask, and pixel_values.

I agree regarding complexity: the generator would need to manage pixel_values as well. The main difference would be augmenting the attention mask, which would need to happen in sequence_generator since this augmentation is applied on every forward pass. Here is how transformers does it:

https://github.com/huggingface/transformers/blob/354775bc5755c4a6c47e008d28f27f8ccdcf8f8f/src/transformers/models/llava/modeling_llava.py#L430-L433
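
As a minimal sketch of that augmentation, assuming the image features have already been merged into the input embeddings, and with an illustrative function name rather than anything from transformers or Outlines:

    import torch

    def extend_attention_mask(attention_mask: torch.Tensor, num_image_features: int) -> torch.Tensor:
        """Prepend ones covering the image feature positions to a text attention mask."""
        batch_size = attention_mask.shape[0]
        image_mask = torch.ones(
            (batch_size, num_image_features),
            dtype=attention_mask.dtype,
            device=attention_mask.device,
        )
        return torch.cat([image_mask, attention_mask], dim=1)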

I propose:

  • We create an abstract LLaVaModel and allow passing an images kwarg to generator.
  • We create a LlavaSequenceGenerator subclass which handles the necessary logic for multi-modal models. This subclass is used whenever a LLaVaModel is used.
  • We have two options for sequence_generator
    • a. Create a separate llava_sequence_generator which applies the image_features to the attention mask
    • b. We refactor SequenceGenerator to expose SequenceGenerator.gen_tokens, which LlavaSequenceGenerator overrides (I prefer this one; unifying the API into one module makes a lot of sense in terms of clarity and composability). A rough sketch follows below.
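
A rough sketch of option b, with purely hypothetical class and method names:

    # Hypothetical sketch of option b; names do not reflect the current Outlines API.
    class SequenceGenerator:
        def gen_tokens(self, token_ids, attention_mask, **model_kwargs):
            """Constrained token-by-token generation loop shared by all models."""
            ...

    class LlavaSequenceGenerator(SequenceGenerator):
        def gen_tokens(self, token_ids, attention_mask, pixel_values=None, **model_kwargs):
            """Compute the image features once, extend the attention mask so it
            covers them on every forward pass, then reuse the shared loop."""
            ...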

Would love to know your thoughts!

Reichenbachian commented

Any updates on this thread?

Kamakshi8104 commented

Hey! I just wanted to know if multimodal models can be used with the connector being implemented in issue #728.

rlouf (Member, Author) commented Mar 12, 2024

Yes, you should be able to use this with multimodal models!

cpfiffer (Contributor) commented

For anyone who stumbles on this later, check out the relevant cookbook for working with vision models here.
