
Multimodal-in and multimodal-out #18

Open · JoyBoy-Su opened this issue Jul 11, 2024 · 5 comments
Labels: enhancement (New feature or request), inference (Something about inference), priority: high (Issue with high priority)

Comments

@JoyBoy-Su (Collaborator)

We will implement the script so that the model can take images as input.

@JoyBoy-Su (Collaborator, Author)

We provide a script for multimodal inference; you can follow the instructions to run it.
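
For reference, the script's input.json is a JSON array of typed text/image items (the examples later in this thread show the exact shape). A minimal sketch of building one programmatically; the prompt and image path below are placeholders, not files from the repo:

```python
import json

# Build a multimodal request: a JSON array of {"type", "content"} items,
# where "type" is "text" or "image" and image content is a file path.
request = [
    {"type": "text", "content": "Describe this scene, then draw a similar one."},
    {"type": "image", "content": "./examples/lakeside.jpg"},  # placeholder path
]

with open("input.json", "w") as f:
    json.dump(request, f, indent=4)
```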

@Mr-Loevan

> We provide a script for multimodal inference; you can follow the instructions to run it.

Thanks for your good work! I tried the multimodal-in and multimodal-out script, but it generates nothing when prompted to generate images. What could be the reason?

@JoyBoy-Su (Collaborator, Author)

@Mr-Loevan Hi, could you give us more details? For example, your input.json and your model's output.

I just tried to use the following input.json for inference:

[
    {
        "type": "text",
        "content": "Draw a picture showing a serene lakeside view at sunrise with mist rising from the water, surrounded by dense pine forests and mountains in the background."
    }
]

The output of the model is as follows:

It is a picturesque scene that reflects the beauty of nature in all its glory. The image captures the early morning hours when the sun rises over the horizon, casting a warm glow over the landscape. The lake surface is mirror-like, creating a reflection of the surrounding trees and mountains. There is a sense of tranquility and peace in the air, as if the area is protected from the hustle and bustle of everyday life.
<img: ./outputs/inference/1.png>

./outputs/inference/1.png:
[generated image attached]
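
As the transcript above shows, the model embeds generated images in its text output via inline markers such as <img: ./outputs/inference/1.png>. A small sketch for splitting the text from the referenced image paths, assuming that marker format is stable:

```python
import re

# Matches inline image markers of the form "<img: path>".
IMG_MARKER = re.compile(r"<img:\s*([^>]+?)\s*>")

def extract_images(output_text: str) -> tuple[str, list[str]]:
    """Return the plain text and the list of referenced image paths."""
    paths = IMG_MARKER.findall(output_text)
    text_only = IMG_MARKER.sub("", output_text).strip()
    return text_only, paths

text, images = extract_images(
    "A serene lakeside at sunrise... <img: ./outputs/inference/1.png>"
)
print(images)  # ['./outputs/inference/1.png']
```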

@URRealHero commented Sep 10, 2024

Hello there, I ran into a similar problem, and the run-to-run variance is really high: if I rerun my script (modified based on inference.py), the same prompt leads to different results.
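
If inference.py samples during decoding (temperature/top-p) without a fixed seed, run-to-run variance like this is expected. A minimal sketch of pinning the randomness, assuming a PyTorch-based pipeline; whether the repo's generation call exposes do_sample is an assumption, not something the thread confirms:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Pin every RNG the generation path might touch so reruns
    # of the same prompt are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything()

# If the script uses a Hugging Face-style generate() (assumption),
# greedy decoding removes the sampling randomness entirely:
# outputs = model.generate(**inputs, do_sample=False)
```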
There are three outcomes in multimodal-in, multimodal-out (a small classification sketch follows at the end of this comment):

1. Most of the time, the model outputs nothing.

E.g.: [input image attached]

Part of the input:

{
    "type": "text",
    "content": "Select and extract [wine, condiment bottle, bread, glass, beverage bottle, , toaster] from the image . For each object, generate a separate and independent image that closely resembles its state. Output the object image and its detailed caption according to the sequence of previous list. \nOutput Requirement: Start with the whole image description.  Then, for each object, display the object's image following its caption. When multiple objects interact, describe them together with conjunctions. \nFor example: [Whole image Description]. <Object 1 image> [Object 1 caption]. <Object 2 image> <Object 3 image> [Object 2 caption] (conjunction) [Object 3 caption]. <Object 4 image> [Object 4 caption] ..."
},
{
    "type": "image",
    "content": "xxx/0000_q1.jpg"
}

Output:
[
    {
        "type": "text",
        "content": ""
    }
]

2. Sometimes it generates text only.

Input: [input image attached]

Part of the input:

{
    "type": "text",
    "content": "Select and extract [commode, tissue box, plunger, trash can] from the image . For each object, generate a separate and independent image for each object that closely resembles its state. Display the object image followed by its detailed caption according to the sequence of the previous list. \nOutput Requirement: Start with the whole image description.  Then, for each object, display the object's image following its caption. When multiple objects interact, describe them together with conjunctions. \nFor example: [Whole image Description]. <Object 1 image> [Object 1 caption]. <Object 2 image> <Object 3 image> [Object 2 caption] (conjunction) [Object 3 caption]. <Object 4 image> [Object 4 caption]. ..."
},
{
    "type": "image",
    "content": "xxx/0002_q1.jpg"
}

Output:
[
    {
        "type": "text",
        "content": "The above image depicts a toilet bowl situated in a small bathroom. The toilet bowl is positioned on the left side of the image, and its lid is open, revealing the inner workings of the toilet. Behind the toilet, there is a white commode with a clear glass bowl, which has an open lid, exposing the contents. There is also a blue tissue box positioned to the right of the toilet. In the lower-right corner, there is a white plunger and a chrome-plated wastebasket behind the toilet."
    }
]

3. Very rarely, only one image is generated.

Input: [input image attached]

Part of the input:

{
    "type": "text",
    "content": "Select and extract [sink, oven, hearth] from the image . For each object, generate a separate and independent image for each object that closely resembles its state. Display the object image followed by its detailed caption according to the sequence of the previous list. \nOutput Requirement: Start with the whole image description.  Then, for each object, display the object's image following its caption. When multiple objects interact, describe them together with conjunctions. \nFor example: [Whole image Description]. <Object 1 image> [Object 1 caption]. <Object 2 image> <Object 3 image> [Object 2 caption] (conjunction) [Object 3 caption]. <Object 4 image> [Object 4 caption]. ..."
},
{
    "type": "image",
    "content": "xxx/0009_q1.jpg"
}

Output:
[
    {
        "type": "text",
        "content": "The image shows a modern kitchen with stainless steel appliances, including a gas oven, a stainless steel sink, and a built-in stainless steel refrigerator. A stainless steel cooking pan is seen on the stove. The countertops are a neutral beige and black marble, respectively, and the floor is black. The walls are white. \n\nThe image below depicts a blue-faced stove:"
    },
    {
        "type": "image",
        "content": "examples/0009/image_1.png"
    },
    {
        "type": "text",
        "content": "The stove is a gas stove, marked by its blue face."
    }
]
![image](https://github.com/user-attachments/assets/38aed349-0400-44ae-9f08-f0fd78e996eb)

If you have any solution to this, please let me know.
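
For triaging runs at scale, here is the classification sketch mentioned above: it buckets an output file into the three observed outcomes, assuming the output JSON format shown in this thread (the file name is hypothetical):

```python
import json

def classify(output_path: str) -> str:
    """Bucket a run into 'empty', 'text-only', or 'text+image'."""
    with open(output_path) as f:
        items = json.load(f)
    has_image = any(item["type"] == "image" for item in items)
    has_text = any(
        item["type"] == "text" and item["content"].strip() for item in items
    )
    if has_image:
        return "text+image"
    return "text-only" if has_text else "empty"

print(classify("output.json"))  # hypothetical output file
```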

@URRealHero

I found that if I remove the output-format requirement from the prompt, the model generates more content.
