
Multimodal-in and multimodal-out #18

Open · JoyBoy-Su opened this issue Jul 11, 2024 · 5 comments
Labels: enhancement (New feature or request), inference (Something about inference), priority: high (Issue with high priority)

Comments

@JoyBoy-Su (Collaborator)

We will implement the script so that the model can take images as input.

@JoyBoy-Su (Collaborator, Author)

We provide a script for multimodal inference; you can follow the instructions to run it.
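
For reference, the script's input.json is a JSON array of typed text/image items (the examples later in this thread show the exact shape). A minimal sketch of building one programmatically; the prompt and image path below are placeholders, not files from the repo:

```python
import json

# Build a multimodal request: a JSON array of {"type", "content"} items,
# where "type" is "text" or "image" and image content is a file path.
request = [
    {"type": "text", "content": "Describe this scene, then draw a similar one."},
    {"type": "image", "content": "./examples/lakeside.jpg"},  # placeholder path
]

with open("input.json", "w") as f:
    json.dump(request, f, indent=4)
```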

@Mr-Loevan

> We provide a script for multimodal inference; you can follow the instructions to run it.

Thanks for your good work! I tried the multimodal-in and multimodal-out script, but it generates nothing when prompted to generate images. What could be the reason?

@JoyBoy-Su (Collaborator, Author)

@Mr-Loevan Hi, could you give us more details? For example, your input.json and your model's output.

I just tried to use the following input.json for inference:

[
    {
        "type": "text",
        "content": "Draw a picture showing a serene lakeside view at sunrise with mist rising from the water, surrounded by dense pine forests and mountains in the background."
    }
]

The output of the model is as follows:

It is a picturesque scene that reflects the beauty of nature in all its glory. The image captures the early morning hours when the sun rises over the horizon, casting a warm glow over the landscape. The lake surface is mirror-like, creating a reflection of the surrounding trees and mountains. There is a sense of tranquility and peace in the air, as if the area is protected from the hustle and bustle of everyday life.
<img: ./outputs/inference/1.png>

./outputs/inference/1.png:
[generated image attached]
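
As the transcript above shows, the model embeds generated images in its text output via inline markers such as <img: ./outputs/inference/1.png>. A small sketch for splitting the text from the referenced image paths, assuming that marker format is stable:

```python
import re

# Matches inline image markers of the form "<img: path>".
IMG_MARKER = re.compile(r"<img:\s*([^>]+?)\s*>")

def extract_images(output_text: str) -> tuple[str, list[str]]:
    """Return the plain text and the list of referenced image paths."""
    paths = IMG_MARKER.findall(output_text)
    text_only = IMG_MARKER.sub("", output_text).strip()
    return text_only, paths

text, images = extract_images(
    "A serene lakeside at sunrise... <img: ./outputs/inference/1.png>"
)
print(images)  # ['./outputs/inference/1.png']
```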

@URRealHero commented Sep 10, 2024

Hello there, I ran into a similar problem, and the run-to-run variance is really high: if I rerun my script (modified based on inference.py), the same prompt leads to different results.
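
If inference.py samples during decoding (temperature/top-p) without a fixed seed, run-to-run variance like this is expected. A minimal sketch of pinning the randomness, assuming a PyTorch-based pipeline; whether the repo's generation call exposes do_sample is an assumption, not something the thread confirms:

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Pin every RNG the generation path might touch so reruns
    # of the same prompt are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything()

# If the script uses a Hugging Face-style generate() (assumption),
# greedy decoding removes the sampling randomness entirely:
# outputs = model.generate(**inputs, do_sample=False)
```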
There are three outcomes in multimodal-in, multimodal-out (a small classification sketch follows at the end of this comment):

1. Most of the time, the model outputs nothing.

E.g.: [input image attached]

Part of the input:

{
    "type": "text",
    "content": "Select and extract [wine, condiment bottle, bread, glass, beverage bottle, , toaster] from the image . For each object, generate a separate and independent image that closely resembles its state. Output the object image and its detailed caption according to the sequence of previous list. \nOutput Requirement: Start with the whole image description.  Then, for each object, display the object's image following its caption. When multiple objects interact, describe them together with conjunctions. \nFor example: [Whole image Description]. <Object 1 image> [Object 1 caption]. <Object 2 image> <Object 3 image> [Object 2 caption] (conjunction) [Object 3 caption]. <Object 4 image> [Object 4 caption] ..."
},
{
    "type": "image",
    "content": "xxx/0000_q1.jpg"
}

Output:
[
    {
        "type": "text",
        "content": ""
    }
]

2. Sometimes it generates text only.

Input: [input image attached]

Part of the input:

{
    "type": "text",
    "content": "Select and extract [commode, tissue box, plunger, trash can] from the image . For each object, generate a separate and independent image for each object that closely resembles its state. Display the object image followed by its detailed caption according to the sequence of the previous list. \nOutput Requirement: Start with the whole image description.  Then, for each object, display the object's image following its caption. When multiple objects interact, describe them together with conjunctions. \nFor example: [Whole image Description]. <Object 1 image> [Object 1 caption]. <Object 2 image> <Object 3 image> [Object 2 caption] (conjunction) [Object 3 caption]. <Object 4 image> [Object 4 caption]. ..."
},
{
    "type": "image",
    "content": "xxx/0002_q1.jpg"
}

Output:
[
    {
        "type": "text",
        "content": "The above image depicts a toilet bowl situated in a small bathroom. The toilet bowl is positioned on the left side of the image, and its lid is open, revealing the inner workings of the toilet. Behind the toilet, there is a white commode with a clear glass bowl, which has an open lid, exposing the contents. There is also a blue tissue box positioned to the right of the toilet. In the lower-right corner, there is a white plunger and a chrome-plated wastebasket behind the toilet."
    }
]

3. Very rarely, only one image is generated.

Input: [input image attached]

Part of the input:

{
    "type": "text",
    "content": "Select and extract [sink, oven, hearth] from the image . For each object, generate a separate and independent image for each object that closely resembles its state. Display the object image followed by its detailed caption according to the sequence of the previous list. \nOutput Requirement: Start with the whole image description.  Then, for each object, display the object's image following its caption. When multiple objects interact, describe them together with conjunctions. \nFor example: [Whole image Description]. <Object 1 image> [Object 1 caption]. <Object 2 image> <Object 3 image> [Object 2 caption] (conjunction) [Object 3 caption]. <Object 4 image> [Object 4 caption]. ..."
},
{
    "type": "image",
    "content": "xxx/0009_q1.jpg"
}

Output:
[
    {
        "type": "text",
        "content": "The image shows a modern kitchen with stainless steel appliances, including a gas oven, a stainless steel sink, and a built-in stainless steel refrigerator. A stainless steel cooking pan is seen on the stove. The countertops are a neutral beige and black marble, respectively, and the floor is black. The walls are white. \n\nThe image below depicts a blue-faced stove:"
    },
    {
        "type": "image",
        "content": "examples/0009/image_1.png"
    },
    {
        "type": "text",
        "content": "The stove is a gas stove, marked by its blue face."
    }
]
![image](https://github.com/user-attachments/assets/38aed349-0400-44ae-9f08-f0fd78e996eb)

If you have any solution to this, please let me know.
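
For triaging runs at scale, here is the classification sketch mentioned above: it buckets an output file into the three observed outcomes, assuming the output JSON format shown in this thread (the file name is hypothetical):

```python
import json

def classify(output_path: str) -> str:
    """Bucket a run into 'empty', 'text-only', or 'text+image'."""
    with open(output_path) as f:
        items = json.load(f)
    has_image = any(item["type"] == "image" for item in items)
    has_text = any(
        item["type"] == "text" and item["content"].strip() for item in items
    )
    if has_image:
        return "text+image"
    return "text-only" if has_text else "empty"

print(classify("output.json"))  # hypothetical output file
```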

@URRealHero

I found that if I remove the output-format requirement from the prompt, the model generates more content.
