
About Visual Prompt Encoder with negative visual prompts #91

Open
feivellau opened this issue Dec 6, 2024 · 6 comments

Comments

@feivellau

Hi, thanks for the great work! I also have a question related to this paper.
The paper mentions that negative example prompts can be utilized. Could you please explain how negative example prompts are implemented in the training strategy?

@Mountchicken
Collaborator

In training, negative prompts are used to contrast categories and help the model learn better separations. For example, in a mini-batch with a batch size of 2, you may have categories A and B in image 1, and C and D in image 2. The embeddings for all of these categories (A, B, C, and D) are gathered during training. When computing the classification loss, the prompt from the same category as a target is its positive, while prompts from the other categories act as negatives. For example, B, C, and D are negative prompts for category A; C, D, and A are negative prompts for category B.
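
As a rough illustration of this, here is a minimal PyTorch-style sketch of a classification loss with in-batch negatives. The shapes, names, and temperature value are my own assumptions for clarity, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes, for illustration only:
#   query_embeds:  (num_queries, dim) - decoder outputs matched to ground-truth objects
#   prompt_embeds: (num_cats, dim)    - visual prompt embeddings gathered from the whole
#                                       mini-batch, e.g. categories A, B, C, D
#   target_labels: (num_queries,)     - index into prompt_embeds of each query's category
def contrastive_cls_loss(query_embeds, prompt_embeds, target_labels, temperature=0.07):
    q = F.normalize(query_embeds, dim=-1)
    p = F.normalize(prompt_embeds, dim=-1)
    logits = q @ p.t() / temperature          # (num_queries, num_cats) similarity logits

    # Cross-entropy: each query's own category is the positive, and every other
    # category gathered from the batch (the "negative prompts") pushes its logit down.
    return F.cross_entropy(logits, target_labels)

# Toy usage: 4 categories gathered from a batch of 2 images, 3 matched queries.
loss = contrastive_cls_loss(torch.randn(3, 256), torch.randn(4, 256),
                            torch.tensor([0, 2, 3]))
```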

@feivellau
Author

feivellau commented Dec 8, 2024

So my understanding is that during training, all categories other than the current one are treated as negative examples?
If so, I also have a question about inference: how are targets filtered using negative prompts?

@Mountchicken
Collaborator

In inference, using negative prompts essentially turns detection into a multi-class classification problem. For example, if you're detecting red apples but the image also contains green apples, providing only visual prompts for red apples might cause the model to detect the green apples as well. In this case, you can provide visual prompts for both red and green apples, transforming the problem into a binary classification task. At the end, you keep only the bounding boxes that are classified as red apples.
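
To make the filtering step concrete, here is a small sketch under the same red/green apple scenario. The function and tensor names are hypothetical and only meant to illustrate the idea of scoring each box against both prompts and keeping the boxes whose best match is the red-apple prompt:

```python
import torch
import torch.nn.functional as F

# Hypothetical inputs, for illustration only:
#   box_embeds:    (num_boxes, dim) - classification embeddings of the predicted boxes
#   prompt_embeds: (2, dim)         - row 0 = red-apple prompt, row 1 = green-apple prompt
#   boxes:         (num_boxes, 4)   - predicted box coordinates
def keep_red_apples(box_embeds, prompt_embeds, boxes, red_idx=0):
    sims = F.normalize(box_embeds, dim=-1) @ F.normalize(prompt_embeds, dim=-1).t()
    best = sims.argmax(dim=-1)               # which prompt each box matches best
    keep = best == red_idx                   # drop boxes that match the green-apple prompt
    return boxes[keep], sims[keep, red_idx]  # surviving boxes and their red-apple scores

boxes = torch.rand(5, 4)  # toy (x1, y1, x2, y2) boxes
kept_boxes, kept_scores = keep_red_apples(torch.randn(5, 256), torch.randn(2, 256), boxes)
```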

@feivellau
Author

I see. I initially thought the implementation was similar to SAM, which involves multiple rounds of positive and negative prompts.
Can it be understood this way: by computing the similarity between the model's classification embeddings and the prompt embeddings, we can determine which prompt each detected instance belongs to, thereby achieving multi-class detection or filtering?

@Mountchicken
Collaborator

Yes, this is the way.

@feivellau
Author

I have another question. When making predictions, if the target object is not present in the image, are detections filtered out based on a similarity threshold?
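
The kind of threshold-based filtering described here might look like the sketch below; the threshold value and names are purely illustrative assumptions, not taken from the repository:

```python
import torch

# Hypothetical: sims holds each box's similarity score to the single target prompt.
def filter_by_threshold(boxes, sims, score_thr=0.3):
    keep = sims > score_thr          # if the object is absent, scores should stay low
    return boxes[keep], sims[keep]
```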
