
Question about object queries #178

Open
Ww-Lee opened this issue Aug 4, 2020 · 9 comments
Labels
question Further information is requested

Comments

@Ww-Lee

Ww-Lee commented Aug 4, 2020

Hello, I have some questions about decoder layer.

  1. Can I think of an object query as an adaptive anchor? In the paper, it is essentially a position embedding, but I don't understand how it actually works, especially in the first decoder layer. Leaving self-attention aside, the object queries only do cross-correlation with the image features, right? So in the encoder-decoder attention, what exactly does the position embedding do?

  2. In the first decoder layer, what does self-attention do?

  3. How can the 100 object queries learn to specialize in certain areas and box sizes, rather than in certain classes? But I think this question will be resolved once I understand your answers to the ones above.

Can you help me out? Thanks very much.

@alcinos
Contributor

alcinos commented Aug 5, 2020

Hi @Ww-Lee
Thank you for your interest in DETR.
You might be interested in our ECCV talk, specifically the section about object queries: https://youtu.be/utxbUlo9CyY?t=326

For your questions:

  1. By definition, "anchors" are something you use to make relative predictions. By contrast, in DETR all predictions are made in absolute terms, so we can't really talk about anchors here. You can think of the object queries as slots that the model can use to make its predictions, and it turns out experimentally that the model tends to reuse a given slot to predict objects in a given area of the image. Note that the terminology "position embedding" is borrowed from the NLP literature, but in the decoder nothing is "positional", since everything is a set and hence permutation-equivariant. (A minimal sketch of how the queries feed into the decoder follows this list.)

  2. The very first self-attention is useless. We verified experimentally that removing it does not change the performance. We left it in to avoid complicating the code unnecessarily.

  3. I believe this question is answered in the video.
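
For concreteness, here is a minimal PyTorch sketch of the "slots" view described in point 1, using standard nn modules and illustrative shapes rather than DETR's actual code (in the real model the query embeddings are added at every decoder layer; here they are simply used as the initial decoder input to keep the sketch short):

```python
import torch
import torch.nn as nn

num_queries, hidden_dim = 100, 256

# One learned vector per slot; nothing ties a slot to an image position a priori.
query_embed = nn.Embedding(num_queries, hidden_dim)

decoder_layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Illustrative encoder output ("memory"): H*W flattened image tokens for one image.
memory = torch.randn(600, 1, hidden_dim)       # (HW, batch, dim)
tgt = torch.zeros(num_queries, 1, hidden_dim)  # decoder input starts at zero

# Simplification: add the learned query embeddings to the initial target only.
hs = decoder(tgt + query_embed.weight.unsqueeze(1), memory)  # (100, 1, dim)
# Each of the 100 output vectors is then fed to the class / box prediction heads.
```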

Best of luck

@alcinos added the question label Aug 5, 2020
@Ww-Lee
Author

Ww-Lee commented Aug 6, 2020

Thank you. And could you please explain to me how the encoder memory responds to the object queries in the encoder-decoder attention mechanism, so that the trained model can predict objects in a given area of the image? I can't really picture the process the way I can picture semantic similarity in NLP.
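
For reference, here is a rough sketch of the standard encoder-decoder (cross-) attention step the question refers to; the shapes are illustrative and the learned projection matrices are omitted, so this is a simplification rather than DETR's implementation:

```python
import torch
import torch.nn.functional as F

hidden_dim, num_queries, num_tokens = 256, 100, 600  # 600 ~ H*W image tokens

queries = torch.randn(num_queries, hidden_dim)  # decoder side: object queries + content
memory = torch.randn(num_tokens, hidden_dim)    # encoder output, one vector per location

# In the real model, queries and memory first pass through learned W_q, W_k, W_v projections.
attn = F.softmax(queries @ memory.t() / hidden_dim ** 0.5, dim=-1)  # (100, 600)
out = attn @ memory                                                 # (100, 256)

# attn[i] says which image locations query i attends to; after training, a given
# query tends to put its attention mass on a consistent region of the image.
```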

@jd730
Contributor

jd730 commented Aug 21, 2020

Hi @alcinos, according to the video you shared, are the object queries specialized by spatial location rather than by class?

Have you ever analyzed which class information is learnt by each object query?

Thank you in advance.

@fmassa
Contributor

fmassa commented Aug 21, 2020

Hi @jd730

> Are the object queries specialized by spatial location rather than by class?

The object queries seem to specialize spatially, and not per class.

> Have you ever analyzed which class information is learnt by each object query?

We did some analysis, but we didn't see any clear per-class specialization of the object queries.
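
One way to probe this kind of specialization, sketched below, is to collect the predicted box centers and labels per query slot over a validation set and inspect their distributions. This assumes output dicts in the reference DETR format ('pred_logits', 'pred_boxes') and a hypothetical model_predictions iterable of per-image, unbatched outputs:

```python
from collections import defaultdict

centers_per_slot = defaultdict(list)  # slot index -> list of predicted (cx, cy)
classes_per_slot = defaultdict(list)  # slot index -> list of predicted labels

for outputs in model_predictions:               # hypothetical: one output dict per image
    boxes = outputs['pred_boxes']               # (num_queries, 4), normalized (cx, cy, w, h)
    labels = outputs['pred_logits'].argmax(-1)  # (num_queries,)
    for slot, (box, label) in enumerate(zip(boxes, labels)):
        centers_per_slot[slot].append((box[0].item(), box[1].item()))
        classes_per_slot[slot].append(label.item())

# Plotting centers_per_slot[i] for each slot i visualizes the spatial specialization
# discussed above; classes_per_slot[i] can likewise be used to check for any
# per-class specialization.
```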

@jd730
Contributor

jd730 commented Aug 21, 2020

Hi @fmassa,
Thank you for sharing your findings. It is really interesting that the queries are not specialized by class; I had thought that some of them would be related to classes.
Maybe the transformer, which conveys spatial information to the queries, is what drives them to specialize spatially.

@HawkRong

HawkRong commented Sep 3, 2020

I have not read DETR's code yet, but the conference paper gave me the impression that the 'object queries' are a fixed number of uniformly sampled locations on the spatial grid of the feature map before the decoder. Am I right?

@fmassa
Contributor

fmassa commented Sep 3, 2020

Hi @HawkRong ,

That's not correct.

I would suggest you have a look at Ross Girshick's CVPR tutorial on "Object Detection as a Machine Learning Problem", where he contrasts object queries with traditional approaches to detection.

The link with the exact timestamp is here (but I would recommend watching the whole video): https://youtu.be/her4_rzx09o?t=1351

Additionally, @alcinos posted an answer to a similar question in #178 (comment), along with another video that could help you understand it.

Let us know if you still have questions after checking those references.

@HawkRong

HawkRong commented Sep 4, 2020

@fmassa Thank you for sharing. I'm very interested. Could you provide the video files, since youtube.com is not accessible from my country?

@KaiserW

KaiserW commented Sep 8, 2023

> @fmassa Thank you for sharing. I'm very interested. Could you provide the video files, since youtube.com is not accessible from my country?

https://share.weiyun.com/EgCBlIpD
Hi, since many years have passed, I hope this is still helpful.
