Question about object queries #178
Hi @Ww-Lee, for your questions:
Best of luck
Thank you. Could you please explain how the encoder memory responds to the object queries in the encoder-decoder attention mechanism, so that the trained model can predict objects in a given area of the image? I really can't picture the process the way I can picture semantic similarity in NLP.
Hi @alcinos, according to the video you shared, are the object queries specialized by spatial location rather than by class? Have you ever analyzed which class information is learned by each object query? Thank you in advance.
Hi @jd730
The object queries seem to specialize spatially, not per class.
We did some analysis, but we didn't see any clear class specialization with respect to the object queries.
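One way to probe this kind of spatial specialization yourself is to collect the model's predicted boxes over a validation set and look at the per-slot spread of box centers: if a query slot specializes spatially, its predicted centers cluster in one region of the image. A minimal sketch, using random data as a stand-in for real DETR outputs (the `pred_boxes` array and its `(cx, cy, w, h)` layout are assumptions here, chosen to match DETR's normalized box format):

```python
import numpy as np

# Stand-in for DETR predictions gathered over a validation set:
# shape (num_images, num_queries, 4) in normalized (cx, cy, w, h) format.
# Random data here for illustration only.
rng = np.random.default_rng(0)
num_images, num_queries = 500, 100
pred_boxes = rng.uniform(0.0, 1.0, size=(num_images, num_queries, 4))

# Mean and spread of predicted box centers per query slot. A spatially
# specialized slot has a small per-slot std compared to the image-wide
# spread of centers.
centers = pred_boxes[..., :2]        # (num_images, 100, 2)
slot_mean = centers.mean(axis=0)     # (100, 2): average center per slot
slot_std = centers.std(axis=0)       # (100, 2): spread per slot

print(slot_mean.shape, slot_std.shape)
```

With real model outputs, scattering `slot_mean` (one point per query slot, as in the paper's Figure 7) makes the spatial specialization visible directly.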
Hi @fmassa,
I have not read DETR's code yet, but the conference paper gives me the impression that the 'object queries' are a fixed number of uniformly sampled locations on the spatial scale of the feature map before the decoder. Am I right?
Hi @HawkRong, that's not correct. I would suggest you have a look at Ross Girshick's CVPR tutorial on "Object Detection as a Machine Learning Problem", where he contrasts object queries with traditional approaches in detection. The link with the exact timestamp is here (but I would recommend watching the whole video): https://youtu.be/her4_rzx09o?t=1351. Additionally, @alcinos also posted an answer to a similar question in #178 (comment), with another video that could help you understand it. Let us know if you still have questions after checking those references.
@fmassa Thank you for sharing; I'm very interested. Could you provide the video files, since youtube.com is not accessible from my country?
https://share.weiyun.com/EgCBlIpD |
Hello, I have some questions about the decoder layers.
1. Can I think of an object query as an adaptive anchor? In the paper it is essentially a positional embedding, but I don't understand how it actually works, especially in the first decoder layer. Leaving self-attention aside, only the object queries do cross-attention with the image features, right? So in the encoder-decoder attention, what exactly does the positional embedding do?
2. In the first decoder layer, what does self-attention do?
3. How can the 100 object queries learn to specialize on certain areas and box sizes, rather than on certain classes? (I think this will be clear once I understand your answers to the questions above.)
Can you help me out? Thanks very much.
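For the cross-attention part of the question, a minimal single-head sketch may help. It follows the pattern used in DETR's decoder, where the learned query embedding is added to the (initially zero) decoder target to form Q, and the spatial positional encoding is added to the encoder memory to form K, while V is the raw memory. All tensor sizes below are toy values, not the real model's, and the single-head form is a simplification of the multi-head attention DETR actually uses:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, num_queries, hw = 256, 100, 49   # toy sizes; hw = flattened H*W

memory = rng.normal(size=(hw, d_model))       # encoder output (the "memory")
pos_embed = rng.normal(size=(hw, d_model))    # spatial positional encoding
query_embed = rng.normal(size=(num_queries, d_model))  # learned object queries
tgt = np.zeros((num_queries, d_model))        # decoder input starts at zero

# Encoder-decoder (cross) attention, single head, projections omitted:
# object queries act purely as learned positional embeddings on the Q side,
# spatial positions act on the K side, and content flows through V.
q = tgt + query_embed
k = memory + pos_embed
attn = softmax(q @ k.T / np.sqrt(d_model))    # (100, 49) attention weights
out = attn @ memory                            # each slot pools image features

print(out.shape)   # 100 slots, each now carrying pooled image content
```

So even in the first layer, each query produces its own attention map over the image, and training shapes those maps so different slots look at different regions; self-attention between the slots then lets them coordinate (e.g., avoid duplicate detections).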