Question about Visualizing Self-Attention in RT-DETR Encoder #478
Comments
Anchor1566: I want to ask how to visualize the self-attention in the encoder. I'm also new to learning about transformers.
You can refer to the code for a good starting point. Regarding RT-DETR, I can share my implementation with you (I've managed to get the attention weights for the encoder, but I haven't correctly indexed the weights for the decoder yet). If you manage to figure that part out, I'd be really interested in seeing your solution as well. Wishing you all the best and hope you enjoy the learning process!
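For anyone else trying this, here is a minimal sketch of how encoder attention weights can be pulled out with plain PyTorch. This is not the repo's actual code: the module names, feature-map size, and embedding dimension below are illustrative assumptions standing in for RT-DETR's encoder self-attention (which operates on flattened feature-map tokens), and it relies on `nn.MultiheadAttention`'s `need_weights`/`average_attn_weights` arguments.

```python
import torch
import torch.nn as nn

# Toy stand-in for the encoder's self-attention over flattened feature-map
# tokens. embed_dim, num_heads, and the 20x20 feature map are assumptions
# chosen for illustration, not values taken from the RT-DETR repo.
embed_dim, num_heads = 256, 8
h = w = 20                                  # feature-map height/width
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

tokens = torch.randn(1, h * w, embed_dim)   # flattened feature map: (B, h*w, C)

# need_weights=True returns the attention map alongside the output;
# average_attn_weights=False keeps one map per head instead of the mean.
out, weights = attn(tokens, tokens, tokens,
                    need_weights=True, average_attn_weights=False)
print(weights.shape)                        # torch.Size([1, 8, 400, 400])

# Attention from all query locations to one reference token (here the
# center of the map), reshaped back into a 2D map you can plot.
ref_idx = (h // 2) * w + w // 2
attn_map = weights[0, 0, :, ref_idx].reshape(h, w)
print(attn_map.shape)                       # torch.Size([20, 20])
```

Once `attn_map` is a 2D tensor, upsampling it to the input-image resolution and overlaying it (e.g. with `matplotlib`) gives the kind of visualization shown in the DETR paper.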
Thanks for the code, but I have a question. What is the purpose of wwpig in `import utils.wwpig as wwpig`? How can I install it? After running `pip install utils`, I still get the error `No module named 'utils.wwpig'`.
Hello lyuwenyu,
First of all, thank you for your amazing work on RT-DETR! I’ve just started learning about object detection models, and I truly appreciate the innovations that make RT-DETR both faster and more efficient. I've starred the repository and look forward to diving deeper into the project!
I have a question regarding the self-attention visualization in the encoder of RT-DETR. In the original DETR paper, the multi-layer encoder, with multiple attention heads, progressively focuses on different regions of the image, allowing the model to capture fine-grained features such as object edges, shapes, and contours. These attention maps clearly highlight object structures.
In contrast, RT-DETR simplifies the encoder, reducing it to a single layer to minimize computational overhead and improve inference speed. When visualizing self-attention in RT-DETR, I noticed that the attention maps do not reveal such explicit object shapes or outlines as in DETR.
Is this simplification of the encoder, aimed at reducing computational complexity, responsible for the reduced ability to capture complex object features and therefore the lack of clear object correlations in the attention maps?
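While waiting for an answer, a practical way to inspect what the single encoder layer attends to is to capture its attention weights without modifying the model's call sites. Below is a hedged sketch (not RT-DETR's actual code) that monkey-patches a `MultiheadAttention` module's `forward` so it always requests per-head weights and stashes them; a standard `nn.TransformerEncoderLayer` stands in for the real encoder, and the `captured` dict and `patch_attention` helper are names I made up for this example.

```python
import torch
import torch.nn as nn

# Stash for attention maps keyed by a label of our choosing.
captured = {}

def patch_attention(module, name):
    """Wrap a MultiheadAttention's forward so it always returns weights."""
    orig_forward = module.forward
    def forward(*args, **kwargs):
        # Force the weight-returning slow path and keep per-head maps.
        kwargs["need_weights"] = True
        kwargs["average_attn_weights"] = False
        out, w = orig_forward(*args, **kwargs)
        captured[name] = w.detach()
        return out, w
    module.forward = forward

# Stand-in for the single encoder layer; sizes are illustrative.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
patch_attention(encoder_layer.self_attn, "encoder.self_attn")

x = torch.randn(2, 10, 64)       # (batch, tokens, channels)
_ = encoder_layer(x)
print(captured["encoder.self_attn"].shape)  # torch.Size([2, 4, 10, 10])
```

The same patch applied to the real encoder's self-attention module would let you compare RT-DETR's single-layer maps against DETR's multi-layer ones head by head.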
As a beginner, I’d love any guidance or insights you could provide on this topic!
Thank you again for your hard work, and I’m excited to continue learning from this project.