The visual prompt embedding is indeed used for both final classification and decoder query selection. However, the input to the decoder is not the visual prompt embedding itself. In query selection, we compute the similarity between the visual prompt and each pixel of the encoder's output, then select the top N (N=900) pixels. At the positions of these selected pixels, predefined anchors of fixed sizes are placed. These anchors initialize the decoder's position embeddings, while the decoder's content embeddings are N learnable vectors.
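A minimal sketch of the selection step described above, with illustrative names and shapes (the actual model's tensors and similarity function may differ):

```python
import numpy as np

def select_anchors(prompt, encoder_pixels, anchors, n_queries=900):
    """Pick decoder anchor positions by prompt-pixel similarity.

    prompt:         (d,)  visual prompt embedding
    encoder_pixels: (P, d) per-pixel encoder output features
    anchors:        (P, 4) predefined fixed-size anchor boxes, one per pixel
    Returns the top-N anchor boxes, which initialize the decoder's
    position embeddings. All names here are hypothetical.
    """
    sims = encoder_pixels @ prompt        # (P,) similarity of prompt to each pixel
    top = np.argsort(-sims)[:n_queries]   # indices of the N most similar pixels
    return anchors[top]                   # (N, 4) selected anchor boxes

# Toy example: the content queries are separate learnable vectors,
# independent of the prompt; only the positions come from selection.
d, P, N = 8, 100, 10
rng = np.random.default_rng(0)
prompt = rng.normal(size=d)
pixels = rng.normal(size=(P, d))
anchors = rng.uniform(size=(P, 4))
content_queries = rng.normal(size=(N, d))  # learnable parameters in practice
pos_anchors = select_anchors(prompt, pixels, anchors, n_queries=N)
print(pos_anchors.shape)  # (10, 4)
```

The key point the sketch encodes: the prompt only ranks pixels; the decoder then refines boxes starting from the selected anchors, with content carried by the learnable queries.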
Did I understand correctly that the same prompt vector is used both for classification and for selecting the corresponding top-N pixel embeddings that seed the decoder's box queries?
If so, does it follow that the decoder queries Qdec do not change much, but are simply refined?
(I am familiar with how DETR works, but this point is still not completely clear to me.)
Thanks!