About Position Encoding #84
Comments
Hi @pisiguiii. The content embedding and the position embedding are added, not concatenated. |
Hi @Mountchicken, thanks for the great work! I also have some questions related to this issue.
But the implementation actually first computes [C;C'] + [B;B'], giving a tensor of shape [k + 1, 256], and then feeds it into a linear layer that does not change the shape, so the output is Q = linear([C;C'] + [B;B']) with shape [k + 1, 256]?
This with_pos_embed simply returns src + pos.
Option 1 seems to add the position embedding twice, and the position embedding would remain in the final V, which does not make sense to me. Could you help me understand this? Thank you! |
Hi @yu-xi-wang
The content embedding and the position embedding are not concatenated; they are added during attention. |
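To make the difference concrete, here is a minimal sketch of the two variants discussed in this thread. It is not the repository's code: `q_proj`, `cat_proj`, `k`, and `d` are hypothetical names and sizes chosen for illustration, and `with_pos_embed` only mirrors the `src + pos` behavior quoted above.

```python
import torch
from torch import nn


def with_pos_embed(src, pos):
    # Mirrors the helper quoted in the thread: plain element-wise addition.
    return src + pos


k, d = 4, 256                      # hypothetical: k box prompts plus one global token
content = torch.randn(k + 1, d)    # stands in for [C; C']
position = torch.randn(k + 1, d)   # stands in for [B; B']

# Variant used in the implementation, per the replies above: add, then project.
q_proj = nn.Linear(d, d)           # hypothetical projection layer
q_added = q_proj(with_pos_embed(content, position))

# Concatenation variant for comparison: the input width doubles to 2 * d.
cat_proj = nn.Linear(2 * d, d)     # hypothetical projection layer
concatenated = torch.cat([content, position], dim=-1)
q_cat = cat_proj(concatenated)

print(q_added.shape, concatenated.shape, q_cat.shape)
```

With addition the tensor keeps the shape [k + 1, 256] throughout, so no extra projection width is needed; with concatenation the linear layer has to map 512 features back to 256.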
Hi @Mountchicken, thank you so much for the reply! Yes, it makes sense to me now! |
Hi @Mountchicken, hi @yu-xi-wang! I tried to implement the positional encoding code as I understood it from the article, but I ran into a problem that all my encoded boxes had almost identical embeddings. Maybe you can help me understand what I'm missing? This is my code:
|
@VilisovEvgeny
import math

import torch


def gen_sineembed_for_position(pos_tensor):
    # pos_tensor: (n_query, bs, 2) or (n_query, bs, 4), coordinates normalized to [0, 1].
    # Returns 128 sine/cosine features per coordinate, concatenated on the last dim.
    # n_query, bs, _ = pos_tensor.size()
    # sineembed_tensor = torch.zeros(n_query, bs, 256)
    scale = 2 * math.pi
    dim_t = torch.arange(128, dtype=torch.float32, device=pos_tensor.device)
    dim_t = 10000 ** (2 * (dim_t // 2) / 128)
    x_embed = pos_tensor[:, :, 0] * scale
    y_embed = pos_tensor[:, :, 1] * scale
    pos_x = x_embed[:, :, None] / dim_t
    pos_y = y_embed[:, :, None] / dim_t
    # Interleave sin (even channels) and cos (odd channels).
    pos_x = torch.stack((pos_x[:, :, 0::2].sin(), pos_x[:, :, 1::2].cos()),
                        dim=3).flatten(2)
    pos_y = torch.stack((pos_y[:, :, 0::2].sin(), pos_y[:, :, 1::2].cos()),
                        dim=3).flatten(2)
    if pos_tensor.size(-1) == 2:
        pos = torch.cat((pos_y, pos_x), dim=2)
    elif pos_tensor.size(-1) == 4:
        w_embed = pos_tensor[:, :, 2] * scale
        pos_w = w_embed[:, :, None] / dim_t
        pos_w = torch.stack((pos_w[:, :, 0::2].sin(), pos_w[:, :, 1::2].cos()),
                            dim=3).flatten(2)
        h_embed = pos_tensor[:, :, 3] * scale
        pos_h = h_embed[:, :, None] / dim_t
        pos_h = torch.stack((pos_h[:, :, 0::2].sin(), pos_h[:, :, 1::2].cos()),
                            dim=3).flatten(2)
        pos = torch.cat((pos_y, pos_x, pos_w, pos_h), dim=2)
    else:
        raise ValueError("Unknown pos_tensor shape(-1):{}".format(
            pos_tensor.size(-1)))
    return pos
|
@Mountchicken thanks for the provided solution! But I'm a little confused: why does the cosine similarity between the resulting position embeddings never drop below 0.7? Is this common behavior? |
I'm not sure if this is normal. Did you normalize the box coordinates to 0-1 before you got the position embedding? |
Yes, I did. I also checked that the boxes I used are in cxcywh format. |
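One possible explanation, offered as a sketch rather than a confirmed diagnosis: each sin/cos pair in the function above shares one `dim_t`, and sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so the cosine similarity of two embeddings is just the average of cos(2π·Δ/dim_t) over all coordinates and frequencies. Most `dim_t` values are much larger than 2π, so for coordinates in [0, 1] those terms stay close to 1 whatever the boxes are, which puts a high floor under the similarity. The pure-Python code below re-implements the embedding for a single box (the channel order differs from the function above, which does not affect dot products) and checks that identity; `NUM_FEATS`, `TEMPERATURE`, and the two example boxes are illustrative assumptions.

```python
import math

NUM_FEATS = 128       # features per coordinate, as in the function above
TEMPERATURE = 10000   # base of the frequency schedule, as in the function above


def sine_embed(box):
    """Sine embedding of one box whose coordinates are normalized to [0, 1]."""
    emb = []
    for coord in box:
        for i in range(NUM_FEATS):
            dim_t = TEMPERATURE ** (2 * (i // 2) / NUM_FEATS)
            angle = coord * 2 * math.pi / dim_t
            # Even channels use sin, odd channels use cos, sharing one dim_t per pair.
            emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closed_form_similarity(box_a, box_b):
    """Average of cos(2*pi*(a - b) / dim_t) over coordinates and frequency pairs."""
    total = 0.0
    for a, b in zip(box_a, box_b):
        for j in range(NUM_FEATS // 2):
            dim_t = TEMPERATURE ** (2 * j / NUM_FEATS)
            total += math.cos(2 * math.pi * (a - b) / dim_t)
    return total / (len(box_a) * (NUM_FEATS // 2))


box_a = (0.10, 0.20, 0.30, 0.40)  # hypothetical cxcywh boxes
box_b = (0.90, 0.80, 0.05, 0.60)
sim = cosine_similarity(sine_embed(box_a), sine_embed(box_b))
print(sim, closed_form_similarity(box_a, box_b))
```

If this is what is happening, the high similarity is a property of the encoding rather than a bug in the box handling. Subtracting the mean embedding before comparing, or using a smaller temperature, are things one could try; whether either matches the authors' setup is not stated in this thread.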
@Mountchicken could you help me with the issue I described previously? The final global embeddings are much more similar to other global embeddings than to the final embeddings of their own classes. I'm following the paper and using the GroundingDINO DeformAttnDecoderLayer as a base. |
Hi @VilisovEvgeny |
Thanks for your reply, @Mountchicken! I visualized not only the global embeddings but all embeddings from the final output (so there is one embedding for each unique sample per class per image, plus one global embedding). The main problem is that the global embeddings from one image for different classes are too similar to each other, so when I try to fit my pipeline it does not even pass a sanity check on a small amount of data. This is how the similarity between global embeddings of different classes, from the same image and from different images, looks: I understand that this is a lot to ask, but I would be very grateful if you could tell me what the average similarity is between global embeddings obtained from different classes from the same image and from different images. |
Hi!
I want to ask: did you try using a position encoder with a learnable layer instead of the sinusoidal position encoder? If so, how did it behave?
I'm also interested in the following. As I understand from the paper, in the final version of visual prompt processing you concatenate the encoded boxes and the content embeddings, so with a 256-d encoded box embedding and a 256-d content embedding, the final CAT(B, C) has d = 512. Did you try summing these embeddings instead? Like:
Q = Linear(B + C)
with d = 256?