Interesting thinking! In previous projects doing 2D segmentation or regression with CNNs, I used the following:
This would also work for transformers, as it operates on the final output, not on the embeddings. A single "class embedding" already summarizes the entire tile, so I don't see an urgent need to worry about the edges there. For fine-tuning applications, it would be interesting to see how a model performs if one removes the patches at the edge! Intuitively, however, I would run inference using all the patches and then handle edge effects after prediction with an algorithm like the one described above.
-
We split a raster into images, and the transformer then splits each image into patches (of 8x8). This means that the self-attention for each patch uses the rest of the image to learn the context. This is ideal for the center patch, but patches in the corners of the image only get context from the rest of that same image. Since they sit in the corner, this covers only a fraction of their actual surroundings.
The green patch at the center has good context for self-attention, but the yellow patch in the corner has very limited context in some directions. It will not even see its adjacent patches in those directions, since they are on other images.
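To put a rough number on this, here is a minimal sketch that measures how much of a patch's surrounding neighbourhood actually falls inside its own image. The function name `context_fraction`, the 32x32 patch grid (a 256-pixel image with 8x8 patches), and the neighbourhood radius of 8 patches are all assumptions for illustration; only the 8x8 patch size comes from the post.

```python
def context_fraction(row, col, grid, radius):
    """Fraction of the (2*radius+1)^2 patch neighbourhood centred on
    (row, col) that lies inside a grid x grid image of patches.

    Hypothetical helper: 'radius' is an illustrative notion of
    "nearby context", not a property of the transformer itself.
    """
    def span(i):
        # Number of patch indices in [i - radius, i + radius] that are
        # still inside the image along one axis.
        lo = max(0, i - radius)
        hi = min(grid - 1, i + radius)
        return hi - lo + 1

    full = (2 * radius + 1) ** 2
    return span(row) * span(col) / full


if __name__ == "__main__":
    grid, radius = 32, 8  # assumed: 256 px image, 8 px patches
    for name, (r, c) in {"center": (16, 16), "edge": (0, 16), "corner": (0, 0)}.items():
        print(f"{name}: {context_fraction(r, c, grid, radius):.2f}")
```

Under these assumptions the center patch sees its full neighbourhood, an edge patch roughly half of it, and a corner patch a bit more than a quarter.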
The easiest solution here is to create embeddings not by tiling a raster, but by sliding a window of the same size as the tile and assigning the location of each embedding to the patch at the center of the window. This solution greatly multiplies the number of embeddings, and also creates a lot of overlap between embeddings.
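As a rough sketch of the trade-off, the snippet below enumerates window positions for both schemes and counts them. The helper names (`window_origins`, `center_patch`), the 1024x1024 raster, and the 256-pixel tile are assumptions for illustration; with a stride equal to the tile size it reproduces plain tiling, and with a stride equal to the 8-pixel patch size it gives one window per center patch.

```python
def window_origins(height, width, tile, stride):
    """Yield the (top, left) pixel offset of every full window that
    fits inside a height x width raster (hypothetical helper)."""
    for top in range(0, height - tile + 1, stride):
        for left in range(0, width - tile + 1, stride):
            yield top, left


def center_patch(top, left, tile, patch):
    """Return the (top, left) pixel offset of the patch the embedding
    is assigned to. With an even number of patches per side there is
    no exact center, so this picks the patch just past the midpoint."""
    offset = (tile // patch // 2) * patch
    return top + offset, left + offset


if __name__ == "__main__":
    height = width = 1024  # assumed raster size
    tile, patch = 256, 8   # assumed tile size; patch size from the post

    tiled = list(window_origins(height, width, tile, stride=tile))
    slid = list(window_origins(height, width, tile, stride=patch))
    print(len(tiled), "embeddings when tiling")
    print(len(slid), "embeddings when sliding by one patch")
    print("first window is assigned to patch at", center_patch(*slid[0], tile, patch))
```

Note that patches closer than half a tile to the raster border never become a center in this sketch, so the raster edge itself would still need padding or another fallback.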
We do not yet know how to create geoembeddings that are both context-aware (transformer) and small enough, nor how to assign them semantic bounds.
cc @srmsoumya @yellowcap @danhammer