re-surfacing this. Disregarding the context-inclusive nature of patch embeddings in geospatial AI needs a deeper dive. Averaging patches is just too blunt. We must actively explore how contextual information influences patch representations and develop techniques to leverage this for enhanced analysis.

One crucial question lies in quantifying contextual influence: can we measure how neighboring patches shape a given patch's embedding? Imagine a context-aware object detection model that leverages surrounding patch data to improve identification, particularly in challenging scenarios with occlusions or ambiguous features.

To illustrate, consider a patch embedding representing a section of a river. Its representation might be subtly influenced by neighboring patches: upstream or downstream sections depicting lush vegetation suggest a slower-moving, meandering river, as opposed to one surrounded by rocky terrain. Understanding and quantifying such contextual influences can significantly enhance our ability to interpret and analyze that patch.

Now, let's explore a more complex scenario. Imagine a patch embedding representing a green grassy field with a red house fully contained within it. Although such a configuration might be rare in the training data, it shouldn't adversely affect the model's overall performance. During inference, however, this context could offer valuable insights. The presence of the red house might subtly influence the grass embedding, hinting at the type of grass typically found near such structures, perhaps a specific variety common in regions with red-brick farmhouses, like in Utah. This nuanced understanding, derived from contextual cues, can refine our analysis and lead to more targeted interpretations.

Drawing parallels to the text domain, we encounter similar phenomena. Consider the phrase "The cook was preparing a delicious antimatter cake in the oven." The unusual term "antimatter" subtly influences the embeddings of the surrounding words, but in ways that might actually help: the embedding for "cook" is now potentially closer to "physicist", or the scene reads as science fiction. While rare occurrences like this might not significantly impact overall language model training, they can offer valuable contextual cues during inference, enabling a richer understanding of the text.

In short, averaging might be killing the most important part, but we don't know how to deal with it. Yet.
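One way to start quantifying this, sketched below under loud assumptions: use any ViT-style encoder (here a generic timm model, *not* Clay itself), embed the same patch once inside its full scene and once with its surroundings blanked to a neutral value, and measure how far the patch's token embedding drifts.

```python
# A minimal sketch, NOT Clay's API: measure how much context shifts one
# patch's embedding. Uses a generic timm ViT; a real pipeline would also
# normalize inputs to the model's expected statistics.
import timm
import torch
import torch.nn.functional as F

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

def patch_tokens(img: torch.Tensor) -> torch.Tensor:
    """Per-patch token embeddings for one image, shape (num_patches, dim)."""
    with torch.no_grad():
        feats = model.forward_features(img.unsqueeze(0))  # (1, 1+N, D)
    return feats[0, 1:]  # drop the CLS token (assumes a single prefix token)

scene = torch.rand(3, 224, 224)  # placeholder scene
row, col, patch, grid = 7, 7, 16, 14

# Context-free version: keep the target patch, flatten everything else
# to the scene's mean color.
neutral = scene.mean(dim=(1, 2), keepdim=True).expand_as(scene).clone()
y0, x0 = row * patch, col * patch
neutral[:, y0:y0 + patch, x0:x0 + patch] = scene[:, y0:y0 + patch, x0:x0 + patch]

idx = row * grid + col
e_ctx = patch_tokens(scene)[idx]    # patch embedded with its context
e_iso = patch_tokens(neutral)[idx]  # same pixels, context removed

# 1 - cosine similarity = how far context moved this patch's embedding.
drift = 1 - F.cosine_similarity(e_ctx, e_iso, dim=0).item()
print(f"contextual drift for patch ({row}, {col}): {drift:.4f}")
```

Running this over many patches would give a distribution of "contextual drift", which is exactly the quantity the averaging step throws away.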
---
A key use case for Clay is to find similar stuff. Give it a few examples of parking lots, and find more of those. Very quickly, the challenge becomes that the small stuff is much smaller than the image. E.g. the image size is 512x512 at Sentinel-2 resolution, so 5km x 5km, and you might want to find dams, or airports, or aquaculture, which might be ~100m. This is a dual problem: the target is much smaller than the image, and the patch-level embeddings that could localize it are designed to be context-dependent.

We've been moderately successful with patch embedding similarity, but there is one underlying fundamental issue. Patch embeddings are literally designed to depend on their context. The whole point of self-attention is to capture not only the semantics of the patch itself, but how it relates to the ones around it: the same exact helipad image will have a different patch embedding if it's on a ship, at a hospital, or at an airport.
Transformers force word embeddings to distinguish among senses given the context, and then we try to find the same word and struggle when the embeddings differ, exactly as we forced them to. The word "bank" is our patch, and we struggle when, given "world bank", we cannot find the "similar" case "riverbank". In EO, it doesn't matter that our token (the patch) is actually an image that might carry whole, isolated semantics (like a car); it is forced to distinguish the same car given the context. It is only at the image level, not the patch level, that we get whole semantics.
With v0, the image size was fixed, and large, hence we needed the patch level. For v1 we are doing several resolutions, and several image sizes. This should enable us to generate embeddings for images much closer to the size of the semantics we are looking for.
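A minimal sketch of what that could look like, with a hypothetical `embed_image` call standing in for a whole-image encoder: slide a window sized near the target feature over the scene, and embed each window as its own image so its semantics aren't forced to depend on a much larger context.

```python
# Sketch of the v1 idea: tile the scene into windows near the target
# feature's scale and embed each window as a whole image. `embed_image`
# is a placeholder, not a real Clay API.
import numpy as np

def embed_image(chip: np.ndarray) -> np.ndarray:
    """Placeholder encoder: per-band mean, standing in for a real model."""
    return chip.mean(axis=(1, 2))

def tile_windows(scene: np.ndarray, window: int, stride: int):
    """Yield (row, col, chip) windows from a (C, H, W) scene."""
    _, h, w = scene.shape
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            yield y, x, scene[:, y:y + window, x:x + window]

scene = np.zeros((3, 512, 512), dtype=np.float32)  # placeholder S2 chip
window = 16  # 16 px ~ 160 m at 10 m/px, near dam/airport scale
embs = np.stack([embed_image(chip)
                 for _, _, chip in tile_windows(scene, window, window)])
print(embs.shape)  # (1024, 3): one whole-image embedding per window
```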
My question:
@leothomas @MaceGrim @yellowcap @srmsoumya
Related: #222 #107