re-surfacing this. Disregarding the context-inclusive nature of patch embeddings in geospatial AI needs a deeper dive. Averaging patches is just too blunt. We must actively explore how contextual information influences patch representations and develop techniques to leverage this for enhanced analysis.

One crucial question lies in quantifying contextual influence: can we measure how neighboring patches shape a given patch's embedding? Imagine a context-aware object detection model that leverages surrounding patch data to improve identification, particularly in challenging scenarios with occlusions or ambiguous features.

To illustrate, consider a patch embedding representing a section of a river. Its representation might be subtly influenced by neighboring patches: upstream or downstream sections depicting lush vegetation suggest a slower-moving, meandering river, as opposed to one surrounded by rocky terrain. Understanding and quantifying such contextual influences can significantly enhance our ability to interpret and analyze that patch.

Now, let's explore a more complex scenario. Imagine a patch embedding representing a green grassy field with a red house fully contained within it. Although such a configuration might be rare in the training data, it shouldn't adversely affect the model's overall performance. During inference, however, this context could offer valuable insights. The presence of the red house might subtly influence the grass embedding, hinting at the type of grass typically found near such structures, perhaps a specific variety common in regions with red-brick farmhouses, like in Utah. This nuanced understanding, derived from contextual cues, can refine our analysis and lead to more targeted interpretations.

Drawing parallels to the text domain, we encounter similar phenomena. Consider the phrase "The cook was preparing a delicious antimatter cake in the oven." The unusual term "antimatter" subtly influences the embeddings of the surrounding words, but in ways that might actually help: the embedding for "cook" is now potentially closer to "physicist", or the scene reads as science fiction. While rare occurrences like this might not significantly impact overall language model training, they can offer valuable contextual cues during inference, enabling a richer understanding of the text.

In short, averaging might be killing the most important part, but we don't know how to deal with it. Yet.
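One way to start quantifying this, sketched below under loud assumptions: use any ViT-style encoder (here a generic timm model, *not* Clay itself), embed the same patch once inside its full scene and once with its surroundings blanked to a neutral value, and measure how far the patch's token embedding drifts.

```python
# A minimal sketch, NOT Clay's API: measure how much context shifts one
# patch's embedding. Uses a generic timm ViT; a real pipeline would also
# normalize inputs to the model's expected statistics.
import timm
import torch
import torch.nn.functional as F

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

def patch_tokens(img: torch.Tensor) -> torch.Tensor:
    """Per-patch token embeddings for one image, shape (num_patches, dim)."""
    with torch.no_grad():
        feats = model.forward_features(img.unsqueeze(0))  # (1, 1+N, D)
    return feats[0, 1:]  # drop the CLS token (assumes a single prefix token)

scene = torch.rand(3, 224, 224)  # placeholder scene
row, col, patch, grid = 7, 7, 16, 14

# Context-free version: keep the target patch, flatten everything else
# to the scene's mean color.
neutral = scene.mean(dim=(1, 2), keepdim=True).expand_as(scene).clone()
y0, x0 = row * patch, col * patch
neutral[:, y0:y0 + patch, x0:x0 + patch] = scene[:, y0:y0 + patch, x0:x0 + patch]

idx = row * grid + col
e_ctx = patch_tokens(scene)[idx]    # patch embedded with its context
e_iso = patch_tokens(neutral)[idx]  # same pixels, context removed

# 1 - cosine similarity = how far context moved this patch's embedding.
drift = 1 - F.cosine_similarity(e_ctx, e_iso, dim=0).item()
print(f"contextual drift for patch ({row}, {col}): {drift:.4f}")
```

Running this over many patches would give a distribution of "contextual drift", which is exactly the quantity the averaging step throws away.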
---
A key use case for Clay is to find similar stuff. Give it a few examples of parking lots, and find more of those. Very quickly, the challenge becomes that the small stuff is much smaller than the image. E.g. the image size is 512x512 at Sentinel-2 resolution, so 5km x 5km, and you might want to find dams, or airports, or aquaculture, which might be ~100m. This is a dual problem: the target is much smaller than the image, and the patch-level embeddings that could localize it are designed to be context-dependent.

We've been moderately successful with patch embedding similarity, but there is one underlying fundamental issue. Patch embeddings are literally designed to depend on their context. The whole point of self-attention is to capture not only the semantics of the patch itself, but how it relates to the ones around it: the same exact helipad image will have a different patch embedding if it's on a ship, at a hospital, or at an airport.
Transformers force word embeddings to distinguish among senses given the context, and then we try to find the same word and struggle when the embeddings differ, exactly as we forced them to. The word "bank" is our patch, and we struggle when, given "world bank", we cannot find the "similar" case "riverbank". In EO, it doesn't matter that our token (the patch) is actually an image that might carry whole, isolated semantics (like a car); it is forced to distinguish the same car given the context. It is only at the image level, not the patch level, that we get whole semantics.
With v0, the image size was fixed, and large, hence we needed the patch level. For v1 we are doing several resolutions, and several image sizes. This should enable us to generate embeddings for images much closer to the size of the semantics we are looking for.
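A minimal sketch of what that could look like, with a hypothetical `embed_image` call standing in for a whole-image encoder: slide a window sized near the target feature over the scene, and embed each window as its own image so its semantics aren't forced to depend on a much larger context.

```python
# Sketch of the v1 idea: tile the scene into windows near the target
# feature's scale and embed each window as a whole image. `embed_image`
# is a placeholder, not a real Clay API.
import numpy as np

def embed_image(chip: np.ndarray) -> np.ndarray:
    """Placeholder encoder: per-band mean, standing in for a real model."""
    return chip.mean(axis=(1, 2))

def tile_windows(scene: np.ndarray, window: int, stride: int):
    """Yield (row, col, chip) windows from a (C, H, W) scene."""
    _, h, w = scene.shape
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            yield y, x, scene[:, y:y + window, x:x + window]

scene = np.zeros((3, 512, 512), dtype=np.float32)  # placeholder S2 chip
window = 16  # 16 px ~ 160 m at 10 m/px, near dam/airport scale
embs = np.stack([embed_image(chip)
                 for _, _, chip in tile_windows(scene, window, window)])
print(embs.shape)  # (1024, 3): one whole-image embedding per window
```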
My question:
@leothomas @MaceGrim @yellowcap @srmsoumya
Related: #222 #107