The inference about LAPS #1

XpracticeYSKM · 2024-07-18T06:36:10Z

Thanks for your awesome work!

I wonder whether the patch selection and calibration are also conducted during inference?

In other words, is inference consistent with the training process, where we apply LAPS to the vision encoder, split the sentence into textual words, and then conduct patch-word alignment rather than image-sentence alignment?

darkpromise98 · 2024-07-18T08:27:40Z

Thanks for your interest. I am happy to answer your questions.

Question 1: Yes, the patch selection and calibration are conducted in the inference stage.

Question 2: Yes, the inference process is consistent with the training process: we compute patch-word alignment rather than image-sentence alignment, because LAPS is a fine-grained alignment framework, unlike previous coarse-grained work (e.g., CLIP and its follow-up work).

Alternatively, you could use LAPS to compute patch-word alignment in the training stage and then compute image-sentence alignment in the inference stage. In this way, the performance will drop slightly, but the computational efficiency will be greatly improved.
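To make the two inference options concrete, here is a minimal NumPy sketch, not the LAPS implementation. The function names, the random features, and the max-then-mean aggregation in `patch_word_score` are all assumptions chosen for illustration; the actual scoring function in the repository may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_word_score(patches, words):
    """Fine-grained score from (num_patches, d) and (num_words, d) features."""
    sim = l2_normalize(patches) @ l2_normalize(words).T  # (num_patches, num_words)
    # Assumed aggregation: every word takes its best-matching patch and
    # every patch takes its best-matching word, then both sides are averaged.
    word_to_patch = sim.max(axis=0).mean()
    patch_to_word = sim.max(axis=1).mean()
    return float(0.5 * (word_to_patch + patch_to_word))

def image_sentence_score(patches, words):
    """Coarse score: pool each side to one vector, then one cosine similarity."""
    image_vec = l2_normalize(patches.mean(axis=0))
    sentence_vec = l2_normalize(words.mean(axis=0))
    return float(image_vec @ sentence_vec)

# Hypothetical features standing in for selected patches and word embeddings.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))  # 16 patches, dimension 8
words = rng.normal(size=(5, 8))     # 5 words, dimension 8

fine = patch_word_score(patches, words)
coarse = image_sentence_score(patches, words)
print(f"patch-word: {fine:.3f}, image-sentence: {coarse:.3f}")
```

The coarse variant needs only one pooled vector per image and per sentence, so those vectors can be precomputed and a whole retrieval gallery scored with a single matrix product. The fine-grained variant has to build a patch-by-word similarity matrix for every image-sentence pair, which is where the extra inference cost comes from.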
