fix(ColbertRerank): calculate ColBERT similarity per token rather than vs pooled query embeds #11335

bclavie · 2024-02-23T21:27:12Z

Description

Fixes the similarity calculation for the ColBERTReranker.

The current implementation mean-pools the query token embeddings into a single vector and computes cosine similarity between that vector and each document token. The original ColBERT MaxSim implementation instead works at the token level: it compares each query token to each document token.
This PR removes the mean pooling so that similarity is computed per query token.
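
To make the difference concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function names (`cosine`, `maxsim_score`, `pooled_query_score`) and the toy vectors are hypothetical, the per-token maxima are summed as in the ColBERT paper (the reranker itself may aggregate differently), and taking the best document token in the pooled variant is an assumption made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim_score(query_tokens, doc_tokens):
    """Token-level MaxSim: for each query token, keep its best-matching
    document token, then sum those maxima over the query tokens."""
    return sum(
        max(cosine(q, d) for d in doc_tokens)
        for q in query_tokens
    )


def pooled_query_score(query_tokens, doc_tokens):
    """Pooled variant: mean-pool the query into one vector, then compare
    that single vector to each document token.
    Assumption: the best document token is used as the score."""
    dim = len(query_tokens[0])
    pooled = [
        sum(q[i] for q in query_tokens) / len(query_tokens)
        for i in range(dim)
    ]
    return max(cosine(pooled, d) for d in doc_tokens)


# Toy 2-d "embeddings" (hypothetical values, chosen to show a ranking flip).
query = [[1.0, 0.0], [0.0, 1.0]]   # two distinct query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[1.0, 1.0]]               # matches only the *average* of the query

print("maxsim:", maxsim_score(query, doc_a), maxsim_score(query, doc_b))
print("pooled:", pooled_query_score(query, doc_a), pooled_query_score(query, doc_b))
```

In this toy setup, token-level MaxSim ranks `doc_a` above `doc_b` because each query token finds an exact match, while the pooled variant ranks `doc_b` first because it only resembles the averaged query. Pooling discards the per-token information that this fix restores.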

Fixes # (issue)

Type of Change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Added new unit/integration tests
Added new notebook (that tests end-to-end)
I stared at the code and made sure it makes sense

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

hatianzhang · 2024-02-23T21:42:26Z

@bclavie thanks for the fix.

…n vs pooled query embeds (run-llama#11335) fix: calculate ColBERT similarity per token rather than vs pooled query embedding

fix: calculate ColBERT similarity per token rather than vs pooled que…

a4dcc5e

…ry embedding

dosubot bot added the size:XS label (This PR changes 0-9 lines, ignoring generated files.) Feb 23, 2024

bclavie changed the title ~~fix: calculate ColBERT similarity per token rather than vs pooled query beds~~ fix(ColbertRerank): calculate ColBERT similarity per token rather than vs pooled query beds Feb 23, 2024

bclavie changed the title ~~fix(ColbertRerank): calculate ColBERT similarity per token rather than vs pooled query beds~~ fix(ColbertRerank): calculate ColBERT similarity per token rather than vs pooled query embeds Feb 23, 2024

hatianzhang approved these changes Feb 23, 2024

dosubot bot added the lgtm label (This PR has been approved by a maintainer) Feb 23, 2024

hatianzhang merged commit b285c6f into run-llama:main Feb 23, 2024
8 checks passed
