Adding enhanced cross-lingual retrieval benchmark by merging retrieval pools from different languages #1625

Open
yjoonjang opened this issue Dec 24, 2024 · 1 comment

Comments

@yjoonjang
Contributor

yjoonjang commented Dec 24, 2024

Hi MTEB maintainers @KennethEnevoldsen, @Muennighoff

@seongtaehong and I were considering a way to make cross-lingual retrieval tasks more challenging by merging retrieval pools from two different languages.

Here’s the idea:

  • The task would be to retrieve two gold passages from a retrieval pool composed of content in two different languages.
  • The retrieval pool would consist of pairs of passages that have the same meaning but are written in different languages (e.g., StrategyQA and Ko-StrategyQA, with the latter being the Korean translation of StrategyQA).
  • Given a query in Korean, the model would need to retrieve the top 2 passages, and those two passages must be in different languages. The same applies to a query in English.
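
To make the setup concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not an MTEB implementation: the function names (`merge_pools`, `merge_qrels`, `both_languages_at_2`), the `lang:doc_id` ID scheme, and the toy passages are all hypothetical, and it assumes the two corpora use the same ID for a passage and its counterpart in the other language.

```python
from typing import Dict, List

Corpus = Dict[str, str]       # doc_id -> passage text
Qrels = Dict[str, List[str]]  # query_id -> gold doc_ids


def merge_pools(corpora: Dict[str, Corpus]) -> Corpus:
    """Merge per-language corpora into one pool, prefixing each ID with its language."""
    merged: Corpus = {}
    for lang, corpus in corpora.items():
        for doc_id, text in corpus.items():
            merged[f"{lang}:{doc_id}"] = text
    return merged


def merge_qrels(qrels: Qrels, langs: List[str]) -> Qrels:
    """Expand each gold passage into one gold passage per language (shared IDs assumed)."""
    return {
        qid: [f"{lang}:{doc_id}" for doc_id in gold for lang in langs]
        for qid, gold in qrels.items()
    }


def both_languages_at_2(ranked: Dict[str, List[str]], gold: Qrels) -> float:
    """Fraction of queries whose top-2 results are gold passages in two different languages."""
    hits = 0
    for qid, gold_ids in gold.items():
        top2 = ranked.get(qid, [])[:2]
        langs = {doc_id.split(":", 1)[0] for doc_id in top2 if doc_id in gold_ids}
        hits += len(langs) == 2
    return hits / len(gold) if gold else 0.0


# Toy data (hypothetical): two parallel passages per language.
en = {"d1": "Paris is the capital of France.", "d2": "The Nile is a river in Africa."}
ko = {"d1": "파리는 프랑스의 수도이다.", "d2": "나일강은 아프리카의 강이다."}

pool = merge_pools({"en": en, "ko": ko})
gold = merge_qrels({"q1": ["d1"]}, ["en", "ko"])

# A ranking as some retriever might return it for query q1.
ranked = {"q1": ["ko:d1", "en:d1", "en:d2"]}
score = both_languages_at_2(ranked, gold)
```

The metric here is deliberately strict: a query only counts as solved when both top-2 slots hold gold passages and the two are in different languages. A softer variant (for example, recall over the two gold passages) would be an equally reasonable choice.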

We believe this approach reflects a more realistic scenario, as many retrieval pools in the real world are derived from web crawling, and such pools naturally include data in multiple languages.
What are your thoughts on this idea? We are happy to adjust the proposal based on your feedback.

@KennethEnevoldsen
Contributor

Love the idea, though I would not use translations for it. Using translations leads to oddities that are fine for training but horrible for evaluation.
