Multilingual IR with Machine-Translated FAQ #46
Comments
Great idea, @stedomedo! Did I get this right that we would still need language-specific models for question similarity with this approach? Would an alternative be to translate the user question live to English and then do the matching against our FAQs? With that approach we could easily leverage English models for question similarity.
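A minimal sketch of the translate-then-match idea from the comment above. Everything here is an assumption for illustration: `fake_translate` stands in for a live translation API call, and the token-overlap `jaccard` score is only a placeholder for a real English question-similarity model.

```python
import re
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap score; a placeholder for an English similarity model."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_via_english(
    query: str,
    faq_questions: List[str],
    translate_to_en: Callable[[str], str],
    similarity: Callable[[str, str], float] = jaccard,
) -> Tuple[str, float]:
    """Translate the query to English, then return the best English FAQ match."""
    if not faq_questions:
        raise ValueError("faq_questions must not be empty")
    query_en = translate_to_en(query)
    scored = [(q, similarity(query_en, q)) for q in faq_questions]
    return max(scored, key=lambda pair: pair[1])


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a live translation service (lookup table only)."""
    lookup = {"was sind die symptome": "what are the symptoms"}
    return lookup.get(text.lower().strip("?! ."), text)


faqs = ["What are the symptoms?", "How does the virus spread?"]
best, score = match_via_english("Was sind die Symptome?", faqs, fake_translate)
```

Passing the translator and the similarity function in as arguments keeps the matching logic independent of whichever translation service or model ends up being used.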
Yes, that's an option.
Here is the English FAQ data including columns for Arabic:
And the MS translator: MS Translator is supposed to be quite good for Arabic. For other languages, Google or DeepL are better options (afaik they don't offer free credits). I'm checking which real-time translation option is best to use, including budget-wise.
@tholor @Timoeller I have a question on the (desired) search workflow. So could a multilingual workflow be like this:
Good points. Can you create a PR with the translation and the script for doing so? I would merge it to have this functionality in the repo. About the language detection and the switch between BERT + ES and ES only: we could implement it this way if multilingual isn't working well for other languages. Do you have experience with language detection, and could you write a script for this so we can integrate it into the backend? We need language detection there anyway, because we want to adjust output texts like "source", "category", etc. The script should be rather efficient, since it will limit response time...
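One way the detection-plus-switch step could look, as a rough sketch only. The stopword lists, the `SEMANTIC_LANGS` set, and the mode names are all made up for illustration; a production backend would more likely use a dedicated language-identification library. The point is just that a cheap lookup can run per request without adding much latency.

```python
import re

# Tiny illustrative stopword lists (assumption); real coverage would be broader.
STOPWORDS = {
    "en": {"the", "is", "and", "what", "how", "can", "i", "are", "of", "to"},
    "de": {"der", "die", "das", "ist", "und", "wie", "was", "kann", "ich", "sind"},
    "fr": {"le", "la", "les", "est", "et", "que", "comment", "puis", "je", "sont"},
}

# Assumption: only English has a working embedding model for now.
SEMANTIC_LANGS = {"en"}


def detect_language(text, default="en"):
    """Pick the language whose stopwords occur most often in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    # Fall back to the default when no stopword matched at all.
    return best if counts[best] > 0 else default


def choose_retriever(text):
    """Return (language, retrieval mode) for an incoming user question."""
    lang = detect_language(text)
    if lang in SEMANTIC_LANGS:
        return lang, "embedding+elasticsearch"
    return lang, "elasticsearch_only"


lang, mode = choose_retriever("Wie kann ich mich testen lassen?")
```

The detected language code can also be reused to pick the localized output labels mentioned above.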
One idea for "simple" transfer learning:
That is exactly the idea! : ) So if we train a multilingual model in Sentence Bert on Quora, we will also be able to match all other languages, hopefully with good performance 💃
You are probably aware of these datasets, but here's some multilingual similarity data. I have an NMT model for English->Swedish; if you want, I could machine-translate and add some data for better performance on Scandinavian languages. https://github.com/google-research-datasets/paws
Building multilingual models (zero-shot, transfer learning, etc.) takes time.
So, in the meantime, as stated in #2, we could machine-translate FAQs from English into other languages and add them to the search cluster, so that they can be retrieved for foreign-language input. The background translations don't need to be perfect, just sufficient for retrieval (adequacy before fluency/grammar).
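A rough sketch of what the offline translation step could produce, assuming each FAQ is a dict with `question` and `answer` keys. The field names (`lang`, `machine_translated`, `source_question`) and the `fake_translate` helper are hypothetical; the actual call to a translation service and the indexing into the search cluster are left out.

```python
def build_translated_docs(faqs, target_langs, translate):
    """Expand English FAQs into one search document per language.

    `translate(text, lang)` is any callable that returns the translated text.
    """
    docs = []
    for faq in faqs:
        # Keep the English original alongside its translations.
        docs.append({**faq, "lang": "en", "machine_translated": False})
        for lang in target_langs:
            docs.append({
                "question": translate(faq["question"], lang),
                "answer": translate(faq["answer"], lang),
                "lang": lang,
                "machine_translated": True,
                # Link back to the English source for later quality checks.
                "source_question": faq["question"],
            })
    return docs


def fake_translate(text, lang):
    """Hypothetical stand-in for a machine-translation API call."""
    return f"[{lang}] {text}"


faqs = [{"question": "What are the symptoms?", "answer": "Fever and cough."}]
docs = build_translated_docs(faqs, ["ar", "de"], fake_translate)
```

Tagging each document with its language and a machine-translation flag would let the search side filter by the detected query language and swap in human translations later.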
TODOs:
data/scrapers
repo