
Create new “How to choose an embedder” guide #3040

Open
guimachiavelli opened this issue Nov 11, 2024 · 3 comments · May be fixed by #3058

@guimachiavelli
Member

Recent customer feedback indicates users are struggling to move beyond the basic AI-powered search tutorial and implement hybrid search in their own projects.

Main points to address:

  • Define the main differences between the available embedders
  • Ideally we should be less “this is the best option for use case X” and more “if your app does Y, choose embedders with high A”
    • pinging @dureuill: do you think it is possible to create general guidelines as described above? What are the things users should look at when trying to decide which embedder to choose?
@dureuill
Contributor

dureuill commented Nov 12, 2024

Hey @guimachiavelli 👋

I guess we're in a bit of a situation where "if you have to ask, just use OpenAI".

A very rough outline could be (see the configuration sketch after the list):

  • If unsure, use OpenAI
  • If your app has a feature to search by images, audio, or anything that is not text, you need to embed these media separately and use the user-provided embedder.
  • If your app relies on a specific model or embedder, or you are already using a specific embedding provider (Azure, Mistral, etc.), then use the REST embedder.
    • If the remote embedder is an ollama server, prefer the ollama embedder instead.
  • Otherwise, use OpenAI.
  • If you really cannot use OpenAI or any other embedding provider, consider the Hugging Face embedder. The Hugging Face embedder is best suited when you have a small number of documents (on the order of 10k) and don't intend to update them often.
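
To make the options above concrete, here is a rough sketch of how each might map to Meilisearch's embedder settings. This is a sketch only: all model names, URLs, and keys are placeholders, and the exact fields (especially the `rest` source's request/response templates) vary by Meilisearch version.

```python
# Sketch only: assumes a local Meilisearch instance with vector search
# enabled. All model names, URLs, and keys below are placeholders.
import requests

MEILI_URL = "http://localhost:7700"
HEADERS = {
    "Authorization": "Bearer MEILISEARCH_ADMIN_KEY",  # placeholder key
    "Content-Type": "application/json",
}

embedder_options = {
    # Default choice: OpenAI
    "openAi": {
        "source": "openAi",
        "apiKey": "OPENAI_API_KEY",  # placeholder
        "model": "text-embedding-3-small",
        "documentTemplate": "A document titled {{doc.title}}",
    },
    # Non-textual media, or any model Meilisearch cannot call itself:
    # you compute vectors yourself and ship them with documents and queries
    "userProvided": {
        "source": "userProvided",
        "dimensions": 512,  # must match your model's output size
    },
    # A specific provider (Azure, Mistral, etc.) reachable over HTTP;
    # the exact request/response templates depend on the provider's API
    "rest": {
        "source": "rest",
        "url": "https://example.com/v1/embeddings",  # placeholder
        "request": {"input": "{{text}}"},
        "response": {"embedding": "{{embedding}}"},
    },
    # A local Ollama server
    "ollama": {
        "source": "ollama",
        "url": "http://localhost:11434/api/embeddings",
        "model": "nomic-embed-text",
    },
    # Embedding inside Meilisearch itself: small, rarely-updated datasets
    "huggingFace": {
        "source": "huggingFace",
        "model": "BAAI/bge-small-en-v1.5",
    },
}

# In practice you configure one embedder, not all five:
resp = requests.patch(
    f"{MEILI_URL}/indexes/movies/settings/embedders",
    headers=HEADERS,
    json={"default": embedder_options["openAi"]},
)
print(resp.json())  # returns a task to poll until the setting is applied
```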

@guimachiavelli
Member Author

guimachiavelli commented Nov 12, 2024

Thanks for the reply, @dureuill, much appreciated!

A small follow-up regarding your second point: does the user-provided embedder suggestion apply to documents with no meaningful textual fields, to non-textual queries, or both? I have realised it's not completely clear to me how we accommodate users with non-textual documents.

@dureuill
Contributor

dureuill commented Nov 12, 2024

> does the user-provided embedder suggestion apply to documents with no meaningful textual fields, to non-textual queries, or both?

Meilisearch does not natively support non-textual fields, either in documents or in search requests: you can include an image in a document as base64, or reference it via its URL, but you cannot meaningfully search for that document using that image.

As soon as you use a user-provided embedder, you need to provide vectors both in your documents and in your semantic/hybrid search queries.

From there, any combination of textual/non-textual is possible: since image embedding models appear to generally first do image -> text and then text -> embedding, one can choose to embed either text or images, both at indexing and at search time. All the embedding operations have to be done outside of Meilisearch, though.
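
To make this flow concrete, here is a minimal sketch, assuming an embedder named "custom" configured with source "userProvided" and dimensions 512 (as in the settings sketch above). embed_image and embed_text are hypothetical stand-ins for whatever model you run outside Meilisearch.

```python
# Sketch only: assumes an embedder named "custom" with source "userProvided"
# and dimensions 512. embed_image/embed_text are hypothetical placeholders
# for a model you run yourself.
import requests

MEILI_URL = "http://localhost:7700"
HEADERS = {
    "Authorization": "Bearer MEILISEARCH_ADMIN_KEY",  # placeholder key
    "Content-Type": "application/json",
}

def embed_image(path: str) -> list[float]:
    # Placeholder: call your image model here (e.g. a CLIP-style encoder).
    return [0.0] * 512  # dummy vector; must match the embedder's dimensions

def embed_text(text: str) -> list[float]:
    # Placeholder: call a text model that shares the image model's space.
    return [0.0] * 512

# Indexing: vectors travel inside documents, under _vectors.<embedder name>.
documents = [
    {
        "id": 1,
        "title": "Sunset over the bay",
        "image_url": "https://example.com/sunset.jpg",
        "_vectors": {"custom": embed_image("sunset.jpg")},
    }
]
requests.post(
    f"{MEILI_URL}/indexes/photos/documents", headers=HEADERS, json=documents
)

# Searching: the query vector is also computed outside Meilisearch. Here the
# query is text, but an embedded image would work the same way.
search = requests.post(
    f"{MEILI_URL}/indexes/photos/search",
    headers=HEADERS,
    json={
        "q": "sunset",  # optional keyword half of the hybrid search
        "vector": embed_text("sunset at the beach"),
        "hybrid": {"embedder": "custom", "semanticRatio": 0.5},
    },
)
print(search.json())
```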

guimachiavelli linked a pull request Nov 28, 2024 that will close this issue