Replies: 2 comments 1 reply
-
Overall, mixing different columns into one vector field should be fine, at least as long as the titles differ. But embedding the category name as well might produce a lot of noise from duplicated vectors, which is not good for the vector index.
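One quick way to gauge how much duplication the combined text would produce is to count identical strings before embedding them. This is only a minimal sketch, assuming the combined texts are already available as a list of strings; the function name `duplicate_texts` and the sample values are illustrative, not taken from the real data.

```python
from collections import Counter
from typing import Dict, List

def duplicate_texts(texts: List[str]) -> Dict[str, int]:
    """Return each combined text that occurs more than once, with its count."""
    counts = Counter(text.strip() for text in texts)
    return {text: n for text, n in counts.items() if n > 1}

# Illustrative sample: two products end up with the same combined text.
texts = [
    "Striped t-shirt men navy blue Zara",
    "Striped t-shirt men navy blue Zara",
    "Koala patterned hoodie children blue Bubito",
]
print(duplicate_texts(texts))
```

Identical strings give identical embeddings, so a high duplicate count here is a hint that the combined text carries too little product-specific information.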
1 reply
-
I changed and clarified the question. Any idea is welcome.
-
I have a products table with a lot of columns, of which the following are important for our search:
We are planning to use Qdrant vector search to implement fast vector queries. The problem is that the data that matters for search is spread across different columns, and I do not think (correct me if I am wrong) that generating vector embeddings separately for every column is the best solution.
I came up with the idea of combining the columns and creating separate collections. I chose this approach because the title, category, brand, and attrs columns hold essentially the same content, just in different languages.
I also use the "BAAI/bge-m3" model, a multilingual text embedding model that supports more than 100 languages.
So, in short, I created a separate collection for each language. Each collection has a vector column containing the embedding of the combined text of title, brand, color, and category in that language. Because we already know which language the website is in, at search time we only query the collection for that language.
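To make the setup concrete, here is a minimal sketch of how the combined text and the per-language collection name could be built before embedding. The product layout (one dict per field, keyed by language code), the field order, and the `products_<lang>` naming scheme are all assumptions for illustration; the real column names and collection names may differ.

```python
from typing import Dict

# Assumed field order for the combined text (hypothetical, adjust as needed).
FIELDS = ["title", "category", "color", "brand"]

def collection_name(lang: str) -> str:
    """Hypothetical naming scheme: one collection per language."""
    return f"products_{lang}"

def combined_text(product: Dict[str, Dict[str, str]], lang: str) -> str:
    """Join the non-empty fields for one language into a single string to embed."""
    parts = [product.get(field, {}).get(lang, "").strip() for field in FIELDS]
    return " ".join(part for part in parts if part)

# Illustrative product with English values only.
product = {
    "title": {"en": "Koala patterned hoodie"},
    "category": {"en": "children"},
    "color": {"en": "blue"},
    "brand": {"en": "Bubito"},
}
print(collection_name("en"), "->", combined_text(product, "en"))
# products_en -> Koala patterned hoodie children blue Bubito
```

The resulting string is what would be passed to the embedding model and stored in the collection for that language.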
Now, the question is: is this a valid method? What are its pros and cons? I know for sure that once the fields are combined, I cannot give different weights to different parts of the vector. For example, one combined text of title, category, color, and brand may look like this:
"Koala patterned hoodie children blue Bubito"
or something like:
"Striped t-shirt men navy blue Zara"
Now, a user may search for "blue hoodie for men", but because the combined vector is unweighted, it will not retrieve the best results.
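To show what "weights" would mean here, below is a minimal sketch of per-field weighting, assuming each field were embedded separately instead of being combined into one text. The function names, the weights, and the tiny 2-dimensional vectors are purely illustrative; real bge-m3 embeddings are much larger, and none of this has been tested on the actual data.

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def weighted_score(
    query_vec: List[float],
    field_vecs: Dict[str, List[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of the query's similarity to each field vector."""
    total = sum(weights.values())
    return sum(w * cosine(query_vec, field_vecs[f]) for f, w in weights.items()) / total

# Toy vectors: product A matches the query on title, product B on brand.
query = [1.0, 0.0]
product_a = {"title": [1.0, 0.0], "brand": [0.0, 1.0]}
product_b = {"title": [0.0, 1.0], "brand": [1.0, 0.0]}

equal = {"title": 0.5, "brand": 0.5}
title_heavy = {"title": 0.75, "brand": 0.25}

print(weighted_score(query, product_a, equal), weighted_score(query, product_b, equal))
print(weighted_score(query, product_a, title_heavy), weighted_score(query, product_b, title_heavy))
```

With equal weights the two toy products tie, while the title-heavy weights rank the title match first. A single combined vector offers no such control.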
I may be wrong, and this may already give some of the best results, but please tell me more about the pros and cons of this method and, if you can, suggest a better approach.
It is important to note that we currently have more than 300,000 (300K) products, and the catalog will grow to more than 1,000,000 (1M) in the near future.