Return no results with a smaller K value but got results when K is bigger #152
Comments
Thanks for the report! The problem here is that So in the provided SQL query, the
SELECT *
FROM entities_vec
WHERE embedding MATCH :embedding
AND K = :limit
Meaning the You have two options to solve this: move the
Option 1: Move columns to new metadata columns
create virtual table entities_vec using vec0(
embedding float[1024],
file_type_group text
);
insert into entities_vec(rowid, embedding, file_type_group) values
(1, '[...]', 'image');
select *
from entities_vec
where embedding match :embedding
and k = :limit
and file_type_group = 'image';
However, the "tags" concept doesn't port over well to vec0 metadata columns. I assume that a single "entity" can have multiple tags, like '["Firefox", "Excel", "Discord"]'? In that case, vec0 metadata columns don't quite support that. We manually capture and calculate constraints on metadata columns, and we don't have support for string array lookups. You could use
and file_type_group in ('image', 'text', 'video')
But if the metadata column itself is an array of values, we don't have support for "array value in array" lookups yet.
Option 2: Do everything manually and pre-filter
Another option is to ditch
WITH subset AS (
SELECT DISTINCT entities.id
FROM entities
JOIN entity_tags ON entities.id = entity_tags.entity_id
JOIN tags ON entity_tags.tag_id = tags.id
WHERE entities.file_type_group = 'image'
AND tags.name IN ('Firefox')
)
SELECT
subset.id,
vec_distance_cosine(entities_vec.embedding, :query) as distance
FROM subset
LEFT JOIN entities_vec ON entities_vec.id = subset.id
ORDER BY distance
LIMIT :limit;
In this case,
Let me know if there is anything I can help with! Would definitely like to get this officially documented. And Pensieve looks great, I'll check it out!
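The pre-filter pattern in Option 2 can be tried without the extension installed. The following is a minimal sketch, not the sqlite-vec implementation: `cosine_distance` is a hypothetical pure-Python stand-in for `vec_distance_cosine`, `entity_embeddings` is a hypothetical plain table standing in for the vec0 table, and the sample rows are invented for illustration.

```python
import json
import math
import sqlite3


def cosine_distance(a_json, b_json):
    """Hypothetical stand-in for vec_distance_cosine: 1 - cosine similarity."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("cosine_distance", 2, cosine_distance)
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, file_type_group TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE entity_tags (entity_id INTEGER, tag_id INTEGER);
    -- Plain table with JSON-encoded vectors (assumption, not a vec0 table).
    CREATE TABLE entity_embeddings (id INTEGER PRIMARY KEY, embedding TEXT);
    INSERT INTO entities VALUES (1, 'image'), (2, 'image'), (3, 'text'), (4, 'image');
    INSERT INTO tags VALUES (1, 'Firefox'), (2, 'Excel');
    INSERT INTO entity_tags VALUES (1, 1), (2, 1), (3, 1), (4, 2);
    INSERT INTO entity_embeddings VALUES
        (1, '[1.0, 0.0]'), (2, '[0.0, 1.0]'), (3, '[1.0, 0.1]'), (4, '[0.9, 0.1]');
""")

# Filter first (image entities tagged Firefox), then rank only that subset.
rows = conn.execute(
    """
    WITH subset AS (
        SELECT DISTINCT entities.id
        FROM entities
        JOIN entity_tags ON entities.id = entity_tags.entity_id
        JOIN tags ON entity_tags.tag_id = tags.id
        WHERE entities.file_type_group = 'image'
          AND tags.name IN ('Firefox')
    )
    SELECT subset.id, cosine_distance(entity_embeddings.embedding, :query) AS distance
    FROM subset
    JOIN entity_embeddings ON entity_embeddings.id = subset.id
    ORDER BY distance
    LIMIT :limit
    """,
    {"query": "[1.0, 0.0]", "limit": 1},
).fetchall()
print(rows)
```

Because the subset is computed before any distance is evaluated, the `LIMIT` applies only to rows that already pass the filters. The cost is a distance computation for every row in the subset, so large subsets will be slow.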
Hey,
file_ids = self.conn.execute(
"""
SELECT DISTINCT f.file_id
FROM files f
JOIN tags_files tf ON tf.file_id = f.file_id
JOIN tags t ON t.tag_id = tf.tag_id
WHERE t.name = ?
""",
[tag_filter],
).fetchall()
file_ids = [str(file_id[0]) for file_id in file_ids]
if not file_ids:
    return []
file_ids_str = ",".join(file_ids)
rows = self.conn.execute(
f"""
SELECT
e.file_id,
f.name,
f.type,
f.md5_hash,
distance
FROM embeddings e
JOIN files f ON f.file_id = e.file_id
WHERE e.file_id IN ({file_ids_str})
AND vector MATCH ?
AND k = ?
""",
[query_vector, top_k],
).fetchall()
I am curious whether this is the recommended way, or whether there is a potential optimization. Hope to see the best practice in the docs soon. Great project, by the way; excited to see HNSW support soon.
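One small variation on the snippet above, offered as a sketch and not as the project's recommended approach: the id list can be bound through generated `?` placeholders instead of being spliced into the SQL string. The `files` table and its rows here are invented for illustration, and the example uses only the standard `sqlite3` module without the vector table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO files VALUES (1, 'a.png'), (2, 'b.png'), (3, 'c.txt');
""")

# In the original snippet these ids come from the tag lookup query.
file_ids = [1, 3]

# One "?" per id, so the values are bound instead of formatted into the SQL text.
placeholders = ",".join("?" for _ in file_ids)
rows = conn.execute(
    f"SELECT file_id, name FROM files WHERE file_id IN ({placeholders}) ORDER BY file_id",
    file_ids,
).fetchall()
print(rows)
```

SQLite caps the number of bound parameters per statement, so very large id lists may need batching or a temporary table.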
Thanks for the reply. I am following Option 2, but it seems there will be a performance issue when subset returns a lot of ids? Right now there are about 100,000 records in my database.
2024-12-10 19:08:11,794 - INFO - Get embedding took 0.3194 seconds
2024-12-10 19:08:31,089 - INFO - SQL:
WITH subset AS (
SELECT DISTINCT entities.id as id
FROM entities
JOIN entity_tags ON entities.id = entity_tags.entity_id
JOIN tags ON entity_tags.tag_id = tags.id
WHERE entities.file_type_group = 'image'
)
SELECT
subset.id,
vec_distance_cosine(entities_vec.embedding, :embedding) as distance
FROM subset
LEFT JOIN entities_vec ON entities_vec.rowid = subset.id
WHERE K = :limit
ORDER BY distance
2024-12-10 19:08:31,090 - INFO - Params: limit: 384
2024-12-10 19:08:31,090 - INFO - Vector search results: []
2024-12-10 19:08:31,090 - INFO - Vector search execution time: 19.2951 seconds
I encountered an issue where vector search results differ significantly depending on the value of K in a query. Specifically, when K = 96, the query returns no results, but when K = 384, results are returned successfully.
If this is a kind of KNN search, I think I should get the same results with K = 96, right?
Steps to Reproduce
Here is the SQL query I am using for both cases:
I used the following parameters for the query:
Expected Behavior
I expect the query to return results even when K = 96, especially if there are results when K = 384. The results should scale proportionally to K if the dataset contains matching records.
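One way a smaller K can return nothing while a larger K returns rows is if the extra filters are applied after the K nearest rows have been chosen. The toy sketch below illustrates that ordering effect in plain Python. It assumes post-filtering is what happens in the reported query, which is one reading of this thread and not a confirmed description of the library's internals. The function names and the data are hypothetical.

```python
def knn_then_filter(items, k, predicate):
    """Take the k nearest items first, then drop those that fail the filter."""
    nearest = sorted(items, key=lambda item: item["distance"])[:k]
    return [item for item in nearest if predicate(item)]


def filter_then_knn(items, k, predicate):
    """Apply the filter first, then take the k nearest of what remains."""
    matching = [item for item in items if predicate(item)]
    return sorted(matching, key=lambda item: item["distance"])[:k]


def is_image(item):
    return item["group"] == "image"


# Invented data: the two closest items do not match the filter.
items = [
    {"id": 1, "distance": 0.10, "group": "text"},
    {"id": 2, "distance": 0.20, "group": "text"},
    {"id": 3, "distance": 0.30, "group": "image"},
    {"id": 4, "distance": 0.40, "group": "image"},
]

small_k = knn_then_filter(items, 2, is_image)    # both nearest are text
large_k = knn_then_filter(items, 4, is_image)    # images now fall inside the top 4
prefiltered = filter_then_knn(items, 2, is_image)

print([item["id"] for item in small_k])
print([item["id"] for item in large_k])
print([item["id"] for item in prefiltered])
```

Under this assumption, the number of results does not scale with K in any simple way: it depends on how many matching rows happen to fall inside the K nearest neighbors.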
Additional Information
sqlite-vec version: 0.1.6
SQLite version: 3.45.3
BTW, this is a fantastic project! I’m leveraging it to deliver hybrid search results (FTS + vector-based) in my open-source AI project, Pensieve. I’m continually working to refine and enhance the quality of the search results.