Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return no results with a smaller K value but got results when K is bigger #152

Open
arkohut opened this issue Dec 9, 2024 · 3 comments
Open

Comments

@arkohut
Copy link

arkohut commented Dec 9, 2024

I encountered an issue where vector search results differ significantly depending on the value of K in a query. Specifically, when K = 96, the query returns no results, but when K = 384, results are returned successfully.

If this is a kind of KNN, I think there should get the same results with K = 96, right?

Steps to Reproduce

Here is the SQL query I am using for both cases:

SELECT DISTINCT entities.id 
FROM entities
JOIN entities_vec ON entities.id = entities_vec.rowid
JOIN entity_tags ON entities.id = entity_tags.entity_id
JOIN tags ON entity_tags.tag_id = tags.id
WHERE entities_vec.embedding MATCH :embedding
  AND entities.file_type_group = 'image'
  AND tags.name IN ('Firefox') 
  AND K = :limit

I used the following parameters for the query:

  1. When K = 96 Query results: []
  2. When K = 384 Query results: [68295, 67412, 68064, ...] (25 IDs in total)

Expected Behavior

I expect the query to return results even when K = 96, especially if there are results when K = 384. The results should scale proportionally to K if the dataset contains matching records.

Additional Information

  1. Plugin version: 0.1.6
  2. SQLite version: 3.45.3

BTW, this is a fantastic project! I’m leveraging it to deliver hybrid search results (FTS + vector-based) in my open-source AI project, Pensieve. I’m continually working to refine and enhance the quality of the search results.

@asg017
Copy link
Owner

asg017 commented Dec 10, 2024

Thanks for the report!

The problem here is that vec0 metadata filters can only be provided when it's a specified metadata column. Other WHERE clauses that come from other JOIN'ed tables are not available during the KNN filter, which is a limitation of SQLite virtual tables.

So in the provided SQL query, the entities_vec virtual table only sees the following constraints:

SELECT *
FROM entities_vec
WHERE embedding MATCH :embedding
  AND K = :limit

Meaning the entities_vec query planner does not see the file_type_group = 'image' or name IN ('Firefox') constraints. This is essentially the "post filtering problem" described in this article. When K = 384, then 25 of the returned results would satisfy the constraints, but when K=96, none of those 25 results satisfied the constraints.

You have two options to solve this: move the file_type_group and name columns to new metadata columns on the entities_vec table, or "eject" and perform KNN searches manually.

Option 1: Move columns to new metadata columns

create virtual table entities_vec using vec0(
  embedding float[1024], 
  file_type_group text
);

insert into entities_vec(rowid, embedding, file_type_group) values
  (1, '[...]', 'image');

select *
from entities_vec
where embedding match :embedding
  and k = :limit
  and file_type_group = 'image';

However, the "tags" concept doesn't port over well to vec0 metadata columns. I assume that a single "entity" can have multiple tags, like '["Firefox", "Excel", "Discord"]'? In that case, then vec0 metadata columns doesnt' quite support that. We manually capture and calculate constraints on metadata columns, and we dont have support for string arrays lookups.

You could use metadata_column IN (...) to ensure a metadata column value is part of a list, like:

and file_type_group in ('image', 'text', 'video')

But if the metadata column itself is an array of values, we don't have support for "array value in array" lookups yet.

Option 2: Do everything manually and pre-filter

Another option is to ditch vec0 virtual tables, pre-filter a list of candidates usiing regular SQL, then manually perform KNN on that subset using the "manually KNN" strategy described in the docs. That would look roughly like:

WITH subset AS (
  SELECT DISTINCT entities.id 
  FROM entities
  JOIN entity_tags ON entities.id = entity_tags.entity_id
  JOIN tags ON entity_tags.tag_id = tags.id
  WHERE  entities.file_type_group = 'image'
    AND tags.name IN ('Firefox') 
)
SELECT 
  subset.id,
  vec_distance_cosine(entities_vec.embedding, :query) as distance
FROM subset
LEFT JOIN entities_vec ON entities_vec.id = subset.id
ORDER BY distance
LIMIT :limit;

In this case, entities_vec can be a "regular table" to make things a bit faster. It's not as nice as the vec0 tables, since you have to calculate everything manually, but this would solve the "string array" tags problem.

Let me know if there anything I can help with! Would definitely like to get this officially documented. And Pensieve looks great, I'll check it out!

@SeanPedersen
Copy link

Hey,
I am having a similar use case and solved it like this for now:

file_ids = self.conn.execute(
    """
    SELECT DISTINCT f.file_id
    FROM files f
    JOIN tags_files tf ON tf.file_id = f.file_id
    JOIN tags t ON t.tag_id = tf.tag_id
    WHERE t.name = ?
    """,
    [tag_filter],
).fetchall()
file_ids = [str(file_id[0]) for file_id in file_ids]
if not file_ids:
    return []
file_ids_str = ",".join(file_ids)
rows = self.conn.execute(
    f"""
    SELECT
        e.file_id,
        f.name,
        f.type,
        f.md5_hash,
        distance
    FROM embeddings e
    JOIN files f ON f.file_id = e.file_id
    WHERE e.file_id IN ({file_ids_str})
    AND vector MATCH ?
    AND k = ?
    """,
    [query_vector, top_k],
).fetchall()

I am curious if this is the recommended way or is there a potential optimization possible? Hope to see the best-practice in the docs soon. Great project btw excited to see HNSW support soon.

@arkohut
Copy link
Author

arkohut commented Dec 10, 2024

Thanks for the reply. I am following the Option 2 but it seems that there will be performance issue when subset return a lot of ids? Right now there are about 100,000 records in my database.

2024-12-10 19:08:11,794 - INFO - Get embedding took 0.3194 seconds
2024-12-10 19:08:31,089 - INFO - SQL:
    WITH subset AS (
        SELECT DISTINCT entities.id as id
        FROM entities
        JOIN entity_tags ON entities.id = entity_tags.entity_id
        JOIN tags ON entity_tags.tag_id = tags.id
        WHERE entities.file_type_group = 'image'
    )
    SELECT
        subset.id,
        vec_distance_cosine(entities_vec.embedding, :embedding) as distance
    FROM subset
    LEFT JOIN entities_vec ON entities_vec.rowid = subset.id
    WHERE K = :limit
    ORDER BY distance

2024-12-10 19:08:31,090 - INFO - Params: limit: 384
2024-12-10 19:08:31,090 - INFO - Vector search results: []
2024-12-10 19:08:31,090 - INFO - Vector search execution time: 19.2951 seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants