Tracking issue: metadata filtering #26

Closed
asg017 opened this issue Jun 21, 2024 · 7 comments

Comments

@asg017
Owner

asg017 commented Jun 21, 2024

sqlite-vec doesn't have good metadata filtering as of v0.1.0. Only vector columns can be declared in the vec0 constructor. You can do pre-filtering with vec_column IN (...) queries, but that's slow and inconvenient.

I'm thinking:

create virtual table vec_movies using vec0(
  movie_id text primary key,
  genre text,
  release_date date,
  rating text,
  is_3d boolean,
  synopsis_embedding float[768]
);

genre, release_date, rating, and is_3d would all be "metadata" columns. You could do queries like:

select
  rowid,
  distance
from vec_movies
where synopsis_embedding match embed('comedic american summer camp')
  and k = 20
  and is_3d
  and release_date between '2010-01-01' and '2015-12-31'
  and rating = 'PG';

We could capture all the WHERE clauses to ensure that the top 20 returned vectors match those criteria.
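
One way to read "capture all the WHERE clauses" is that the metadata predicates are applied while candidates are scanned, so that k counts only rows that pass the filter. The Python sketch below illustrates that idea with a brute-force scan. It is not how sqlite-vec implements it; the `Movie` record, the `filtered_knn` helper, and the toy 2-d embeddings are all hypothetical.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass
class Movie:
    movie_id: str
    rating: str
    is_3d: bool
    release_date: str  # ISO-8601 text, so string comparison orders by date
    embedding: list[float]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_knn(rows, query, k, predicate):
    """Hypothetical helper: keep only rows passing the metadata predicate,
    then return the k nearest as (distance, movie_id) pairs."""
    candidates = (
        (l2(r.embedding, query), r.movie_id) for r in rows if predicate(r)
    )
    return heapq.nsmallest(k, candidates)


movies = [
    Movie("a", "PG", True, "2012-06-01", [0.0, 0.1]),
    Movie("b", "R", True, "2012-06-01", [0.0, 0.0]),     # nearest, wrong rating
    Movie("c", "PG", True, "2013-01-15", [0.5, 0.5]),
    Movie("d", "PG", False, "2011-03-03", [0.0, 0.05]),  # not 3D
    Movie("e", "PG", True, "2019-01-01", [0.0, 0.02]),   # outside date range
]


def wanted(m):
    # Mirrors the WHERE clause in the query above.
    return (
        m.is_3d
        and m.rating == "PG"
        and "2010-01-01" <= m.release_date <= "2015-12-31"
    )


top = filtered_knn(movies, [0.0, 0.0], k=2, predicate=wanted)
print([movie_id for _, movie_id in top])
```

Filtering only after taking the k nearest rows would return fewer than k results whenever close neighbors fail the predicate, which is why the filter is applied before ranking here.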

A few open questions:

How do we store metadata values?

We could store them row-oriented (OLTP-style) with the _rowids shadow tables, but that may be slow. We could store them column-oriented to match the vector column format, but I'm unsure how much faster that would be.
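
To make the two layouts concrete, here is a small Python sketch. It assumes nothing about sqlite-vec's actual shadow-table format: `rows`, `columns`, and both filter functions are hypothetical. The point is only that a column-oriented layout lets a filter read one compact array per predicate instead of every full row.

```python
from array import array

# Row-oriented: one complete record per rowid.
rows = {
    1: {"genre": "comedy", "is_3d": 1, "year": 2012},
    2: {"genre": "horror", "is_3d": 0, "year": 2014},
    3: {"genre": "comedy", "is_3d": 1, "year": 2009},
}

# Column-oriented: one typed array per column; position i belongs to rowids[i].
rowids = array("q", [1, 2, 3])
columns = {
    "is_3d": array("b", [1, 0, 1]),
    "year": array("h", [2012, 2014, 2009]),
}


def filter_rows(min_year):
    """Scan full records, reading every field of every row."""
    return [
        rid
        for rid, rec in rows.items()
        if rec["is_3d"] and rec["year"] >= min_year
    ]


def filter_columns(min_year):
    """Scan only the two arrays the predicate needs."""
    is_3d, year = columns["is_3d"], columns["year"]
    return [
        rowids[i]
        for i in range(len(rowids))
        if is_3d[i] and year[i] >= min_year
    ]


print(filter_rows(2010), filter_columns(2010))
```

Both functions return the same rowids; whether the columnar scan is actually faster inside SQLite is the open question above.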

How would this work with ANN indexes?

🤷

What datatypes to support?

Ideally everything, and ideally STRICT. But if we go column-oriented, we'd need a restricted subset, like:

  • TEXT
  • INT
  • DOUBLE
  • BLOB
  • BOOLEAN
  • DATE/DATETIME

Maybe we could do dictionary encoding for text values? Maybe that's a column option, like genre text encoding=dictionary or something. Maybe ENUMs? NULL/NOT NULL?
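
For reference, dictionary encoding replaces each repeated text value with a small integer code plus a lookup table, so an equality filter becomes an integer comparison. The sketch below is a generic Python illustration: the `encoding=dictionary` option above is only a proposal, and `dictionary_encode` is a hypothetical helper.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): the distinct values in first-seen order,
    and one integer code per input value."""
    index = {}
    codes = []
    for v in values:
        # setdefault assigns the next free code the first time v is seen.
        codes.append(index.setdefault(v, len(index)))
    return list(index), codes


genres = ["comedy", "horror", "comedy", "drama", "comedy"]
dictionary, codes = dictionary_encode(genres)

# An equality filter such as genre = 'comedy' compares integers, not strings.
target = dictionary.index("comedy")
matches = [i for i, code in enumerate(codes) if code == target]
print(dictionary, codes, matches)
```

This pays off mainly for low-cardinality columns such as genre or rating, where the dictionary stays small relative to the number of rows.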

@asg017 asg017 pinned this issue Jun 23, 2024
@asg017 asg017 mentioned this issue Jul 24, 2024
@forrestbao

I really hope this feature becomes available soon.

@ajram23

ajram23 commented Sep 24, 2024

@asg017 +1. Looks like langchain expects the metadata to be available as a dictionary. I have tried the integration and this is the last remaining piece to migrate fully out of ChromaDB.

@lojik-ng

lojik-ng commented Oct 1, 2024

+1

@charnould

@asg017 First, thanks for making sqlite-vec! SQLite is an amazing piece of software, and your extension is a blessing! Pre-filtering is super important and useful for any RAG app. Do you have an ETA for this? Thanks again.

@asg017
Owner Author

asg017 commented Oct 30, 2024

@charnould the PR is #124 but it needs tests and docs. Aiming for public release in 3 weeks (~Nov 19th) but hoping to get a beta out before then that you can try!

@ajram23

ajram23 commented Oct 30, 2024

@asg017 happy to beta test when you release it!

@asg017
Owner Author

asg017 commented Nov 21, 2024

As of v0.1.6, metadata columns are now supported in sqlite-vec! See the announcement blog post for more info, and reference the sqlite-vec metadata documentation for more details.

tl;dr: you can store metadata in vec0 virtual tables using metadata columns, partition keys, or auxiliary columns, each with its own benefits and drawbacks. See the blog post or docs for a more detailed look.

Closing this issue now, feel free to file more bugs/issues if needed!

5 participants