Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison to other approaches #94

Open
davidmezzetti opened this issue Sep 1, 2024 · 3 comments
Open

Comparison to other approaches #94

davidmezzetti opened this issue Sep 1, 2024 · 3 comments

Comments

@davidmezzetti
Copy link

Hi @asg017 - Thank you for putting this library together. Storing and searching vectors directly in SQLite is a really important use case.

I believe you've come across txtai. I'm looking to add SQLite vector indexing in the next release. I've added a component for sqlite-vec and it works quite well. This enables a unified vector search with a single SQL query (i.e. join the data with the vectors) vs the current approach of having to join external indexes like Faiss with data stored in a database.

sqlite-vec has received good traction, congratulations on that. I also see there are other options out there and I was curious on your thoughts.

  • Vectorlite - Seems similar to your previous sqlite-vss library of using an external vector library and storing the index outside of the database
  • libSQL - Fork of SQLite that appears to have vector search built-in
  • SQLite - Do we think at some point something is natively added to SQLite to handle vectors/vector search?

I see you're working on having indexing in #25, which is great. Do you envision that the indexes will somehow be directly stored in the SQLite database or in an external file? Or is that still TBD?

I appreciate any insights you can provide.

@asg017
Copy link
Owner

asg017 commented Sep 1, 2024

hey @davidmezzetti thanks for the kind words! Happy to help review any sqlite-vec related PRs in other open source projects, just give me a ping.

Here are my thoughts on other tools:

Vectorlite: I haven't spent much time playing with this, but it seems great! Do note that the hnswlib and other C++ dependencies make it harder to compile and run on some platforms (WASM, smaller devices, etc.), but that probably doesn't matter much in the Python world. They also store vector indexes outside the SQLite database (in separate files), which is good for performance but has other not-so-great side-effects (may not be transaction-safe, doesn't work with SQLite backup and replication tools, deviates from SQLite's "it's a single file" ethos, etc). Also their benchmarks for sqlite-vec can be improved a bit, I'll probably send a PR to that repo sometime in the near future. Specifically insert/query performance can be improved greatly with a few PRAGMA configurations

libSQL's vector support: Also great, models after pgvector, which uses custom indexes that are not available in vanilla SQLite. But libsql folks can change whatever they want in their fork. It does require using a the libsql driver instead of the default sqlite driver, but that's usually a pip install away. In my experience the libsql vector indexes are a bit slow to insert and query from but YMMV. I'd say try both and see what you like the most.

SQLite: I don't think SQLite will ever add vector support, but I'd be happy to see it. they'll never add new custom column types besides the builtin text/blob/int/float types (so no libsql-like support), but they could possibly add an extension like their JSON/FTS5/R-Tree extensions for vector search. That being said, none of their code use SIMD operations, so I doubt it would be as fast as sqlite-vec. And if they do add something, it usually takes years for their custom extensions to make it into the core SQLite amalgamation, so it may take time. That being said, the SQLite team has surprised me in the past, so I hope they try it out.

For ANN indexes: they 100% will be stored inside the SQLite database. That way backups work, litestream replication works, and transactions continue as-is. I could maybe see a future where indexes are stored in separate files for performance reasons (an opt-in feature), but that would be far down the road.

Let me know if there's anything else I can answer, or email me directly if you wanna chat on a call

@davidmezzetti
Copy link
Author

I appreciate the quick response. I'm glad to hear that the ANN indexes will be stored inside the SQLite database.

One other thing that came to mind regarding LIMIT and K. Let's say I have another table for document content and filterable fields. If I run a query joining those together and filter on fields from both the document table and vector table, would the query return all results up to LIMIT? Or does the LIMIT OR K clause first filter the vector table?

SELECT d.* FROM documents d 
LEFT JOIN vectors v ON v.indexid = d.indexid
WHERE d.flag=1 AND v.embedding MATCH ?
LIMIT 100

@davidmezzetti
Copy link
Author

I added sqlite-vec as an ANN to txtai here - neuml/txtai#780

A future task will integrate this even further into a single SQLite data file with content and vectors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants