[bounty] [pipe] [developer program] search #1188
/bounty 1000 |
💎 $1,000 bounty • Screenpi.pe
Thank you for contributing to mediar-ai/screenpipe!
|
/attempt #1188
|
What kind of embedding model and vector database do you want to use for semantic search? In my opinion, OpenAI embeddings with Pinecone or Chroma would be great. |
Please give us your plan first, and please read #1184. |
For semantic search we use nomic-embed-text through Ollama at the moment; please consider implementing it in Rust if you can. As for the database, we need to minimize extra dependencies for users and complication for the project overall. I'd go with something like SQLite-vss and build incremental indexing on top of it, but it's up for discussion. |
You can propose other improvements or features for search. Have you used the pipe? |
Would like to give this a try. Process: Semantic Search:
A sample implementation of embedding on the Rust side can be found here
cc: @m13v for the review. |
@m13v I have a similar plan. I use the search pipe quite often and have noticed some errors: the offset is not working properly, and the pipe freezes after increasing the page size. I'm also thinking about some UI changes, like adding more filters if necessary. As you mentioned, we want very few dependencies, so I'll write it in native Rust. tokenizer - |
guys:
|
guys @b4s36t4 @kumarvivek1752, great suggestions overall, let's discuss a bit more. By the way, we already have the endpoint for semantic search, but here is the current situation and the projects (think of each project as a separate bounty):
Project 1: live indexing of the data. We don't do any indexing at the moment. Even though SQLite extensions support indexing, they don't support incremental indexing, which is crucial since we add new embeddings every few seconds and users are primarily interested in searching their latest data. So first we need to implement embedding in Rust, then we need to create incremental indexing.
Project 2: embed and index the data you already have. It's mostly a UX thing, because we don't want to freeze the user's computer for an unknown period. What if the user quits screenpipe in the middle of the process? We also don't know how much data they already have, etc. I think this project is more for the future.
Project 3: fix current issues, like the ones you mentioned @kumarvivek1752: page size issues, filters, etc. Happy to hear a more detailed plan here, but again, this one will require you to test it as a user, back and forth on some use case, to make sure it works smoothly.
Let me know what you want to work on or if you have other comments. |
Hi, @m13v. Thanks for the feedback. As I stated in my proposal, I'm leaning towards implementing the embedding in Rust, which I have already done in one of my personal projects; it did a pretty good job compared to running the embedding model in a WASM runtime with the DOM. Please take a look at the implementation I attached in the proposal comment, and let me know your thoughts on the base implementation. Regarding incremental indexing, I'm not sure I understand it yet; if you could share some more details, that would be very helpful. |
@b4s36t4 yes, we need to implement the embedding in Rust. The example implementation looks good, but we need to use nomic-embed-text. SQLite vector extensions do support indexing, but if the embedding table is updated, the indexing would redo the entire table, while we should have incremental indexing. Ask Claude: "does sqlite_vec extension support incremental indexing", then ask "can you explain challenges of incremental indexing?", and then ask "how to implement incremental indexing for embeddings in sqlite". |
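To make the incremental idea concrete, here is a minimal sketch, not the screenpipe implementation and not the sqlite-vec API. It assumes embeddings live in rows with monotonically increasing integer ids starting at 1, and it uses a plain in-memory brute-force cosine index. The names `Row`, `IncrementalIndex`, `update`, and `search` are hypothetical. The only point it illustrates is the high-water mark: each update ingests rows newer than the last indexed id instead of rebuilding from the whole table.

```rust
/// One stored embedding, keyed by the row id it came from (hypothetical schema).
#[derive(Debug, Clone)]
pub struct Row {
    pub id: i64,
    pub embedding: Vec<f32>,
}

/// In-memory index that remembers the highest row id it has already ingested.
#[derive(Debug, Default)]
pub struct IncrementalIndex {
    last_indexed_id: i64,          // high-water mark; 0 means "nothing indexed yet"
    entries: Vec<(i64, Vec<f32>)>, // (row id, unit-length embedding)
}

/// Scale a vector to unit length so a dot product equals cosine similarity.
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / norm).collect()
    }
}

impl IncrementalIndex {
    pub fn new() -> Self {
        Self::default()
    }

    /// Ingest only rows with an id above the high-water mark.
    /// Returns how many rows were actually added.
    pub fn update(&mut self, rows: &[Row]) -> usize {
        let watermark = self.last_indexed_id;
        let mut added = 0;
        for row in rows {
            if row.id <= watermark {
                continue; // already indexed in an earlier update
            }
            self.entries.push((row.id, normalize(&row.embedding)));
            self.last_indexed_id = self.last_indexed_id.max(row.id);
            added += 1;
        }
        added
    }

    /// Brute-force top-k search by cosine similarity, best match first.
    pub fn search(&self, query: &[f32], k: usize) -> Vec<(i64, f32)> {
        let q = normalize(query);
        let mut scored: Vec<(i64, f32)> = self
            .entries
            .iter()
            .map(|(id, e)| (*id, e.iter().zip(&q).map(|(a, b)| a * b).sum::<f32>()))
            .collect();
        scored.sort_by(|a, b| b.1.total_cmp(&a.1));
        scored.truncate(k);
        scored
    }
}
```

In a real setup the rows passed to `update` would presumably come from a query such as "all embeddings with id greater than the stored mark", and the mark itself would need to be persisted so that a restart does not trigger a full rebuild.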
Hi, @m13v, sorry for the late reply. Yes, I think I'm good. I want to tackle the embedding and the incremental update. I'm planning to use a scheduled job that indexes 100 records at a time when CPU usage is low (we will need to monitor usage during indexing so that we don't put pressure on the CPU/GPU). Let me know what you think; if it's OK, I can get started. |
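The batching plan above could look roughly like the sketch below. Everything in it is an assumption for illustration: the names `BatchIndexer`, `tick`, and `CPU_THRESHOLD` are hypothetical, the 0.5 threshold is arbitrary, and the CPU reading is injected as a closure because the thread does not say how usage would be measured. The embedding and indexing step is a placeholder that only records which ids were processed.

```rust
use std::collections::VecDeque;

/// Records handled per scheduler tick, as proposed in the comment above.
pub const BATCH_SIZE: usize = 100;
/// Hypothetical cut-off: skip the tick when CPU usage (0.0 to 1.0) is above this.
pub const CPU_THRESHOLD: f32 = 0.5;

/// Processes pending record ids in batches, but only while the machine is idle enough.
pub struct BatchIndexer<F: FnMut() -> f32> {
    pending: VecDeque<i64>,
    cpu_usage: F,      // injected probe; a real one would query the OS
    indexed: Vec<i64>, // stands in for "embedded and written to the index"
}

impl<F: FnMut() -> f32> BatchIndexer<F> {
    pub fn new(pending: impl IntoIterator<Item = i64>, cpu_usage: F) -> Self {
        Self {
            pending: pending.into_iter().collect(),
            cpu_usage,
            indexed: Vec::new(),
        }
    }

    /// One scheduler tick. Returns the number of records indexed,
    /// which is 0 when the CPU is busy or nothing is pending.
    pub fn tick(&mut self) -> usize {
        if (self.cpu_usage)() > CPU_THRESHOLD {
            return 0; // back off and try again on the next tick
        }
        let n = self.pending.len().min(BATCH_SIZE);
        let batch: Vec<i64> = self.pending.drain(..n).collect();
        // Placeholder: real code would compute embeddings and update the index here.
        self.indexed.extend(batch);
        n
    }

    /// Records still waiting to be indexed.
    pub fn remaining(&self) -> usize {
        self.pending.len()
    }

    /// Records processed so far.
    pub fn indexed_count(&self) -> usize {
        self.indexed.len()
    }
}
```

A caller would invoke `tick` on a timer. Because each tick handles at most one batch and unprocessed ids stay in the queue, a busy tick or an interrupted run loses no work.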
That’s great! A follow-up question: how can we test the usefulness of the semantic search? Would you be able to test it with a user?
|
I can run screenpipe for some time on my system to populate some data into my SQLite database, right? If that doesn't work, maybe I can ask someone from the community. |
[please read the developer program description https://github.com//issues/1184]
suggest other ideas, send us your plan, write comments, give feedback