
[bounty] [pipe] [developer program] search #1188

Open
m13v opened this issue Jan 22, 2025 · 16 comments
Labels
💎 Bounty enhancement New feature or request

Comments

@m13v
Contributor

m13v commented Jan 22, 2025

[please read the developer program description https://github.com//issues/1184]

  • improve keyword-based search functionality.
  • add semantic search capabilities for real-time and retrospective data

suggest other ideas, send us your plan, write comments, give feedback

@m13v m13v added the enhancement New feature or request label Jan 22, 2025
@m13v
Contributor Author

m13v commented Jan 22, 2025

/bounty 1000


algora-pbc bot commented Jan 22, 2025

💎 $1,000 bounty • Screenpi.pe

Steps to solve:

  1. Start working: Comment /attempt #1188 with your implementation plan
  2. Submit work: Create a pull request including /claim #1188 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to mediar-ai/screenpipe!


Attempt Started (GMT+0) Solution
🟢 @kumarvivek1752 Jan 22, 2025, 2:24:45 PM WIP
🟢 @b4s36t4 Jan 22, 2025, 8:50:23 PM WIP

@kumarvivek1752
Contributor

kumarvivek1752 commented Jan 22, 2025

/attempt #1188

Algora profile: @kumarvivek1752 · Completed bounties: 1 mediar-ai bounty · Tech: HTML, TypeScript, CSS & more

@kumarvivek1752
Contributor

What kind of embedding model and vector database do you want to use for semantic search? In my opinion, OpenAI embeddings with Pinecone or Chroma would be great.

@m13v
Contributor Author

m13v commented Jan 22, 2025

please first give us your plan, please read #1184

@m13v
Contributor Author

m13v commented Jan 22, 2025

for semantic search we use nomic-embed-text through ollama atm. please consider implementing it in rust if you can. as for the database, we need to minimize extra dependencies for users and complication for the project overall. I'd go with something like SQLite-vss and build incremental indexing on top of it, but it's up for discussion.

@m13v
Contributor Author

m13v commented Jan 22, 2025

you can also propose other improvements or features for search. have you used the pipe?

@b4s36t4
Contributor

b4s36t4 commented Jan 22, 2025

Would like to give this a try.
/attempt #1188

Process:

Semantic Search:
- Tokenizer - The embedding model requires a custom tokenizer, which can be fetched from hf_hub and loaded into the tokenizer lib.
- Inference - Use the ort runtime to load the ONNX model and generate embeddings.
  Embeddings are generated for each token, so we need to do mean pooling to produce one combined embedding for the whole text.
  (Uses ndarray for token tensors, mean pooling, etc.; can take inspiration from transformers.js.)
- sqlite_vec extension to add vector DB support to SQLite. (Personally tested and it's excellent.)
  Need to verify whether it works with sqlx or not.
- Create a new endpoint (maybe called /semantic) so the current search does not break and so we can support more features with the vector DB.
- The /semantic endpoint will simply embed the query passed in the request and search the DB.

We already have the sqlite_vec extension loaded for sqlx, which we can simply use.
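
A minimal sketch of the mean-pooling step from the plan above, in plain Rust with no crates. It assumes the model output has already been copied into one `Vec<f32>` per token and that an attention mask marks padding with 0; the function names `mean_pool` and `l2_normalize` are hypothetical and not taken from the screenpipe code or from ort.

```rust
/// Mean-pool per-token embeddings into a single vector for the whole text.
/// `token_embeddings` is laid out as [token][dim]; `attention_mask` holds 1 for
/// real tokens and 0 for padding, so padding does not dilute the average.
fn mean_pool(token_embeddings: &[Vec<f32>], attention_mask: &[u8]) -> Vec<f32> {
    let dim = token_embeddings.first().map_or(0, |v| v.len());
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (emb, &mask) in token_embeddings.iter().zip(attention_mask) {
        if mask == 0 {
            continue; // skip padding tokens
        }
        for (acc, x) in pooled.iter_mut().zip(emb) {
            *acc += *x;
        }
        count += 1.0;
    }
    if count > 0.0 {
        for acc in pooled.iter_mut() {
            *acc /= count;
        }
    }
    pooled
}

/// L2-normalize in place so that a dot product between two vectors
/// equals their cosine similarity.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```

Calling `mean_pool` on the token vectors for one chunk of text and then `l2_normalize` on the result gives one fixed-size vector that can be stored next to the row it came from.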

Sample implementation of embedding on rust side can be found here

Algora profile: @b4s36t4 · Completed bounties: 1 mediar-ai bounty + 3 bounties from 2 projects · Tech: TypeScript, Rust, JavaScript & more · Active attempts: #1106, #1190

cc: @m13v for the review.

@kumarvivek1752
Contributor

@m13v I have a similar plan. I use the search pipe quite often and noticed some errors: the offset does not work properly, and the pipe freezes after increasing the page size. I'm also thinking about some UI changes, like adding more filters if necessary.

As you mentioned, we want very few dependencies, so I'll write it in native Rust and use:

tokenizer - hf_hub
embeddings - nomic-embed-text
vector database - sqlite-vss

@louis030195
Collaborator

guys:

  1. suggest mockups of the end user experience
  2. think about the technical details later (PS: can you read the code before suggesting things? we already have what you suggest mostly)

@m13v
Contributor Author

m13v commented Jan 24, 2025

guys @b4s36t4 @kumarvivek1752, great suggestions overall, let's discuss a bit more. btw we already have the endpoint for semantic search, but here is the current situation and the projects (think of each project as a separate bounty):

Project 1: live indexing of the data - we don't do any indexing atm. even though sqlite extensions support indexing, they don't support incremental indexing, which is crucial since we add new embeddings every few seconds and users are primarily interested in searching their latest data. so first we need to implement embedding in rust, then we need to create incremental indexing
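
A rough sketch of what "only embed what is new" could look like, under stated assumptions: rows carry a monotonically increasing integer id, and the embedding function is passed in as a closure. `Row`, `IncrementalIndexer`, and `index_new_rows` are hypothetical names for illustration, not existing screenpipe types, and the in-memory `Vec` stands in for whatever table would actually hold the vectors.

```rust
/// One captured piece of text, e.g. an OCR frame or a transcript chunk (hypothetical shape).
struct Row {
    id: i64,
    text: String,
}

/// Holds the embeddings produced so far plus a watermark: the highest row id
/// that has already been embedded.
struct IncrementalIndexer {
    last_indexed_id: i64,
    embeddings: Vec<(i64, Vec<f32>)>,
}

impl IncrementalIndexer {
    fn new() -> Self {
        Self { last_indexed_id: 0, embeddings: Vec::new() }
    }

    /// Embed only rows whose id is above the watermark, then advance the watermark.
    /// Returns how many rows were newly embedded.
    fn index_new_rows<F>(&mut self, rows: &[Row], embed: F) -> usize
    where
        F: Fn(&str) -> Vec<f32>,
    {
        let watermark = self.last_indexed_id;
        let mut added = 0;
        for row in rows.iter().filter(|r| r.id > watermark) {
            self.embeddings.push((row.id, embed(row.text.as_str())));
            self.last_indexed_id = self.last_indexed_id.max(row.id);
            added += 1;
        }
        added
    }
}
```

Running `index_new_rows` every few seconds would then cost time proportional to the new rows only, not to the whole table. Persisting `last_indexed_id` would also let the process resume after a restart, though that part is not shown here.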

Project 2: embed and index the data you already have. it's mostly a UX thing, cause we don't want to freeze the user's computer for an unknown period, and what if the user quits screenpipe in the middle of the process? we also don't know how much data they already have, etc. i think this project is more for the future

Project 3: fix current issues, like you mentioned @kumarvivek1752: page size issues, filter etc. happy to hear a more detailed plan here, but again this one will require you to test it as a user back and forth on some use case to make sure it works smoothly

let me know what you want to work on or if you have other comments

@b4s36t4
Contributor

b4s36t4 commented Jan 24, 2025

Hi, @m13v. Thanks for the feedback. As I stated in my proposal, I'm leaning towards implementing the embedding in Rust. I already did this in one of my personal projects, and it did a pretty good job compared to running the embedding model on a wasm runtime with DOM.

Please take a look at the implementation I attached in the proposal comment. Let me know your thoughts on the base implementation.

Regarding the incremental indexing, I'm not sure I understand it yet. If you could shed some more light on the exact details, that would be very helpful.

@m13v
Contributor Author

m13v commented Jan 25, 2025

@b4s36t4 yes we need to implement the embedding in rust. the example implementation looks good, but we need to use nomic-embed-text. sqlite vector extensions do support indexing, but if the embedding table is updated, the indexing would redo the entire table, while we should have incremental indexing. ask claude: "does sqlite_vec extension support incremental indexing", then ask "can you explain challenges of incremental indexing?", and then ask "how to implement incremental indexing for embeddings in sqlite"
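
To make the rebuild-versus-append distinction concrete, here is a small sketch of an append-only flat index: adding a vector is a single push and nothing is rebuilt, at the price of a brute-force scan at query time. This is an illustration only, not how sqlite_vec or sqlite-vss work internally; `FlatIndex`, `add`, and `search` are hypothetical names.

```rust
/// Append-only flat index: vectors are stored normalized in one contiguous
/// buffer, so inserting never touches existing entries.
struct FlatIndex {
    dim: usize,
    ids: Vec<i64>,
    vectors: Vec<f32>, // ids.len() * dim values
}

/// Return an L2-normalized copy (a zero vector is returned unchanged).
fn normalized(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / norm).collect()
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl FlatIndex {
    fn new(dim: usize) -> Self {
        assert!(dim > 0, "dimension must be positive");
        Self { dim, ids: Vec::new(), vectors: Vec::new() }
    }

    /// Incremental insert: cost depends on `dim` only, not on the index size.
    fn add(&mut self, id: i64, v: &[f32]) {
        assert_eq!(v.len(), self.dim);
        self.ids.push(id);
        self.vectors.extend(normalized(v));
    }

    /// Brute-force cosine search, returning the top `k` (id, score) pairs.
    fn search(&self, query: &[f32], k: usize) -> Vec<(i64, f32)> {
        let q = normalized(query);
        let mut scored: Vec<(i64, f32)> = self
            .ids
            .iter()
            .zip(self.vectors.chunks(self.dim))
            .map(|(&id, v)| (id, dot(&q, v)))
            .collect();
        scored.sort_by(|a, b| b.1.total_cmp(&a.1));
        scored.truncate(k);
        scored
    }
}
```

The trade-off is that search time grows linearly with the number of stored vectors. An approximate index would search faster, but keeping it up to date as rows arrive is exactly the hard part being discussed.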

@b4s36t4
Contributor

b4s36t4 commented Jan 30, 2025

Hi, @m13v. Sorry for replying so late. Yes, I think I'm good. I want to tackle the embedding and the incremental update.

I'm planning to use a scheduled job that indexes 100 records at a time when CPU usage is low (we will need to monitor usage during indexing so that we don't put pressure on the CPU/GPU).

Let me know what you think. If it's OK, I can get started.
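
A sketch of one scheduler tick for the batching idea above, under assumptions: the load reading and the per-batch work are passed in as closures, since how CPU usage would actually be measured is not specified in this thread. `run_indexing_tick` and its parameters are hypothetical names.

```rust
/// Drain `pending` in batches of `batch_size` while the reported load stays
/// below `max_load`. Stops as soon as the machine looks busy, leaving the
/// remaining ids in `pending` for the next tick. Returns the number processed.
fn run_indexing_tick<L, P>(
    pending: &mut Vec<i64>,
    batch_size: usize,
    max_load: f32,
    cpu_load: L,
    mut process: P,
) -> usize
where
    L: Fn() -> f32,    // assumed to return current load in the range 0.0..=1.0
    P: FnMut(&[i64]),  // embeds and stores one batch of record ids
{
    if batch_size == 0 {
        return 0; // avoid looping forever on an empty batch
    }
    let mut done = 0;
    while !pending.is_empty() && cpu_load() < max_load {
        let n = batch_size.min(pending.len());
        let batch: Vec<i64> = pending.drain(..n).collect();
        process(&batch);
        done += n;
    }
    done
}
```

With `batch_size` set to 100, each call handles as many 100-record batches as the load threshold allows. Checking the load between batches rather than once per tick means a long backlog backs off quickly when the user starts doing something heavy.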

@m13v
Contributor Author

m13v commented Feb 1, 2025 via email

@b4s36t4
Contributor

b4s36t4 commented Feb 2, 2025

I can run screenpipe for some time on my system to populate some data into my sqlite database, right? If that doesn't work, maybe I can ask someone from the community.


4 participants