Compute text-embeddings for incoming messages via HF feature-extraction pipeline #507

Closed
andreaskoepf opened this issue Jan 7, 2023 · 7 comments

@andreaskoepf
Collaborator

andreaskoepf commented Jan 7, 2023

We want to store an embedding together with each message in the DB to measure similarity and diversity (e.g. to detect (near-)duplicates).
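To illustrate how stored embeddings could be compared, here is a minimal sketch of cosine similarity with a near-duplicate check. The function names and the 0.95 threshold are assumptions made for this example; the issue does not prescribe a metric or a cutoff.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(
    a: Sequence[float], b: Sequence[float], threshold: float = 0.95
) -> bool:
    """Flag two embeddings as near-duplicates (threshold is illustrative only)."""
    return cosine_similarity(a, b) >= threshold
```

A suitable threshold would depend on the chosen model and would need to be tuned on real message data.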

  1. Select a model for the embedding calculation, e.g. see https://huggingface.co/models?pipeline_tag=feature-extraction&sort=downloads (MiniLM & LaBSE were mentioned in internal discussions). Potentially discuss with the ML team on Discord. A multilingual model would be preferred; if in doubt, choose a popular one.
  2. Find a way to store the embedding vector as a Postgres array via SQLModel (maybe like shown here?), add a new (nullable) <short_modelname>_embedding column to store the embedding of the message text, and create an Alembic update script.
  3. Use the HuggingFaceAPI class to make an async web call for each incoming message and store the embedding in the DB. In case of an exception, store NULL in the embedding field (and store the message successfully anyway).
  4. Create a new debug flag in the backend settings class (default False) that allows disabling the embedding calculations. Set the env variable to True in the scripts/backend_development/run-local.sh script.
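The control flow of steps 3 and 4 could look roughly like the sketch below. It does not use the real HuggingFaceAPI class, whose interface is not shown in this issue. Instead, `embed` is a hypothetical async callable standing in for that web call, and `skip_embedding` is a hypothetical name for the debug flag.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical type: any async callable mapping text to an embedding vector.
# In the backend this role would be played by a call made through HuggingFaceAPI.
EmbedFn = Callable[[str], Awaitable[List[float]]]


async def embedding_or_none(
    text: str,
    embed: EmbedFn,
    skip_embedding: bool = False,
) -> Optional[List[float]]:
    """Return the embedding for `text`, or None if skipped or the call fails."""
    if skip_embedding:
        # Step 4: the debug flag disables the embedding calculation entirely.
        return None
    try:
        return await embed(text)
    except Exception:
        # Step 3: a failed call must not block storing the message,
        # so the caller writes NULL into the embedding field.
        return None


async def _demo() -> None:
    async def fake_embed(text: str) -> List[float]:
        return [float(len(text)), 0.0]

    async def failing_embed(text: str) -> List[float]:
        raise RuntimeError("inference endpoint unavailable")

    print(await embedding_or_none("hello", fake_embed))
    print(await embedding_or_none("hello", failing_embed))
    print(await embedding_or_none("hello", fake_embed, skip_embedding=True))


if __name__ == "__main__":
    asyncio.run(_demo())
```

Passing the embedding call in as a parameter keeps the fallback logic testable without network access. A real implementation would probably also log the exception before returning None.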

(Non-collaborators: Please leave a comment if you want to work on this task. Someone will then assign the task to you.)

@jojopirker
Contributor

I would take a look at this!

Would it make sense to save the embeddings in a new table? My thinking is that with a new table with the columns message_id, model_name & embedding we could simply store multiple embeddings and experiment with different models.

@nil-andreu
Contributor

I think I could also take a look at this one, as it is related to classification of the messages in HF.

@olliestanley
Collaborator

> I would take a look at this!
>
> Would it make sense to save the embeddings in a new table? My thinking is that with a new table with the columns message_id, model_name & embedding we could simply store multiple embeddings and experiment with different models.

Having a new table would make sense to me to minimise schema changes on new models
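The separate-table idea can be sketched as follows. To keep the example runnable without a database server, it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL and serializes the vector as JSON rather than a Postgres array. The table name `message_embedding` and the helper functions are hypothetical.

```python
import json
import sqlite3
from typing import List, Optional

# One row per (message, model) pair, so adding a model needs no schema change.
SCHEMA = """
CREATE TABLE message_embedding (
    message_id TEXT NOT NULL,
    model_name TEXT NOT NULL,
    embedding  TEXT,  -- JSON-encoded vector; NULL if the embedding call failed
    PRIMARY KEY (message_id, model_name)
)
"""


def store_embedding(
    conn: sqlite3.Connection,
    message_id: str,
    model_name: str,
    embedding: Optional[List[float]],
) -> None:
    """Insert or overwrite the embedding of one message for one model."""
    payload = None if embedding is None else json.dumps(embedding)
    conn.execute(
        "INSERT OR REPLACE INTO message_embedding VALUES (?, ?, ?)",
        (message_id, model_name, payload),
    )


def load_embedding(
    conn: sqlite3.Connection, message_id: str, model_name: str
) -> Optional[List[float]]:
    """Return the stored vector, or None if it is missing or NULL."""
    row = conn.execute(
        "SELECT embedding FROM message_embedding "
        "WHERE message_id = ? AND model_name = ?",
        (message_id, model_name),
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return json.loads(row[0])
```

The composite primary key is what allows several models to coexist for the same message. In the actual backend the same shape would presumably be expressed as an SQLModel class with an array column plus an Alembic migration.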

@huu4ontocord
Collaborator

huu4ontocord commented Jan 8, 2023

@SummerSigh see this issue. Similar to embedders we are building for safety. Let's all keep in contact re this so we can cross use stuff @jojopirker.

@jojopirker
Contributor

I'll ping you guys in the discord channel :) @ontocord

@SummerSigh
Collaborator

@ontocord Ok! Sounds good!

@andreaskoepf andreaskoepf moved this from 📫 Triage to ⚙ In Progress in Open-Assistant Jan 9, 2023
@yk
Collaborator

yk commented Jan 10, 2023

if I understand correctly, this was solved in #540

@yk yk closed this as completed Jan 10, 2023
@github-project-automation github-project-automation bot moved this from ⚙ In Progress to ✅ Done in Open-Assistant Jan 10, 2023