Compute text-embeddings for incoming messages via HF feature-extraction pipeline #507

andreaskoepf · 2023-01-07T22:05:53Z

We want to store an embedding together with each message in the DB to measure similarity and diversity (e.g. to detect (near-)duplicates).
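As a rough illustration of how stored embeddings could be used to flag (near-)duplicates, here is a minimal sketch based on cosine similarity. The function names and the 0.95 threshold are assumptions made for this example; they are not taken from the project and the threshold is not a tuned value.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def is_near_duplicate(
    a: Sequence[float], b: Sequence[float], threshold: float = 0.95
) -> bool:
    # The 0.95 threshold is an illustrative assumption, not a tuned value.
    return cosine_similarity(a, b) >= threshold
```

In practice the vectors would come from whichever embedding model is selected, and the threshold would need to be calibrated on real messages.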

Select a model for the embedding calculation, e.g. see https://huggingface.co/models?pipeline_tag=feature-extraction&sort=downloads (MiniLM & LaBSE were mentioned in internal discussions). Potentially discuss with the ML team on Discord. A multilingual model would be preferred; if in doubt, choose a popular one.
Find a way to store the embedding vector as a Postgres array via SQLModel (maybe like shown here?), add a new (nullable) <short_modelname>_embedding column to store the embedding of the message text, and create an Alembic update script.
Use the HuggingFaceAPI class to make an async web call for each incoming message and store the embedding in the DB. In case of an exception, store NULL in the embedding field (and store the message successfully anyway).
Create a new debug flag in the backend settings class (default False) that allows disabling the embedding calculations. Set the env variable to True in the scripts/backend_development/run-local.sh script.
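The error-handling and debug-flag behaviour described in the last two tasks could be sketched roughly as follows. This is not the project's code: `embed_fn` is a hypothetical stand-in for whatever async call the HuggingFaceAPI class exposes (its real signature is not shown in this issue), `skip_embeddings` stands in for the proposed settings flag, and `db` is a plain list in place of a database session.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical signature: takes the message text, returns the embedding vector.
EmbedFn = Callable[[str], Awaitable[List[float]]]


async def embed_or_none(
    text: str,
    embed_fn: EmbedFn,
    skip_embeddings: bool = False,  # stands in for the proposed debug flag
) -> Optional[List[float]]:
    """Return the embedding, or None if disabled or the call fails."""
    if skip_embeddings:
        return None
    try:
        return await embed_fn(text)
    except Exception:
        # Per the task description: the message is still stored,
        # with NULL in the embedding field.
        return None


async def store_message(
    text: str, embed_fn: EmbedFn, db: list, skip_embeddings: bool = False
) -> dict:
    # `db` is a plain list standing in for the real database session.
    embedding = await embed_or_none(text, embed_fn, skip_embeddings)
    row = {"text": text, "embedding": embedding}
    db.append(row)
    return row
```

The point of the sketch is that a failing or disabled embedding call never blocks the message itself from being stored.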

(Non-collaborators: Please leave a comment if you want to work on this task. Someone will then assign the task to you.)


jojopirker · 2023-01-08T09:27:35Z

I would take a look at this!

Would it make sense to save the embeddings in a new table? My thinking is that with a new table with the columns message_id, model_name & embedding we could simply store multiple embeddings and experiment with different models.
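A separate table along these lines might look roughly like the sketch below. It uses Python's built-in sqlite3 module purely so the example is self-contained; the table name `message_embedding`, the composite primary key, and the JSON-encoded vector are assumptions made for illustration (in Postgres the vector would more likely be a float array column).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE message_embedding (
        message_id TEXT NOT NULL,
        model_name TEXT NOT NULL,
        embedding  TEXT NOT NULL,  -- JSON here; a float array in Postgres
        PRIMARY KEY (message_id, model_name)
    )
    """
)


def save_embedding(message_id: str, model_name: str, embedding: list) -> None:
    # One row per (message, model) pair, so adding a model needs no schema change.
    conn.execute(
        "INSERT OR REPLACE INTO message_embedding VALUES (?, ?, ?)",
        (message_id, model_name, json.dumps(embedding)),
    )


def load_embeddings(message_id: str) -> dict:
    rows = conn.execute(
        "SELECT model_name, embedding FROM message_embedding WHERE message_id = ?",
        (message_id,),
    )
    return {name: json.loads(vec) for name, vec in rows}


# Two hypothetical models with different vector sizes for the same message.
save_embedding("msg-1", "model-a", [0.1, 0.2])
save_embedding("msg-1", "model-b", [0.3, 0.4, 0.5])
```

Because the model name is part of the key, embeddings of different dimensionality can sit side by side for the same message.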

nil-andreu · 2023-01-08T10:23:38Z

I think I could also take a look at this one, as it is related to classification of the messages in HF.

olliestanley · 2023-01-08T11:00:22Z

> I would take a look at this!
>
> Would it make sense to save the embeddings in a new table? My thinking is that with a new table with the columns message_id, model_name & embedding we could simply store multiple embeddings and experiment with different models.

Having a new table would make sense to me to minimise schema changes on new models

huu4ontocord · 2023-01-08T18:13:36Z

@SummerSigh see this issue. Similar to embedders we are building for safety. Let's all keep in contact re this so we can cross use stuff @jojopirker.

jojopirker · 2023-01-08T18:55:28Z

I'll ping you guys in the discord channel :) @ontocord

SummerSigh · 2023-01-08T20:31:26Z

@ontocord Ok! Sounds good!

yk · 2023-01-10T20:27:18Z

if I understand correctly, this was solved in #540

andreaskoepf added backend ml labels Jan 7, 2023

andreaskoepf added this to the Minimum Viable Prototype milestone Jan 7, 2023

andreaskoepf added this to Open-Assistant Jan 7, 2023

github-project-automation bot moved this to 📫 Triage in Open-Assistant Jan 7, 2023

olliestanley assigned jojopirker Jan 8, 2023

nil-andreu mentioned this issue Jan 8, 2023

Store Message embedding #540

Merged

andreaskoepf moved this from 📫 Triage to ⚙ In Progress in Open-Assistant Jan 9, 2023

yk closed this as completed Jan 10, 2023

github-project-automation bot moved this from ⚙ In Progress to ✅ Done in Open-Assistant Jan 10, 2023
