Store Message embedding #540

Merged
merged 13 commits into from
Jan 9, 2023

Conversation

nil-andreu
Contributor

Create message embeddings and store them in the DB.

Related to this issue.

@nil-andreu nil-andreu changed the title Message embedding Store Message embedding Jan 8, 2023
@nil-andreu nil-andreu marked this pull request as ready for review January 8, 2023 20:30
Collaborator

@yk yk left a comment

hey, thank you. I've left a few comments. My biggest worry here is what happens if the text is too long. It might not "fail" — i.e., we might still get a result — but the model might not be trained for that length, so the output might actually be garbage.

@@ -41,6 +51,10 @@ async def post(self, input: str) -> Any:
async with session.post(self.api_url, headers=self.headers, json=payload) as response:
# If we get a bad response
if response.status != 200:
from loguru import logger
Collaborator

import things at top level
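The fix is to hoist the import to module scope instead of importing inside the request handler. A minimal sketch of the pattern, using the standard-library `logging` module as a stand-in for loguru (the function name and message here are illustrative, not the PR's actual code):

```python
import logging  # top-level import, stand-in for `from loguru import logger`

logger = logging.getLogger(__name__)


async def post_sketch(status: int) -> bool:
    # Mirrors the reviewed branch: log and bail out on a bad response.
    if status != 200:
        logger.error("Hugging Face API returned status %s", status)
        return False
    return True
```

Besides style, top-level imports surface missing dependencies at startup rather than mid-request.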

@@ -95,6 +95,7 @@ services:
- DEBUG_SKIP_API_KEY_CHECK=True
- DEBUG_USE_SEED_DATA=True
- MAX_WORKERS=1
- DEBUG_SKIP_EMBEDDING_COMPUTATION=True
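A flag like this is typically consumed on the backend roughly as follows. This is a hypothetical sketch — the real backend would read it through its settings object rather than `os.environ` directly, and the helper name is illustrative:

```python
import os


def embeddings_enabled() -> bool:
    # Hypothetical helper: compute embeddings unless the debug flag is set.
    return os.getenv("DEBUG_SKIP_EMBEDDING_COMPUTATION", "False").lower() != "true"


os.environ["DEBUG_SKIP_EMBEDDING_COMPUTATION"] = "True"
print(embeddings_enabled())  # False once the flag is set
```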
Collaborator

could you also add this to the ansible playbook?
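If it were added, the change would be a one-line addition to the backend service's environment in the playbook — a hypothetical sketch, since the exact file and variable layout depend on the playbook's structure:

```yaml
# Hypothetical fragment of the ansible playbook's backend environment:
environment:
  DEBUG_SKIP_EMBEDDING_COMPUTATION: "True"
```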

Contributor

When deployed via ansible, shouldn't the fetching of the embeddings be left enabled rather than skipped?

Contributor Author

When should it be skipped? Only in local-run.sh?

Contributor

That's why I asked again... it probably also won't be necessary for frontend development.

@@ -41,6 +51,10 @@ async def post(self, input: str) -> Any:
async with session.post(self.api_url, headers=self.headers, json=payload) as response:
Collaborator

do we know what happens if the text is longer than the intended text length of the model?

Contributor Author

@nil-andreu nil-andreu Jan 9, 2023

I am not sure about this.
If the model has a pre-defined maximum sequence length, the extra tokens will be truncated.

The architecture of our model is the following:

SentenceTransformer(
  (0): Transformer({
        'max_seq_length': 128,
        'do_lower_case': False
      }) with Transformer model: BertModel
  (1): Pooling({
        'word_embedding_dimension': 384,
        'pooling_mode_cls_token': False,
        'pooling_mode_mean_tokens': True,
        'pooling_mode_max_tokens': False,
        'pooling_mode_mean_sqrt_len_tokens': False})
)

So it would have a maximum sequence length of 128 tokens. The impact on us will then depend on whether a significant share of messages exceeds 128 tokens.

Related Issue here.
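To gauge that impact, a rough check like the following could count how many messages would be truncated. This is a sketch using whitespace splitting as a stand-in tokenizer — the real BertModel uses WordPiece, so actual token counts run somewhat higher than word counts:

```python
# Sketch: estimate how many messages exceed the model's max_seq_length.
MAX_SEQ_LENGTH = 128  # from the SentenceTransformer config above


def would_be_truncated(text: str, max_len: int = MAX_SEQ_LENGTH) -> bool:
    # Whitespace split approximates (undercounts) WordPiece token count.
    return len(text.split()) > max_len


messages = [
    "short message",
    " ".join(["word"] * 200),  # 200 words: exceeds the 128-token window
]
truncated = sum(would_be_truncated(m) for m in messages)
print(f"{truncated}/{len(messages)} messages would be truncated")
```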

Collaborator

@andreaskoepf andreaskoepf left a comment

  • Added my opinion regarding model-id

recommended changes:

  • remove 2x space
  • upper camel case enum name HfUrl
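The naming recommendation in sketch form — a minimal UpperCamelCase `str` enum. The member name and URL value here are illustrative assumptions, not the PR's actual definition:

```python
from enum import Enum


class HfUrl(str, Enum):  # UpperCamelCase class name, per the review
    # Illustrative member; the real value lives in the PR's code.
    HUGGINGFACE_FEATURE_EXTRACTION = (
        "https://api-inference.huggingface.co/pipeline/feature-extraction"
    )


print(HfUrl.HUGGINGFACE_FEATURE_EXTRACTION.value)
```

Mixing in `str` lets the member be passed anywhere a plain URL string is expected.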

@nil-andreu
Contributor Author

  • Added my opinion regarding model-id

Yes, all of the text processing is done by the endpoint. And we are also storing which version of that model we are using.

recommended changes:

  • remove 2x space
  • upper camel case enum name HfUrl

Okay, I have also applied those changes.

Collaborator

@yk yk left a comment

good from my side 👍

Collaborator

@andreaskoepf andreaskoepf left a comment

Looks good, thanks a lot! :-)

@yk yk merged commit ba12c35 into LAION-AI:main Jan 9, 2023
@nil-andreu
Contributor Author

Looks good, thanks a lot! :-)

Thanks!
