Pinecone fix vector search #150

Merged
merged 57 commits into from
Aug 23, 2023

Conversation

henri123lemoine
Collaborator

This PR:

  • Adds many utility functions, the most important being get_embeddings and its helpers. get_embeddings calls the moderation endpoint first, embeds only the text that was not flagged, and returns both the embeddings and the moderation results.
  • Deals with PINECONE_METADATA_KEYS and removes is_metadata_keys_equal since it was doing nothing.
  • Defines the session engine outside of make_session in order to reuse the same engine. Not sure this was necessary.
  • Adds a 'force_update' flag to pinecone_update so that the user can update a source in the Pinecone db even when its MySQL articles have pinecone_update_required set to false.
  • Adds finetuning. This includes 'finetuning_dataset.py' and 'training.py'. The former handles the iterable dataset used for finetuning, and the latter handles the models and the training process. This part still has many open issues and needs substantial improvement. We should discuss them at the meeting later today.
  • Adds query_vector, query_text, and get_embeddings_by_ids methods to the PineconeDB class. The first two are just querying the pinecone db, and the third gets embeddings from (full-)ids. Used in the finetuning process.
  • Adds the pinecone_models.py file, that includes the PineconeMetadata and PineconeEntry classes.
  • Changes update_pinecone.py quite a bit. This includes adding the chunks that got flagged to the article's comments. I don't know if something else about the article should change when certain chunks are flagged; to discuss in today's meeting.
  • (Temporarily?) sets the bias to 1 for all sources, so no source is favored anymore. Biasing can be added back later, but it's better to test it first than to guess a number and use it for the normal namespace.
  • Adds a pyproject.toml file so that black defaults to 100-character lines. Many of the cosmetic changes in the diff stem from that, unfortunately.
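
For readers skimming the description, here is a minimal sketch of the moderate-then-embed flow from the first bullet. It is not the code in this PR: the function name get_embeddings_sketch and the injected moderate and embed callables are assumptions made for illustration, standing in for the real moderation and embedding endpoint calls.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Embedding = List[float]


def get_embeddings_sketch(
    texts: Sequence[str],
    moderate: Callable[[Sequence[str]], List[bool]],
    embed: Callable[[Sequence[str]], List[Embedding]],
) -> Tuple[List[Optional[Embedding]], List[bool]]:
    """Embed only the texts that moderation did not flag.

    `moderate` and `embed` are hypothetical stand-ins for the real
    endpoint calls. Returns one slot per input text (None where the
    text was flagged) plus the per-text moderation flags.
    """
    flagged = moderate(texts)
    clean_texts = [text for text, is_flagged in zip(texts, flagged) if not is_flagged]

    # Skip the embedding call entirely when everything was flagged.
    clean_embeddings = iter(embed(clean_texts)) if clean_texts else iter(())

    # Re-align embeddings with the original positions of the inputs.
    embeddings = [None if is_flagged else next(clean_embeddings) for is_flagged in flagged]
    return embeddings, flagged
```

Keeping one slot per input makes it easy to record which chunks were flagged without losing track of chunk indices.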

I realize it would have been better to split it into smaller branches.
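
To make the force_update bullet above concrete, here is a rough sketch of the selection rule it describes. ArticleStub and select_articles_to_update are hypothetical names for illustration only; the actual logic lives in the updater and works on database rows.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ArticleStub:
    """Hypothetical stand-in for a MySQL article row."""

    id: str
    source: str
    pinecone_update_required: bool


def select_articles_to_update(
    articles: Iterable[ArticleStub],
    custom_sources: List[str],
    force_update: bool = False,
) -> List[ArticleStub]:
    """Pick the articles from the requested sources that should be re-sent.

    Without force_update, only articles marked pinecone_update_required
    are selected; with it, every article from the sources is selected.
    """
    return [
        article
        for article in articles
        if article.source in custom_sources
        and (force_update or article.pinecone_update_required)
    ]
```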

mruwnik
mruwnik previously approved these changes Aug 23, 2023
Collaborator

@mruwnik mruwnik left a comment

Looks good! At least the bits that I understood properly - I'll have to take your word for the models etc.

def get_all_valid_article_ids(session: Session) -> List[str]:
    """Return all valid article IDs."""
    query_result = (
        session.query(Article.id)

session.scalars should do the trick, without having to manually extract them

validation_dataset = FinetuningDataset(num_batches_per_epoch=BATCH_PER_EPOCH)
validation_dataloader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, num_workers=5)
best_val_loss = validate(model, validation_dataloader, criterion)
print(f"Initial validation loss (from loaded model or new model): {best_val_loss:.4f}")

[NIT] might be worth changing these prints to logger.info or whatever
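
For reference, a small sketch of that swap using the standard logging module. The helper name log_initial_validation_loss is made up for illustration; in training.py the logger call would simply replace the print in place.

```python
import logging

logger = logging.getLogger(__name__)


def log_initial_validation_loss(best_val_loss: float) -> None:
    """Log the initial validation loss instead of printing it."""
    # Passing the value as an argument defers string formatting until
    # a handler actually emits the record.
    logger.info(
        "Initial validation loss (from loaded model or new model): %.4f",
        best_val_loss,
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_initial_validation_loss(0.5)
```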

authors=self.authors,
url=self.url,
date_published=self.date_published,
text=self.text_chunks[i],

what about summaries? Will they also be embedded?

good point, we haven't considered that yet

It can be done in a later PR, though, if you want to merge this one

Although, as I understand it, if we did embed summaries, each would be its own PineconeEntry, so they would be added somewhere in PineconeUpdater's update(self, custom_sources: List[str], force_update: bool = False) function, right? In my mind, an embedded summary is treated as an individual Pinecone entry whose fields mostly share the values of the article it summarizes. If that's the case, I would imagine them being created outside the PineconeEntry class.

If we wanted them to be handled automatically within PineconeEntry, one solution would be to add the article_summaries to get_text_chunks; each summary chunk would then be seen as a chunk of the article and embedded alongside it, but that seems like a rather weak solution. Do you have other ideas?
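
To compare the two options side by side, here is a rough sketch. Everything in it is hypothetical: EntrySketch and ArticleSketch are simplified stand-ins for PineconeEntry and the article row, and chunk is a naive fixed-width splitter rather than the project's real chunking.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArticleSketch:
    """Hypothetical article with optional summaries."""

    title: str
    url: str
    text: str
    summaries: List[str] = field(default_factory=list)


@dataclass
class EntrySketch:
    """Hypothetical stand-in for PineconeEntry, reduced to a few fields."""

    title: str
    url: str
    text_chunks: List[str]
    is_summary: bool = False


def chunk(text: str, size: int = 20) -> List[str]:
    """Naive fixed-width chunking, for illustration only."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def entries_with_separate_summaries(article: ArticleSketch) -> List[EntrySketch]:
    """Option 1: the article and each summary become their own entries."""
    entries = [EntrySketch(article.title, article.url, chunk(article.text))]
    for summary in article.summaries:
        entries.append(
            EntrySketch(article.title, article.url, chunk(summary), is_summary=True)
        )
    return entries


def entry_with_inline_summaries(article: ArticleSketch) -> EntrySketch:
    """Option 2: summary chunks are appended to the article's own chunks."""
    chunks = chunk(article.text)
    for summary in article.summaries:
        chunks.extend(chunk(summary))
    return EntrySketch(article.title, article.url, chunks)
```

With option 1 a query hit can be told apart as a summary; with option 2 summary chunks are indistinguishable from body chunks unless extra metadata is added.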

Hmm, yeah I think it's worth merging this one and considering the embedding of summaries as a next step

mruwnik and others added 5 commits August 23, 2023 07:35
* filter out empty values when merging dicts

* handle invalid dates in arxiv vanity
* Bunch up blogs, special_docs and youtube

* update readme to match bunched datasets

---------

Co-authored-by: ccstan99 <[email protected]>
Thomas-Lemoine
Thomas-Lemoine previously approved these changes Aug 23, 2023
* fixed everything

* changed add_comment to append_comment

* use __init__ instead of __new__ in PineconeEntry

* remove validator since it's dealt with by __init__ I suppose

* add get_embeddings to the try block to deal with those errors
@henri123lemoine henri123lemoine merged commit c9ceb24 into main Aug 23, 2023
1 check passed
@henri123lemoine henri123lemoine deleted the pinecone-fix-vector-search branch August 23, 2023 12:27
3 participants