Pinecone fix vector search #150
Conversation
I was getting errors at some point. Might be worth removing now
Looks good! At least the bits that I understood properly - I'll have to take your word for the models etc.
def get_all_valid_article_ids(session: Session) -> List[str]:
    """Return all valid article IDs."""
    query_result = (
        session.query(Article.id)
session.scalars should do the trick, without having to manually extract them.
validation_dataset = FinetuningDataset(num_batches_per_epoch=BATCH_PER_EPOCH)
validation_dataloader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, num_workers=5)
best_val_loss = validate(model, validation_dataloader, criterion)
print(f"Initial validation loss (from loaded model or new model): {best_val_loss:.4f}")
[NIT] might be worth changing these prints to logger.info or whatever
authors=self.authors,
url=self.url,
date_published=self.date_published,
text=self.text_chunks[i],
what about summaries? Will they also be embedded?
good point, we haven't considered that yet
It can be done in a later PR, though, if you want to merge this one
Although, from my understanding of it, if we did embed summaries, they would be their own PineconeEntry, so they would be added somewhere in PineconeUpdater's update(self, custom_sources: List[str], force_update: bool = False) function, right? Like, in my mind, if we embed the summaries and add them as pinecone entries, they're treated as their own individual pinecone entry, except that most of their fields have the same values as the article they are summarizing. If that's the case, I would imagine them being created outside the PineconeEntry class.
If we wanted to have them dealt with automatically within PineconeEntry, one solution would be to add the article_summaries to get_text_chunks; this would mean each summary chunk would be seen as a chunk of the article and embedded alongside it, but that seems a bit weak as a solution. Do you have other ideas?
Hmm, yeah I think it's worth merging this one and considering the embedding of summaries as a next step
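For the record, the first option discussed above could look something like this. Everything here is a hypothetical sketch: the Article stand-in, the SummaryEntry class, and the id scheme are illustrative names, not the project's actual API; the real version would build full PineconeEntry records inside PineconeUpdater.update.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Article:  # stand-in for the project's Article model
    id: str
    title: str
    summaries: List[str]


@dataclass
class SummaryEntry:  # stand-in for a summary's own entry, mirroring PineconeEntry
    id: str
    source_article_id: str
    text: str


def summary_entries(article: Article) -> List[SummaryEntry]:
    # One entry per summary, created outside PineconeEntry as suggested above;
    # metadata fields would be copied from the parent article.
    return [
        SummaryEntry(
            id=f"{article.id}-summary-{i}",
            source_article_id=article.id,
            text=summary,
        )
        for i, summary in enumerate(article.summaries)
    ]


entries = summary_entries(Article(id="a1", title="T", summaries=["short", "long"]))
```

Keeping summaries as separate entries (rather than extra text chunks) preserves the ability to filter them in or out at query time.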
* Bunch up blogs, special_docs and youtube
* update readme to match bunched datasets

Co-authored-by: ccstan99 <[email protected]>
* fixed everything
* changed add_comment to append_comment
* use __init__ instead of __new__ in PineconeEntry
* remove validator since it's dealt with by __init__ I suppose
* add get_embeddings to the try block to deal with those errors
This PR:
* get_embeddings and co.: get_embeddings calls the moderation endpoint before embedding whatever was not flagged, and returns the embeddings as well as the moderation results.
* is_metadata_keys_equal: removed, since it was doing nothing.
* query_vector, query_text, and get_embeddings_by_ids: new methods added to the PineconeDB class. The first two are just querying the pinecone db, and the third gets embeddings from (full-)ids. Used in the finetuning process.
* pinecone_models.py: new file that includes the PineconeMetadata and PineconeEntry classes.
* update_pinecone.py: changed quite a bit. This includes adding chunks that got flagged to the article's comments. I don't know if something else about the article should be changed when certain chunks are flagged; to discuss in today's meeting.

I realize it would have been better to split it into smaller branches.
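To make the new PineconeDB surface concrete, here is a hypothetical sketch of two of the methods listed above, run against a stubbed index so it is self-contained. The method names come from the PR description; the bodies, the FakeIndex stand-in, and the cosine scoring are assumptions for illustration, not the project's real implementation (which would call the actual pinecone client).

```python
import math
from typing import Dict, List


class FakeIndex:
    """Minimal stand-in for a pinecone index; stores id -> embedding."""

    def __init__(self, vectors: Dict[str, List[float]]):
        self.vectors = vectors

    def fetch(self, ids: List[str]) -> Dict[str, List[float]]:
        return {i: self.vectors[i] for i in ids if i in self.vectors}


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


class PineconeDB:
    def __init__(self, index: FakeIndex):
        self.index = index

    def query_vector(self, vector: List[float], top_k: int = 3) -> List[str]:
        # Return the ids of the top_k most similar stored vectors.
        ranked = sorted(
            self.index.vectors.items(),
            key=lambda kv: _cosine(vector, kv[1]),
            reverse=True,
        )
        return [entry_id for entry_id, _ in ranked[:top_k]]

    def get_embeddings_by_ids(self, ids: List[str]) -> Dict[str, List[float]]:
        # Fetch stored embeddings by their (full-)ids, as used in finetuning.
        return self.index.fetch(ids)


db = PineconeDB(FakeIndex({"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}))
nearest = db.query_vector([1.0, 0.1], top_k=2)
```

query_text would presumably embed the query string first and then delegate to query_vector; that step is omitted here since it needs the embedding client.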