Pinecone fix vector search #150

henri123lemoine · 2023-08-22T15:00:23Z

This PR:

Adds many utility functions, the most important of which are get_embeddings and its related helpers. get_embeddings calls the moderation endpoint, embeds only the text that was not flagged, and returns both the embeddings and the moderation results.
Deals with PINECONE_METADATA_KEYS and removes is_metadata_keys_equal, since it was doing nothing.
Defines the session engine outside of make_session so that the same engine is reused. Not sure this was necessary.
Adds a 'force_update' flag to pinecone_update so that the user can update a source in the pinecone db even when its mysql articles have pinecone_update_required set to false.
Adds finetuning. This includes 'finetuning_dataset.py' and 'training.py'. The former deals with the iterable dataset for finetuning, and the latter deals with the models and the training process. This section has many remaining issues and still needs substantial improvement. We should discuss its issues at the meeting later today.
Adds query_vector, query_text, and get_embeddings_by_ids methods to the PineconeDB class. The first two simply query the pinecone db, and the third gets embeddings from (full-)ids. Used in the finetuning process.
Adds the pinecone_models.py file, which includes the PineconeMetadata and PineconeEntry classes.
Changes update_pinecone.py quite a bit. This includes adding chunks that got flagged to the article's comments. I don't know if something else about the article should change when certain chunks are flagged; to discuss in today's meeting.
(Temporarily?) sets the bias to 1 for all sources, so no source is favored anymore. Biasing can be added later, but it's better to test it first rather than just guessing a number and using it for the normal namespace.
Adds a pyproject.toml file so that black defaults to 100-char lines. Many of the cosmetic changes are linked to that, unfortunately.
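
For readers skimming the list above, the moderate-then-embed flow of get_embeddings could be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name get_embeddings_sketch, the injected moderate and embed callables, and the choice to return None for flagged texts are all assumptions made so the example runs without any API key.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def get_embeddings_sketch(
    texts: Sequence[str],
    moderate: Callable[[Sequence[str]], List[bool]],  # hypothetical: True means "flagged"
    embed: Callable[[Sequence[str]], List[Vector]],  # hypothetical: one vector per input text
) -> Tuple[List[Optional[Vector]], List[bool]]:
    """Moderate every text, embed only the unflagged ones, return both results."""
    flagged = moderate(texts)
    clean = [text for text, is_flagged in zip(texts, flagged) if not is_flagged]
    # Skip the embedding call entirely when everything was flagged.
    vectors = iter(embed(clean)) if clean else iter(())
    # Keep positions aligned with the input: flagged texts get None (an assumption).
    embeddings = [None if is_flagged else next(vectors) for is_flagged in flagged]
    return embeddings, flagged


# Stand-ins for the real moderation and embedding endpoints.
def fake_moderate(texts: Sequence[str]) -> List[bool]:
    return ["bad" in text for text in texts]


def fake_embed(texts: Sequence[str]) -> List[Vector]:
    return [[float(len(text))] for text in texts]


if __name__ == "__main__":
    print(get_embeddings_sketch(["hello", "bad text", "hi"], fake_moderate, fake_embed))
```

Returning the moderation results next to the embeddings lets the caller decide what to do with flagged chunks, for example recording them in the article's comments as described above.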

I realize it would have been better to split it into smaller branches.

mruwnik

Looks good! At least the bits that I understood properly - I'll have to take your word for the models etc.

mruwnik · 2023-08-23T09:56:36Z

align_data/db/session.py

+def get_all_valid_article_ids(session: Session) -> List[str]:
+    """Return all valid article IDs."""
+    query_result = (
+        session.query(Article.id)


session.scalars should do the trick, without having to manually extract them

mruwnik · 2023-08-23T09:59:09Z

align_data/embeddings/finetuning/training.py

+    validation_dataset = FinetuningDataset(num_batches_per_epoch=BATCH_PER_EPOCH)
+    validation_dataloader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, num_workers=5)
+    best_val_loss = validate(model, validation_dataloader, criterion)
+    print(f"Initial validation loss (from loaded model or new model): {best_val_loss:.4f}")


[NIT] might be worth changing these prints to logger.info or whatever
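
As a rough illustration of the nit, the print call could be replaced along these lines. The logger name and the helper function are hypothetical; only the message text comes from the diff above.

```python
import logging

# Hypothetical logger name; a module would typically use logging.getLogger(__name__).
logger = logging.getLogger("finetuning_sketch")


def log_initial_validation_loss(best_val_loss: float) -> None:
    # Passing the value as an argument defers string formatting
    # until a handler actually emits the record.
    logger.info(
        "Initial validation loss (from loaded model or new model): %.4f",
        best_val_loss,
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_initial_validation_loss(0.1234)
```

With a logger, the training output can be silenced, redirected to a file, or given timestamps through configuration rather than code changes.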

mruwnik · 2023-08-23T10:00:41Z

align_data/embeddings/pinecone/pinecone_models.py

+                    authors=self.authors,
+                    url=self.url,
+                    date_published=self.date_published,
+                    text=self.text_chunks[i],


what about summaries? Will they also be embedded?

good point, we haven't considered that yet

It can be done in a later PR, though, if you want to merge this one

Although, from my understanding, if we did embed summaries, each would be its own PineconeEntry, so they would be added somewhere in PineconeUpdater's update(self, custom_sources: List[str], force_update: bool = False) function, right? In my mind, if we embed the summaries and add them as pinecone entries, each is treated as an individual pinecone entry, except that most of its fields have the same values as the article it summarizes. If that's the case, I would imagine them being created outside the PineconeEntry class.

If we wanted them to be dealt with automatically within PineconeEntry, one solution would be to add the article_summaries to get_text_chunks; each summary chunk would then be seen as a chunk of the article and embedded alongside it, but that seems like a rather weak solution. Do you have other ideas?

Hmm, yeah I think it's worth merging this one and considering the embedding of summaries as a next step
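
To make the two options from this thread concrete, here is a small sketch under stated assumptions. EntrySketch is a hypothetical stand-in for PineconeEntry with only a few fields, and the "_summary" id suffix and the is_summary flag are invented for illustration; nothing here reflects a decision made in this PR.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class EntrySketch:
    """Hypothetical, reduced stand-in for PineconeEntry."""

    article_id: str
    title: str
    url: str
    text_chunks: List[str]
    is_summary: bool = False  # hypothetical marker, not in the PR


def summary_as_own_entry(article: EntrySketch, summary: str) -> EntrySketch:
    """Option 1: the summary becomes a separate entry sharing the article's metadata."""
    return replace(
        article,
        article_id=f"{article.article_id}_summary",  # hypothetical id scheme
        text_chunks=[summary],
        is_summary=True,
    )


def chunks_including_summary(article: EntrySketch, summary: str) -> List[str]:
    """Option 2: the summary is treated as one more chunk of the article."""
    return [*article.text_chunks, summary]


if __name__ == "__main__":
    article = EntrySketch("abc", "Some title", "https://example.com", ["chunk 1", "chunk 2"])
    print(summary_as_own_entry(article, "short summary"))
    print(chunks_including_summary(article, "short summary"))
```

Option 1 keeps summaries distinguishable at query time, while option 2 needs no changes outside the chunking step but makes a summary indistinguishable from ordinary article text.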

henri123lemoine and others added 30 commits August 15, 2023 23:40

added finetune_model.py

c3b3539

updated finetune_model.py

7bd986a

added finetune_embeddings_tests to main for testing

877bcfe

set up iterable dataset for finetuning

1e444cd

Merge remote-tracking branch 'origin' into finetune-embeddings

03d0d54

minor refactor

041d48f

added model.py

abee3f8

added get_embeddings_by_ids method to pinecone handler

c41de70

simplified update_pinecone.py

5a629cb

added (probably unnecessary) check on text splitter

a0977fb

I was getting errors at some point. Might be worth removing now

added dataset.py

5472613

added utils functions

27629a8

updated settings

83b6f79

(incorrectly) trained finetuning layer. Best so far, but BAD

8ac789f

changed the session maker to reuse engines

b5c0457

added train_finetuning_layer method to main

b65db76

added youtube api key to env.example

7458b4d

simplified dataset.py

891808d

update_pinecone refactor

d2b0ff7

added force_update to pinecone_update methods

706aa9b

small refactor+removed bias+set namespace

57dc4e7

added query method to pineconedb

141ded5

Merge remote-tracking branch 'origin/main' into finetune-embeddings

950b69f

minor refactor

15e4bec

reformat chunk headers

f23d3b1

changes_to_pinecone

840ca8c

added moderation to get_embeddings util

faff6cd

remove duplicate function

a749248

renamed embed_query to get_embedding

ddb5f8e

black reformatting

075d409

Thomas-Lemoine added 2 commits August 22, 2023 16:49

moved embedding utils out of common/utils.py

05bf335

hf_embeddings slight refactor

513b87a

Thomas-Lemoine reviewed Aug 22, 2023

align_data/common/alignment_dataset.py (resolved)

Thomas-Lemoine added 2 commits August 22, 2023 17:12

engine rename and autoflush inside session init for better type signatures

03f641b

simplified openai error types

b6a8c8c

Thomas-Lemoine reviewed Aug 22, 2023

align_data/finetuning/finetuning_dataset.py (outdated, resolved)

henri123lemoine and others added 10 commits August 22, 2023 19:06

restructured finetuning and pinecone dirs

c229cc2

added auto code formatting with black on push/PR

afe91af

Testing pre-commit hook

ed034da

PR fixes

e6864b7

moved text_splitter.py, fixed minor typing issues

d5b7657

minor bug fix

bee0106

minor typing and imports bug-fix

78812f1

simplify get random chunks

3d8a9d9

removed EmbeddingType

18fe229

moved sources tests to test/align_data/sources

78192e3

mruwnik previously approved these changes Aug 23, 2023

mruwnik and others added 5 commits August 23, 2023 07:35

Tidy up (#144)

b44fe88

* filter out empty values when merging dicts * handle invalid dates in arxiv vanity

disable pinecone encoding in actions (#148)

ebcbeec

Transformer-circuits blog (#146)

0b940d9

write to correct database when updating metadata (#149)

e7e19b9

Bunch up blogs, special_docs and youtube (#147)

42e75f0

* Bunch up blogs, special_docs and youtube * update readme to match bunched datasets --------- Co-authored-by: ccstan99 <[email protected]>

henri123lemoine dismissed mruwnik’s stale review via 42e75f0 August 23, 2023 11:35

Merge branch 'main' into pinecone-fix-vector-search

cc9df48

Thomas-Lemoine previously approved these changes Aug 23, 2023

skip entries with falsy field values (#154)

176b052

* fixed everything * changed add_comment to append_comment * use __init__ instead of __new__ in PineconeEntry * remove validator since it's dealt with by __init__ I suppose * add get_embeddings to the try block to deal with those errors

Thomas-Lemoine dismissed their stale review via 176b052 August 23, 2023 11:59

Thomas-Lemoine approved these changes Aug 23, 2023

henri123lemoine merged commit c9ceb24 into main Aug 23, 2023
1 check passed

henri123lemoine deleted the pinecone-fix-vector-search branch August 23, 2023 12:27
