
AssertionError When Adding New Documents to Existing Index via IndexUpdater #369

suhaib1769 opened this issue Sep 26, 2024 · 0 comments
Hello,
I am using this library for the first time and running into an issue. I am trying to split a large data set into smaller batches and then index the batches one at a time into the same index. Creating the initial index works, but I get an AssertionError as soon as I add more documents to it with the IndexUpdater class.

Here is a simplified version of the code I am using:

    from colbert import Indexer, Searcher
    from colbert.infra import Run, RunConfig, ColBERTConfig
    from colbert.index_updater import IndexUpdater

    if first_batch:
        # First pass: build the initial index from the first batch.
        with Run().context(RunConfig(nranks=1, experiment='first_batch')):
            config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)
            indexer = Indexer(checkpoint=checkpoint, config=config)
            indexer.index(name=index_name, collection=batch_data, overwrite=True)
        first_batch = False
    else:  # add new data in batches - this repeats a few times
        with Run().context(RunConfig(experiment='notebook')):
            searcher = Searcher(index=index_name)  # Load the existing index
            index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher, checkpoint=checkpoint)
            index_updater.add(batch_data[:10])  # Add the batch to the index
            index_updater.persist_to_disk()  # Persist the changes to disk
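In case it matters, batch_data in the snippet above is a plain Python list of passage strings, roughly:

    batch_data = [
        "text of the first passage ...",
        "text of the second passage ...",
    ]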

But I get the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 3
1 # Run the function to load and combine data in batches
2 start = time.time()
----> 3 data = embed_and_index('/home/data', batch_size=100)
4 end = time.time()
5 print(f"Time taken: {end - start} seconds")

Cell In[13], line 106, in embed_and_index(parsed_directory, batch_size, limit)
104 index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher, checkpoint=checkpoint)
105 print("Index updater created")
--> 106 index_updater.add(batch_data[:10]) # Add batch to the index
107 print("Batch added")
108 index_updater.persist_to_disk() # Persist the changes to disk

File ~/index_updater.py:170, in IndexUpdater.add(self, passages)
167 curr_pid = start_pid
169 compressed_embs, doclens = self.create_embs_and_doclens(passages)
--> 170 self.update_searcher(compressed_embs, doclens, curr_pid)
172 print_message(f"#> Added {len(passages)} passages from pid {start_pid}.")
173 new_pids = list(range(start_pid, start_pid + len(passages)))

File ~/index_updater.py:129, in IndexUpdater.update_searcher(self, compressed_embs, doclens, curr_pid)
127 codes = compressed_embs.codes[start:end]
128 partitions, _ = self._build_passage_partitions(codes)
--> 129 ivf, ivf_lengths = self._add_pid_to_ivf(partitions, curr_pid, ivf, ivf_lengths)
131 start = end
132 curr_pid += 1

File ~/index_updater.py:418, in IndexUpdater._add_pid_to_ivf(self, partitions, pid, old_ivf, old_ivf_lengths)
415 new_ivf_lengths[-1] += 1
416 partitions_runner += 1
--> 418 assert ivf_runner == len(old_ivf)
419 assert sum(new_ivf_lengths) == len(new_ivf)
421 return new_ivf, new_ivf_lengths

AssertionError:

Do you know how I can resolve this? Do the batch size or other parameters (e.g., document length, embedding size) need to be adjusted when adding new documents to an existing index? Are there specific constraints to consider when initializing the IndexUpdater class to add new documents?
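One thing I noticed while writing this up: my two Run contexts use different experiment names ('first_batch' vs. 'notebook'), so the Searcher might be resolving a different index directory than the one the Indexer wrote to. I don't know whether that is related, but a version with a single consistent context would look like this (a minimal sketch; 'shared_experiment' is a placeholder name, everything else reuses the variables from above):

    # Use one experiment name for both the initial indexing and every update,
    # so Indexer and Searcher resolve the same index directory.
    with Run().context(RunConfig(nranks=1, experiment='shared_experiment')):
        searcher = Searcher(index=index_name)
        index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher,
                                     checkpoint=checkpoint)
        index_updater.add(batch_data)
        index_updater.persist_to_disk()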

My goal is to split the data into groups and index one group at a time into the same index. Is there another way to batch and index data?
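The fallback I know works is to concatenate everything and build the index in a single call, since the initial indexing succeeds, but that is what the batching was meant to avoid. Roughly (a sketch reusing the names from above; all_batches is a hypothetical list of per-batch passage lists):

    # Single-shot indexing over the concatenated batches.
    all_passages = [p for batch in all_batches for p in batch]
    with Run().context(RunConfig(nranks=1, experiment='first_batch')):
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)
        indexer = Indexer(checkpoint=checkpoint, config=config)
        indexer.index(name=index_name, collection=all_passages, overwrite=True)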
