
AssertionError When Adding New Documents to Existing Index via IndexUpdater #369

suhaib1769 opened this issue Sep 26, 2024 · 0 comments
Hello,
I am using this library for the first time and running into an issue. I am trying to split a large data set into smaller batches and then index the batches one at a time into the same index. Creating the initial index works, but I get an AssertionError as soon as I add more documents to it with the IndexUpdater class.

Here is a simplified version of the code I am using:

    from colbert import Indexer, Searcher
    from colbert.infra import Run, RunConfig, ColBERTConfig
    from colbert.index_updater import IndexUpdater

    if first_batch:
        # First pass: build the initial index from the first batch.
        with Run().context(RunConfig(nranks=1, experiment='first_batch')):
            config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)
            indexer = Indexer(checkpoint=checkpoint, config=config)
            indexer.index(name=index_name, collection=batch_data, overwrite=True)
        first_batch = False
    else:  # add new data in batches - this repeats a few times
        with Run().context(RunConfig(experiment='notebook')):
            searcher = Searcher(index=index_name)  # Load the existing index
            index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher, checkpoint=checkpoint)
            index_updater.add(batch_data[:10])  # Add the batch to the index
            index_updater.persist_to_disk()  # Persist the changes to disk
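In case it matters, batch_data in the snippet above is a plain Python list of passage strings, roughly:

    batch_data = [
        "text of the first passage ...",
        "text of the second passage ...",
    ]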

But I get the following error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 3
1 # Run the function to load and combine data in batches
2 start = time.time()
----> 3 data = embed_and_index('/home/data', batch_size=100)
4 end = time.time()
5 print(f"Time taken: {end - start} seconds")

Cell In[13], line 106, in embed_and_index(parsed_directory, batch_size, limit)
104 index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher, checkpoint=checkpoint)
105 print("Index updater created")
--> 106 index_updater.add(batch_data[:10]) # Add batch to the index
107 print("Batch added")
108 index_updater.persist_to_disk() # Persist the changes to disk

File ~/index_updater.py:170, in IndexUpdater.add(self, passages)
167 curr_pid = start_pid
169 compressed_embs, doclens = self.create_embs_and_doclens(passages)
--> 170 self.update_searcher(compressed_embs, doclens, curr_pid)
172 print_message(f"#> Added {len(passages)} passages from pid {start_pid}.")
173 new_pids = list(range(start_pid, start_pid + len(passages)))

File ~/index_updater.py:129, in IndexUpdater.update_searcher(self, compressed_embs, doclens, curr_pid)
127 codes = compressed_embs.codes[start:end]
128 partitions, _ = self._build_passage_partitions(codes)
--> 129 ivf, ivf_lengths = self._add_pid_to_ivf(partitions, curr_pid, ivf, ivf_lengths)
131 start = end
132 curr_pid += 1

File ~/index_updater.py:418, in IndexUpdater._add_pid_to_ivf(self, partitions, pid, old_ivf, old_ivf_lengths)
415 new_ivf_lengths[-1] += 1
416 partitions_runner += 1
--> 418 assert ivf_runner == len(old_ivf)
419 assert sum(new_ivf_lengths) == len(new_ivf)
421 return new_ivf, new_ivf_lengths

AssertionError:

Do you know how I can resolve this? Do the batch size or other parameters (e.g., document length, embedding size) need to be adjusted when adding new documents to an existing index? Are there specific constraints to consider when initializing the IndexUpdater class to add new documents?
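One thing I noticed while writing this up: my two Run contexts use different experiment names ('first_batch' vs. 'notebook'), so the Searcher might be resolving a different index directory than the one the Indexer wrote to. I don't know whether that is related, but a version with a single consistent context would look like this (a minimal sketch; 'shared_experiment' is a placeholder name, everything else reuses the variables from above):

    # Use one experiment name for both the initial indexing and every update,
    # so Indexer and Searcher resolve the same index directory.
    with Run().context(RunConfig(nranks=1, experiment='shared_experiment')):
        searcher = Searcher(index=index_name)
        index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher,
                                     checkpoint=checkpoint)
        index_updater.add(batch_data)
        index_updater.persist_to_disk()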

My goal is to split the data into groups and index one group at a time into the same index. Is there another way to batch and index data?
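The fallback I know works is to concatenate everything and build the index in a single call, since the initial indexing succeeds, but that is what the batching was meant to avoid. Roughly (a sketch reusing the names from above; all_batches is a hypothetical list of per-batch passage lists):

    # Single-shot indexing over the concatenated batches.
    all_passages = [p for batch in all_batches for p in batch]
    with Run().context(RunConfig(nranks=1, experiment='first_batch')):
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)
        indexer = Indexer(checkpoint=checkpoint, config=config)
        indexer.index(name=index_name, collection=all_passages, overwrite=True)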
