Hello
I am using this library for the first time and running into an issue. I am trying to split a large dataset into smaller batches and index them one batch at a time into a single index. I get an AssertionError when adding new documents to an existing index using the IndexUpdater class. I can successfully create the initial index, but the error occurs when I add more data with IndexUpdater.
Here is a simplified version of the code I am using:
# Assumed imports (not shown in my snippet):
from colbert import Indexer, Searcher, IndexUpdater
from colbert.infra import Run, RunConfig, ColBERTConfig

if first_batch:
    # Build the initial index from the first batch
    with Run().context(RunConfig(nranks=1, experiment='first_batch')):
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)
        indexer = Indexer(checkpoint=checkpoint, config=config)
        indexer.index(name=index_name, collection=batch_data, overwrite=True)
    first_batch = False
else:
    # Add new data in batches - this branch repeats a few times
    with Run().context(RunConfig(experiment='notebook')):
        searcher = Searcher(index=index_name)  # Load the existing index
        index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher, checkpoint=checkpoint)
        index_updater.add(batch_data[:10])  # Add this batch to the index
        index_updater.persist_to_disk()  # Persist the changes to disk
But I get the following error:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 3
1 # Run the function to load and combine data in batches
2 start = time.time()
----> 3 data = embed_and_index('/home/data', batch_size=100)
4 end = time.time()
5 print(f"Time taken: {end - start} seconds")
Cell In[13], line 106, in embed_and_index(parsed_directory, batch_size, limit)
104 index_updater = IndexUpdater(config=ColBERTConfig(), searcher=searcher, checkpoint=checkpoint)
105 print("Index updater created")
--> 106 index_updater.add(batch_data[:10]) # Add batch to the index
107 print("Batch added")
108 index_updater.persist_to_disk() # Persist the changes to disk
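One thing I noticed while simplifying the code: the initial indexing runs under experiment='first_batch' while the update runs under experiment='notebook', so the Searcher may be resolving the index under a different experiments/ root. Below is the variant I would try next, keeping the experiment name consistent and reusing the searcher's config instead of a fresh ColBERTConfig(); I'm not sure whether either change is required:

# Hypothetical variant: same experiment name as at index creation, and the
# searcher's config (which carries doc_maxlen/nbits from the index) rather
# than a default ColBERTConfig()
with Run().context(RunConfig(nranks=1, experiment='first_batch')):
    searcher = Searcher(index=index_name)
    index_updater = IndexUpdater(config=searcher.config, searcher=searcher, checkpoint=checkpoint)
    index_updater.add(batch_data)  # add the whole batch
    index_updater.persist_to_disk()  # write the updated index to disk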
Do you know how I can resolve this? Do the batch size or other parameters (e.g., document length, embedding size) need to be adjusted when adding new documents to an existing index? Are there specific constraints to consider when initializing the IndexUpdater class to add new documents?
My goal is to batch the data into groups and index them one group at a time into the same index. Is there another way to batch and index the data?
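For example, if IndexUpdater is not intended for this, would it be better to accumulate all the batches in memory and call Indexer.index once over the combined collection? A rough sketch of what I mean (load_batches is a stand-in for my own loading code):

all_passages = []
for batch_data in load_batches('/home/data', batch_size=100):  # my own loader
    all_passages.extend(batch_data)

# Index the combined collection in a single pass
with Run().context(RunConfig(nranks=1, experiment='full_index')):
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits)
    indexer = Indexer(checkpoint=checkpoint, config=config)
    indexer.index(name=index_name, collection=all_passages, overwrite=True)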