Indexing stuck at encoding passages #355

shubham526 · 2024-07-09T16:08:00Z

I have a huge collection of 116 million passages. I am trying to create a ColBERT index for them using the indexing code from the README. To manage the size, I am indexing them in batches of 1,000 passages. However, indexing appears to be stuck at the encoding stage:

2024-07-09 09:15:44,053 - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[Jul 09, 09:15:50] [1]           #> Encoding 999 passages..
[Jul 09, 09:15:50] [0]           # of sampled PIDs = 2000        sampled_pids[:3] = [853, 1500, 20]
[Jul 09, 09:15:50] [0]           #> Encoding 1001 passages..

Is this supposed to take this long? I'm not sure whether I am doing something wrong.

shubham526 · 2024-07-09T16:08:18Z

@okhat

okhat · 2024-07-09T16:23:46Z

It shouldn’t get stuck, so if it does, that’s odd. But don’t index in 1,000-passage batches. Index maybe 10 million at a time.
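To put the batch-size suggestion in numbers, here is a minimal sketch of the arithmetic, not code from ColBERT itself. The helper name `shard_ranges` and the idea of addressing passages by a contiguous ID range are assumptions made for illustration. The 116 million total and the two batch sizes come from the thread.

```python
from typing import Iterator, Tuple


def shard_ranges(total: int, shard_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) ranges that together cover [0, total).

    Hypothetical helper: it only computes which passages would go into
    each indexing run; it does not call any indexing library.
    """
    if total < 0 or shard_size <= 0:
        raise ValueError("total must be >= 0 and shard_size must be > 0")
    for start in range(0, total, shard_size):
        # The final shard may be smaller than shard_size.
        yield start, min(start + shard_size, total)


TOTAL_PASSAGES = 116_000_000  # collection size mentioned in the question

small_batches = list(shard_ranges(TOTAL_PASSAGES, 1_000))        # original plan
large_batches = list(shard_ranges(TOTAL_PASSAGES, 10_000_000))   # suggested size

print(f"1,000-passage batches:      {len(small_batches)} indexing runs")
print(f"10,000,000-passage batches: {len(large_batches)} indexing runs")
print(f"last large batch covers:    {large_batches[-1]}")
```

With 1,000-passage batches, any fixed per-run cost (process startup, model loading, sampling) is paid 116,000 times. With 10-million-passage batches it is paid 12 times, and the last batch holds the remaining 6 million passages.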
