OAK-11238 - indexing-job: de-duplicate entries when writing to disk the intermediate sorted files with FFS contents. #1835
The pipelined strategy writes several batches of sorted node state entries as intermediate files before the final merge phase.
In some cases, these batches may contain duplicate entries (for example, after a disconnection from Mongo, or during parallel download when the streams cross). Since the batches are sorted before being written to disk, de-duplicating them is very cheap: before writing an entry, compare it with the previous one.
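A minimal sketch of the idea, not the actual Oak implementation: entries are simplified to plain strings and the target is a `BufferedWriter`, so the types and names here are illustrative only.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.List;

final class SortedBatchWriter {

    // Write a sorted batch to an intermediate file, skipping consecutive duplicates.
    static void writeDeduplicated(List<String> sortedEntries, BufferedWriter out) throws IOException {
        String previous = null;
        for (String entry : sortedEntries) {
            // The batch is already sorted, so any duplicate is adjacent to its
            // previous occurrence; a single comparison is enough to skip it.
            if (entry.equals(previous)) {
                continue;
            }
            out.write(entry);
            out.newLine();
            previous = entry;
        }
    }
}
```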
This only avoids duplicates within a single batch; duplicates may still exist across intermediate sorted files. But it may reduce the size of the intermediate files and therefore speed up the merge sort. One case where duplicates may appear in a single batch is when the indexing job disconnects from Mongo and has to retry the download, as it may then re-download some entries that were already downloaded. The gains in the common case should be very minor, but since the change is simple and cheap, I believe it is worth implementing.