OAK-11238 - indexing-job: de-duplicate entries when writing to disk the intermediate sorted files with FFS contents. #1835
The pipelined strategy writes several batches of sorted node state entries as intermediate files before the final merge phase.
In some cases, these batches may contain duplicate entries (for example, after a disconnection from Mongo, or during parallel download when the streams cross). Since the batches are sorted before being written to disk, de-duplicating them is very cheap: before writing an entry, compare it with the previous one.
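A minimal sketch of the idea, not the actual Oak implementation: entries are simplified to plain strings and the target is a `BufferedWriter`, so the types and names here are illustrative only.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.List;

final class SortedBatchWriter {

    // Write a sorted batch to an intermediate file, skipping consecutive duplicates.
    static void writeDeduplicated(List<String> sortedEntries, BufferedWriter out) throws IOException {
        String previous = null;
        for (String entry : sortedEntries) {
            // The batch is already sorted, so any duplicate is adjacent to its
            // previous occurrence; a single comparison is enough to skip it.
            if (entry.equals(previous)) {
                continue;
            }
            out.write(entry);
            out.newLine();
            previous = entry;
        }
    }
}
```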
This only avoids duplicates within a single batch; duplicates may still exist across intermediate sorted files. But it may reduce the size of the intermediate files and therefore speed up the merge sort. One case where duplicates may appear in a single batch is when the indexing job disconnects from Mongo and has to retry the download, as it may then re-download some entries that were already downloaded. The gains in the common case should be very minor, but since the change is simple and cheap, I believe it is worth implementing.