Add asynchronous prefetch for DirectIO directory #15224

benwtrent · 2025-09-23T21:31:36Z

This adds prefetching to DirectIO. The idea is pretty simple:

- Configure a number of "prefetch buffers" that are the same size as the DirectIO buffer.
- Calling prefetch starts a prefetch virtual thread to fill an available buffer.
- On read, DirectIO attempts to refill from any prefetched buffer that matches the position before attempting direct IO itself.

When doing many prefetches and handling things in batches, this can significantly improve throughput.
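To make the three steps concrete, here is a minimal sketch of the idea, not the code in this PR. The class `PrefetchSketch` and its methods are hypothetical names, it uses an ordinary `FileChannel` rather than one opened for direct IO (so no alignment handling), it allocates a fresh buffer per fill instead of reusing a fixed pool, and it assumes a single consumer thread, as with a typical index input. The executor is passed in; the PR itself uses a virtual-thread-per-task executor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/** Hypothetical sketch of block prefetching; not the PR's actual implementation. */
final class PrefetchSketch {
  private final FileChannel channel;
  private final int bufferSize;
  private final int maxPrefetches;
  private final ExecutorService executor;
  // Block-aligned file position -> in-flight or completed asynchronous fill.
  private final Map<Long, CompletableFuture<ByteBuffer>> pending = new HashMap<>();

  PrefetchSketch(FileChannel channel, int bufferSize, int maxPrefetches, ExecutorService executor) {
    this.channel = channel;
    this.bufferSize = bufferSize;
    this.maxPrefetches = maxPrefetches;
    this.executor = executor;
  }

  /** Starts an asynchronous fill of the block containing pos; returns false if no slot is free. */
  boolean prefetch(long pos) {
    long blockStart = pos - (pos % bufferSize);
    if (pending.containsKey(blockStart)) {
      return true; // already requested
    }
    if (pending.size() >= maxPrefetches) {
      return false; // all prefetch slots busy, so the hint is dropped
    }
    pending.put(blockStart, CompletableFuture.supplyAsync(() -> fill(blockStart), executor));
    return true;
  }

  /** Returns the block containing pos, using a prefetched buffer when one matches. */
  ByteBuffer readBlock(long pos) {
    long blockStart = pos - (pos % bufferSize);
    CompletableFuture<ByteBuffer> prefetched = pending.remove(blockStart);
    // join() waits if the fill is still in flight; otherwise fall back to a synchronous read.
    return prefetched != null ? prefetched.join() : fill(blockStart);
  }

  private ByteBuffer fill(long blockStart) {
    ByteBuffer buf = ByteBuffer.allocate(bufferSize);
    try {
      while (buf.hasRemaining()) {
        // Positional read: does not touch the channel's own position.
        int n = channel.read(buf, blockStart + buf.position());
        if (n < 0) {
          break; // end of file, so the last block may be short
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    buf.flip();
    return buf;
  }
}
```

A caller that knows its next few read positions would call `prefetch` for each of them up front and then consume them with `readBlock`, which is the batched pattern described above.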

github-actions · 2025-09-23T21:32:30Z

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

mccullocht · 2025-10-01T17:48:35Z

lucene/misc/src/java/org/apache/lucene/misc/store/DirectIODirectory.java

+    private final int prefetchBytesSize;
+    private final Deque<Long> pendingPrefetches = new ArrayDeque<>();
+    private final FileChannel channel;
+    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();


Is the executor something you would want to share within a directory or potentially even across directories? I can't find any documentation that indicates that this pattern would be a problem.
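One way to picture the sharing option raised here is to inject the executor rather than create it per input. The sketch below is only an illustration under that assumption: `SharedExecutorDirectory` is a hypothetical stand-in, not a Lucene class, and its `prefetch` submits a placeholder task where a real one would fill a buffer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a directory that does not own its prefetch threads. */
final class SharedExecutorDirectory {
  private final String name;
  private final ExecutorService prefetchExecutor; // injected, possibly shared by many directories

  SharedExecutorDirectory(String name, ExecutorService prefetchExecutor) {
    this.name = name;
    this.prefetchExecutor = prefetchExecutor;
  }

  /** Submits a placeholder prefetch task; a real task would read into a buffer. */
  Future<String> prefetch(long pos) {
    return prefetchExecutor.submit(() -> name + "@" + pos);
  }
}
```

With this shape, whoever creates the executor also decides when to shut it down, so closing one directory does not stop prefetches for another.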

dungba88 · 2025-10-16T06:48:59Z

Do you have any numbers on the throughput change?

mikemccand · 2025-10-16T11:58:11Z

This is neat -- if Lucene implements enough top-down prefetch hinting, it might eventually be that DirectIO, alone, is sufficient for good query latency/throughput? I.e. we could stop entirely relying on OS to do its prefetching/caching (buffer cache), maybe, in very cold indices?

Isn't DirectIODirectory today only inserting itself for merge context reading & writing?

benwtrent · 2025-10-16T16:37:13Z

> Isn't DirectIODirectory today only inserting itself for merge context reading & writing?

Correct, it's only used in certain scenarios. We are experimenting with using it in more areas (e.g. vector rescoring, to keep from polluting the off-heap cache with rescoring vectors).

> it might eventually be that DirectIO, alone, is sufficient for good query latency/throughput?

It's not quite there yet. I have seen this improve throughput by more than 2x, though, depending on the read patterns. MMAP still has TONS of advantages (direct memory segment access being a HUGE one for vectors).

Virtual threads make this VERY easy, but I am sure there is a lot of headroom for improvements.

benwtrent · 2025-10-16T16:38:05Z

I also think that NIOFS could benefit from a prefetch implementation as well.

mccullocht · 2025-10-16T17:10:21Z

If you used direct io for everything you would want to introduce an explicit disk cache somewhere, even with prefetching I don't think performance would meet expectations for a lot of workloads if most reads resulted in a syscall.
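As a rough illustration of what an explicit cache in front of direct IO could look like, here is a small LRU block cache. Everything in it is an assumption for the sake of the example: `BlockCacheSketch` is a hypothetical name, the `loader` function stands in for whatever performs the direct IO read, and the cache is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical LRU cache of fixed-size blocks, keyed by block index. */
final class BlockCacheSketch {
  private final LongFunction<byte[]> loader; // stands in for a direct IO read of one block
  private final Map<Long, byte[]> blocks;
  private int loads; // number of times the loader was actually invoked

  BlockCacheSketch(final int maxBlocks, LongFunction<byte[]> loader) {
    this.loader = loader;
    // accessOrder = true makes iteration order least-recently-used first.
    this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxBlocks; // evict the least recently used block
      }
    };
  }

  /** Returns the cached block, loading it (one simulated syscall) only on a miss. */
  byte[] get(long blockIndex) {
    byte[] block = blocks.get(blockIndex);
    if (block == null) {
      loads++;
      block = loader.apply(blockIndex);
      blocks.put(blockIndex, block);
    }
    return block;
  }

  int loads() {
    return loads;
  }
}
```

Repeated reads of a hot block then cost a map lookup instead of a syscall, which is the gap the comment above points at.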

benwtrent · 2025-10-16T17:15:05Z

> If you used direct io for everything you would want to introduce an explicit disk cache somewhere, even with prefetching I don't think performance would meet expectations for a lot of workloads if most reads resulted in a syscall.

100% agreed. I think we are a long ways away from making IO super cheap.

Again, MMAP has many benefits still. But virtual threads do make this way easier than it would have been before!

benwtrent added 2 commits September 23, 2025 17:25

Add asynchronous prefetch for DirectIO directory

9bf96d4

iter

5063cb5

github-actions bot added the module:misc label Sep 23, 2025

changes

d745254

github-actions bot added this to the 11.0.0 milestone Sep 23, 2025

mccullocht reviewed Oct 1, 2025

benwtrent mentioned this pull request Oct 3, 2025

Add IO prefetch to HNSW graph crawl? #15286

Open

fixing edge case

7cd24ea
