
Conversation

ashish159357

Problem

ByteBlockPool uses 32 KB buffers and tracks the pool-wide position with an int offset (byteOffset). When more than 65,535 buffers are allocated, the byteOffset calculation (byteOffset = bufferUpto * BYTE_BLOCK_SIZE) overflows, causing an ArithmeticException while indexing documents with very large numbers of tokens.

Root Cause

  • Each buffer is 32KB (BYTE_BLOCK_SIZE = 32768)
  • Maximum safe buffer count: Integer.MAX_VALUE / BYTE_BLOCK_SIZE = 65535
  • When bufferUpto exceeds 65,535, the computed offset exceeds Integer.MAX_VALUE and the overflow-checked arithmetic in nextBuffer throws ArithmeticException
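The arithmetic can be reproduced outside Lucene. A minimal standalone sketch (the constant mirrors ByteBlockPool.BYTE_BLOCK_SIZE, but the class itself is illustrative, not Lucene code):

```java
public class OverflowDemo {
    static final int BYTE_BLOCK_SIZE = 1 << 15; // 32768, same value as ByteBlockPool.BYTE_BLOCK_SIZE

    public static void main(String[] args) {
        // The last safe offset: 65,535 buffers of 32 KB still fits in an int.
        int safe = 65_535 * BYTE_BLOCK_SIZE;
        System.out.println(safe); // 2147450880

        // One more 32 KB step crosses Integer.MAX_VALUE (2^31 - 1);
        // plain int arithmetic silently wraps to a negative value:
        System.out.println(65_536 * BYTE_BLOCK_SIZE);

        // Overflow-checked arithmetic, as seen in the stack trace
        // (Math.addExact inside ByteBlockPool.nextBuffer), throws instead of wrapping:
        try {
            Math.addExact(safe, BYTE_BLOCK_SIZE);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // "integer overflow"
        }
    }
}
```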

Solution
Implement proactive DWPT flushing when buffer count approaches the limit:

  1. Detection: Added isApproachingBufferLimit() method to detect when buffer count approaches the overflow threshold
  2. Propagation: Buffer limit status flows from ByteBlockPool → IndexingChain → DocumentsWriterPerThread → DocumentsWriterFlushControl
  3. Prevention: Force flush DWPT before overflow occurs, similar to existing RAM-based flushing.
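A sketch of the detection step (step 1) under these assumptions: the method name isApproachingBufferLimit() and the 65,000 threshold come from this PR's description, while the surrounding class is a simplified stand-in rather than Lucene's actual ByteBlockPool:

```java
public class BufferLimitDetectionSketch {
    static final int BYTE_BLOCK_SIZE = 1 << 15;                         // 32 KB per buffer
    static final int MAX_BUFFERS = Integer.MAX_VALUE / BYTE_BLOCK_SIZE; // 65535
    static final int FLUSH_THRESHOLD = 65_000;                          // safety margin (from the PR)

    int bufferUpto = -1; // index of the current buffer; -1 before the first allocation

    void nextBuffer() {
        bufferUpto++; // the real ByteBlockPool also allocates the block and updates byteOffset here
    }

    // Step 1 (Detection): true once the buffer count nears the overflow threshold,
    // signalling the flush-control layer (steps 2 and 3) to flush this DWPT early.
    boolean isApproachingBufferLimit() {
        return bufferUpto >= FLUSH_THRESHOLD;
    }
}
```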

Key Changes

  • Added buffer limit detection in ByteBlockPool
  • Integrated check into DocumentsWriterFlushControl.doAfterDocument()
  • Uses a threshold of 65,000 buffers to provide a safety margin before the actual limit of 65,535
  • Maintains existing performance characteristics while preventing crashes
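A hedged sketch of how the new check could sit alongside the existing RAM-based trigger after each document; the interface and method names below are illustrative stand-ins, not Lucene's real DocumentsWriterFlushControl API:

```java
public class FlushControlSketch {
    /** Minimal stand-in for the per-thread writer state visible to flush control. */
    interface Dwpt {
        long ramBytesUsed();
        boolean isApproachingBufferLimit();
    }

    private final long hardRamLimitBytes;

    FlushControlSketch(long hardRamLimitBytes) {
        this.hardRamLimitBytes = hardRamLimitBytes;
    }

    // Mirrors the idea behind DocumentsWriterFlushControl.doAfterDocument():
    // after each document, decide whether this DWPT must be flushed, now
    // considering the buffer-count limit next to the RAM limit.
    boolean shouldFlushAfterDocument(Dwpt dwpt) {
        return dwpt.ramBytesUsed() >= hardRamLimitBytes  // existing RAM-based trigger
            || dwpt.isApproachingBufferLimit();          // new buffer-limit trigger
    }
}
```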

@ashish159357 changed the title from "Fix ByteBlockPool integer overflow by implementing buffer limit detection #15152" to "Fix ByteBlockPool integer overflow by implementing buffer limit detection" on Oct 12, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@rmuir

rmuir commented Oct 12, 2025

When more than 65,535 buffers are allocated, integer overflow occurs in the byteOffset calculation (byteOffset = bufferUpto * BYTE_BLOCK_SIZE), causing ArithmeticException during indexing of documents with large numbers of tokens.

But this is not supported: the limits on IndexWriter are 2GB

@msokolov

maybe AI-generated? The bullet point formatting looks characteristic. Not that that is banned or anything, but it might need additional scrutiny

@bharath-techie

Hi @rmuir @msokolov,
I have yet to review this PR, but I see your point: the hard limit check should be enough, since it accounts for the ByteBlockPool as well.

For context, I originally created issue #15152, where an OpenSearch user encountered the ByteBlockPool overflow during translog recovery.

 message [shard failure, reason [index id[3458764570588151359] origin[LOCAL_TRANSLOG_RECOVERY] seq#[53664468]]], failure [NotSerializableExceptionWrapper[arithmetic_exception: integer overflow]], markAsStale [true]]
NotSerializableExceptionWrapper[arithmetic_exception: integer overflow]
    at java.lang.Math.addExact(Math.java:883)
    at org.apache.lucene.util.ByteBlockPool.nextBuffer(ByteBlockPool.java:199)
    at org.apache.lucene.index.ByteSlicePool.allocKnownSizeSlice(ByteSlicePool.java:118)
    at org.apache.lucene.index.ByteSlicePool.allocSlice(ByteSlicePool.java:98)
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:226)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:266)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:197)
    at org.apache.lucene.index.TermsHashPerField.positionStreamSlice(TermsHashPerField.java:214)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:202)
    at org.apache.lucene.index.IndexingChain$PerField.invertTokenStream(IndexingChain.java:1287)
    at org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1183)
    at org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:731)
    at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:609)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:263)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1558)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1516)
    at org.opensearch.index.engine.InternalEngine.addStaleDocs(InternalEngine.java:1291)
    at org.opensearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1210)
    at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1011)
    at org.opensearch.index.shard.IndexShard.index(IndexShard.java:1226)

I think the IndexWriter hard-limit check in FlushControl runs only after DocumentsWriter.updateDocuments completes, so a single call that adds many documents could exceed the limit and hit this exception.

  1. Do we need a buffer below the writer limits to account for the next set of documents?
  2. Do we need to limit the number of docs that can be passed to this method?
