Large Input causing OutputBuffer-BufferOverflowException #69
Comments
We cannot do anything before knowing exactly where the exception comes from and what causes it. Once we know that, we can make a decision. It could just be a bug, right?
Sure, I will test some more and update this thread. I thought it was a known limitation (that we are not splitting big chunks of data that do not fit in a single buffer into smaller ones).
I think one limitation is that the batch size must be larger than a single tuple. This is problematic for dynamically sized tuples (only when a single write can overflow the buffer), but it should be fine for statically sized ones, since that is something we can check statically and raise an error for.
The problem exists in OutputBuffer.write: a tuple is written into a buffer (max size = 2 * batch size), and once the buffer is "full" (size > batch size) all tuples in the buffer are processed. If a record is too long, it does not fit into the buffer (which may already be partially full from an earlier, smaller record). From this I think it is fair to say that the batch size set in the WorkerConfig must be at least as large as the longest record; it doesn't make sense to try to process less than a single record at a time. We could split a long record across multiple batches and put it back together on the other side if necessary (presumably when records are variably sized and only a small fraction are "big"), but that doesn't seem to fit the model as well.
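To make the failure mode concrete, here is a minimal Java sketch of the write path as described in the comment above. The class, field, and method names (OutputBufferSketch, batchSize, flush) are illustrative, not the project's actual code; the point is only that a fixed-size buffer fails with BufferOverflowException as soon as one record is larger than the space left in it.

```java
import java.nio.ByteBuffer;

// Hypothetical simplification of the write/flush path described above.
public class OutputBufferSketch {

    private final int batchSize;
    private final ByteBuffer buffer;

    public OutputBufferSketch(int batchSize) {
        this.batchSize = batchSize;
        // Backing buffer is twice the batch size, as described in the comment above.
        this.buffer = ByteBuffer.allocate(2 * batchSize);
    }

    public void write(byte[] serializedTuple) {
        // If the buffer already holds part of a batch and the incoming record is
        // longer than the remaining space, put() throws BufferOverflowException.
        buffer.put(serializedTuple);
        // Once more than one batch worth of bytes is buffered, everything is flushed.
        if (buffer.position() > batchSize) {
            flush();
        }
    }

    private void flush() {
        // Stand-in for handing the batch to the network layer.
        buffer.clear();
    }
}
```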
This is exactly what should happen. The purpose of batching at this level is to decrease the communication/processing ratio per record; for that reason a batch should naturally contain more than one record.
This would require a revamp of the current design, which is only justified if we have a use case.
Just noticed today a rather interesting issue:
I was testing the simple File source example reading String lines from a file.
Each of these lines could be rather big (hundreds of bytes; see below). When I used the MarkerSink (meaning the bytes would not have to go over the network), the example worked just fine.
On the other hand, when I tried to plug in a real Sink, I faced the exception below.
I think the exception comes from the OutputBuffer class, where we allocate a static buffer size.
My question here is: how do we want to handle this case? Split the input into smaller chunks that fit into the buffers? Dynamically extend the buffers? I know some people have simply increased the batch size to bypass it, but that is not really a solution, is it?
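For reference, one minimal direction along the lines discussed above, sketched under the same assumptions as the earlier snippet (class and method names are illustrative, not the project's actual API): check the record size up front and flush early, so an oversized record produces a clear configuration error pointing at the batch size instead of a low-level BufferOverflowException.

```java
import java.nio.ByteBuffer;

// Hypothetical variant of the earlier sketch, not existing project code:
// fail fast with a clear message when a record can never fit, and flush
// early when it would not fit in the space that is currently left.
public class GuardedOutputBufferSketch {

    private final int batchSize;
    private final ByteBuffer buffer;

    public GuardedOutputBufferSketch(int batchSize) {
        this.batchSize = batchSize;
        this.buffer = ByteBuffer.allocate(2 * batchSize);
    }

    public void write(byte[] serializedTuple) {
        if (serializedTuple.length > buffer.capacity()) {
            // Surface a configuration error instead of a raw BufferOverflowException.
            throw new IllegalArgumentException("Record of " + serializedTuple.length
                    + " bytes exceeds the " + buffer.capacity() + "-byte output buffer;"
                    + " increase the batch size so it is at least as large as the largest record.");
        }
        if (buffer.remaining() < serializedTuple.length) {
            flush(); // drain what is already buffered so the record fits
        }
        buffer.put(serializedTuple);
        if (buffer.position() > batchSize) {
            flush();
        }
    }

    private void flush() {
        // Stand-in for handing the batch to the network layer.
        buffer.clear();
    }
}
```

Splitting a long record across multiple batches and reassembling it on the receiving side would be the alternative, but as noted in the discussion above, that would require a larger redesign.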