Large Input causing OutputBuffer-BufferOverflowException #69
Comments
We cannot do anything before knowing exactly where the exception comes from and what causes it. Once we know that, we can make a decision. It could just be a bug, right?
Sure, I will test some more and update this thread. I thought it was a known limitation (that we are not splitting big chunks of data that do not fit in a single buffer into smaller ones).
I think one limitation is that the batch size must be larger than a single tuple. This is problematic for dynamically sized tuples (only when a single write can overflow the buffer), but it should be fine for statically sized ones, since that is something we can check statically and raise an error for.
The problem exists in OutputBuffer.write: a tuple is written into a buffer (max size = 2 * batch size), and once the buffer is "full" (size > batch size) all tuples in the buffer are processed. If a record is too long, it does not fit into the buffer (which may already be partially full from an earlier, smaller record). From this I think it is fair to say that the batch size set in the WorkerConfig must be at least as large as the longest record; it doesn't make sense to try to process less than a single record at a time. We could split a long record across multiple batches and put it back together on the other side if necessary (presumably when records are variably sized and only a small fraction are "big"), but that doesn't seem to fit the model as well.
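To make the failure mode concrete, here is a minimal Java sketch of the write path as described in the comment above. The class, field, and method names (OutputBufferSketch, batchSize, flush) are illustrative, not the project's actual code; the point is only that a fixed-size buffer fails with BufferOverflowException as soon as one record is larger than the space left in it.

```java
import java.nio.ByteBuffer;

// Hypothetical simplification of the write/flush path described above.
public class OutputBufferSketch {

    private final int batchSize;
    private final ByteBuffer buffer;

    public OutputBufferSketch(int batchSize) {
        this.batchSize = batchSize;
        // Backing buffer is twice the batch size, as described in the comment above.
        this.buffer = ByteBuffer.allocate(2 * batchSize);
    }

    public void write(byte[] serializedTuple) {
        // If the buffer already holds part of a batch and the incoming record is
        // longer than the remaining space, put() throws BufferOverflowException.
        buffer.put(serializedTuple);
        // Once more than one batch worth of bytes is buffered, everything is flushed.
        if (buffer.position() > batchSize) {
            flush();
        }
    }

    private void flush() {
        // Stand-in for handing the batch to the network layer.
        buffer.clear();
    }
}
```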
This is exactly what should happen. The purpose of batching at this level is to decrease the communication/processing ratio per record; for that reason a batch should naturally contain more than one record.
This would require a revamp of the current design, which is only justified if we have a use case.
Just noticed today a rather interesting issue:
I was testing the simple File source example reading String lines from a file.
Each of these lines could be rather big (hundreds of bytes; see below). When I used the MarkerSink (meaning the bytes would not have to go over the network), the example worked just fine.
On the other hand, when I tried to plug in a real Sink, I faced the exception below.
I think the exception comes from the OutputBuffer class, where we allocate a static buffer size.
My question here is: how do we want to handle this case? Split the input into smaller chunks that fit into the buffers? Dynamically extend the buffers? I know some people have simply increased the batch size to bypass it, but that is not really a solution, is it?
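For reference, one minimal direction along the lines discussed above, sketched under the same assumptions as the earlier snippet (class and method names are illustrative, not the project's actual API): check the record size up front and flush early, so an oversized record produces a clear configuration error pointing at the batch size instead of a low-level BufferOverflowException.

```java
import java.nio.ByteBuffer;

// Hypothetical variant of the earlier sketch, not existing project code:
// fail fast with a clear message when a record can never fit, and flush
// early when it would not fit in the space that is currently left.
public class GuardedOutputBufferSketch {

    private final int batchSize;
    private final ByteBuffer buffer;

    public GuardedOutputBufferSketch(int batchSize) {
        this.batchSize = batchSize;
        this.buffer = ByteBuffer.allocate(2 * batchSize);
    }

    public void write(byte[] serializedTuple) {
        if (serializedTuple.length > buffer.capacity()) {
            // Surface a configuration error instead of a raw BufferOverflowException.
            throw new IllegalArgumentException("Record of " + serializedTuple.length
                    + " bytes exceeds the " + buffer.capacity() + "-byte output buffer;"
                    + " increase the batch size so it is at least as large as the largest record.");
        }
        if (buffer.remaining() < serializedTuple.length) {
            flush(); // drain what is already buffered so the record fits
        }
        buffer.put(serializedTuple);
        if (buffer.position() > batchSize) {
            flush();
        }
    }

    private void flush() {
        // Stand-in for handing the batch to the network layer.
        buffer.clear();
    }
}
```

Splitting a long record across multiple batches and reassembling it on the receiving side would be the alternative, but as noted in the discussion above, that would require a larger redesign.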