
feat: add a size limit on a record batch #2220

Open
niebayes opened this issue Aug 22, 2023 · 9 comments
Labels
C-feature Category Features good first issue Good for newcomers

Comments

@niebayes
Contributor

What problem does the new feature solve?

Currently, there is no size limit set on a record batch, which has become problematic due to the constraints imposed by the Tonic crate on message sizes for both receiving and sending. Tonic employs a default size limit of 4MB for received messages, defined as the DEFAULT_MAX_RECV_MESSAGE_SIZE constant, and a default maximum send message size of 2^64 - 1 bytes, defined as the DEFAULT_MAX_SEND_MESSAGE_SIZE constant.

To address this issue, it's crucial to set a size limit on a record batch to prevent it from exceeding 4MB.

What does the feature do?

Introduce a size limit on a record batch.

Implementation challenges

  • Set a default maximum size for a record batch, which could be either a fixed constant or a configurable option.
  • Utilize this size limit across the codebase to prevent any operations that could lead to the expansion of a record batch beyond the established threshold.
  • Develop a set of unit tests to verify the functionality of the size limit.
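The steps above could be sketched roughly as follows. This is an illustrative sketch only, not an actual GreptimeDB API: the constant, struct, and method names are all hypothetical, and the 1 MiB default is just a placeholder chosen to sit well below Tonic's 4 MB receive limit.

```rust
/// Hypothetical default cap on the encoded size of a single record batch.
/// Kept well below Tonic's 4 MB DEFAULT_MAX_RECV_MESSAGE_SIZE to leave room
/// for flight-data framing overhead.
const DEFAULT_MAX_BATCH_BYTES: usize = 1024 * 1024; // 1 MiB (placeholder)

#[derive(Debug, Clone, Copy)]
struct BatchSizeLimit {
    max_bytes: usize,
}

impl Default for BatchSizeLimit {
    fn default() -> Self {
        Self { max_bytes: DEFAULT_MAX_BATCH_BYTES }
    }
}

impl BatchSizeLimit {
    /// Returns an error message when `batch_bytes` exceeds the limit.
    fn check(&self, batch_bytes: usize) -> Result<(), String> {
        if batch_bytes > self.max_bytes {
            Err(format!(
                "record batch of {} bytes exceeds limit of {} bytes",
                batch_bytes, self.max_bytes
            ))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let limit = BatchSizeLimit::default();
    assert!(limit.check(512 * 1024).is_ok());
    assert!(limit.check(8 * 1024 * 1024).is_err());
}
```

Call sites that grow a batch (e.g. appending rows) would invoke `check` before extending it, which is the "utilize this size limit across the codebase" point above.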
@niebayes niebayes added good first issue Good for newcomers C-feature Category Features Size: S labels Aug 22, 2023
@niebayes
Contributor Author

Since the unit of transmission is a piece of flight data, which may contain at least one record batch, the size limit should be much more conservative than the default maximum set by Tonic.

@TheWaWaR
Contributor

I'm interested in this feature.

First, a question: is this feature meant for server security, i.e. limiting the gRPC request/response size?

If that's the case, I can see two ways to achieve it:

  • Limit the HTTP/2 frame size (transport level)
  • Limit the gRPC service's encoding/decoding message size (service level)

@niebayes
Contributor Author

@MichaelScofield PTAL

@MichaelScofield
Collaborator

@TheWaWaR Thx for your interest. It's not about security. It's about Tonic's configurable request/response size limit, which should be taken into account.

I think the service level limitation is enough.

@TheWaWaR
Contributor

TheWaWaR commented Sep 20, 2023

By searching the codebase, I found that max_decoding_message_size is already applied to gRPC clients. So if I understand correctly, the remaining tasks should include:

  • Make grpc client max_encoding_message_size configurable
  • Make grpc server max_decoding_message_size/max_encoding_message_size configurable
  • Add tests for both client/server to prove the configuration works
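To make the "configurable" part of the tasks above concrete, a minimal options struct might look like the sketch below. All names here are hypothetical, not GreptimeDB's actual config types; only the Tonic method names mentioned in the comments (`max_decoding_message_size` / `max_encoding_message_size`) and Tonic's 4 MB receive default come from the thread itself.

```rust
/// Hypothetical config struct for gRPC message-size limits.
#[derive(Debug, Clone, Default)]
struct GrpcMessageSizeOptions {
    /// Would be applied via Tonic's `max_decoding_message_size`.
    /// `None` means "keep Tonic's built-in 4 MB default".
    max_recv_message_size: Option<usize>,
    /// Would be applied via Tonic's `max_encoding_message_size`.
    /// `None` means "keep Tonic's built-in default" (effectively unlimited).
    max_send_message_size: Option<usize>,
}

impl GrpcMessageSizeOptions {
    /// Resolve the receive limit, falling back to Tonic's 4 MB default.
    fn recv_limit(&self) -> usize {
        self.max_recv_message_size.unwrap_or(4 * 1024 * 1024)
    }
}

fn main() {
    let defaults = GrpcMessageSizeOptions::default();
    assert_eq!(defaults.recv_limit(), 4 * 1024 * 1024);

    let custom = GrpcMessageSizeOptions {
        max_recv_message_size: Some(16 * 1024 * 1024),
        max_send_message_size: Some(16 * 1024 * 1024),
    };
    assert_eq!(custom.recv_limit(), 16 * 1024 * 1024);
}
```

Both client and server builders would read these options at construction time, which is what makes the limits testable on both sides.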

@MichaelScofield
Collaborator

@TheWaWaR You are right.

@MichaelScofield
Collaborator

@TheWaWaR thx, I've merged your PR. However, to close this issue, I think it's important to set a limit on RecordBatches, too. The RecordBatches are popped from DataFusion, so it requires additional effort to investigate how to restrict their size. Are you willing to continue to do that?

@TheWaWaR
Contributor

TheWaWaR commented Sep 22, 2023

@MichaelScofield

Are you willing to continue to do that?

Yes, why not.

So, I have some new questions:

First, how do we define the size of a RecordBatches? There is no Serialize/Deserialize implementation for RecordBatches, but HttpRecordsOutput does have one. And serialization can target multiple formats such as json/msgpack/bincode. So this first question should be properly answered.

Second, what is the intention behind limiting the size of a RecordBatches? Wherever RecordBatches is serialized/deserialized, it is currently only transferred over gRPC, so why isn't limiting the gRPC message size enough? Are there other places where RecordBatches is used and its size should be limited (other than gRPC)? Or, to ask the other way around: in what cases does an unlimited RecordBatches size become a problem?

@niebayes niebayes assigned niebayes and TheWaWaR and unassigned niebayes Sep 22, 2023
@MichaelScofield
Collaborator

@TheWaWaR sorry for the late reply. The size of a RecordBatch is indeed hard to estimate. The best guess might be to sum up its vectors' memory_size.
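The "sum up the vectors' memory_size" estimate could be sketched as below. `Column` here is a stand-in for a real vector type; in GreptimeDB the analogous call would be each vector's memory_size method, so treat the struct names as illustrative.

```rust
/// Stand-in for a column vector; real vector types carry typed buffers.
struct Column {
    values: Vec<i64>,
}

impl Column {
    /// Bytes occupied by this column's value buffer.
    fn memory_size(&self) -> usize {
        self.values.len() * std::mem::size_of::<i64>()
    }
}

struct RecordBatch {
    columns: Vec<Column>,
}

impl RecordBatch {
    /// Estimated in-memory size: the sum of all column buffers. This ignores
    /// per-array metadata (validity bitmaps, offsets), so it is a lower
    /// bound rather than an exact figure.
    fn estimated_size(&self) -> usize {
        self.columns.iter().map(Column::memory_size).sum()
    }
}

fn main() {
    let batch = RecordBatch {
        columns: vec![
            Column { values: vec![1, 2, 3, 4] },
            Column { values: vec![5, 6, 7, 8] },
        ],
    };
    // Two columns of four i64 values: 2 * 4 * 8 = 64 bytes.
    assert_eq!(batch.estimated_size(), 64);
}
```

Note that this estimates the in-memory size, which can differ noticeably from the encoded protobuf size that Tonic's limit actually applies to.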

As to the second question, our gRPC interface does have a size limit now. However, what if we feed a huge RecordBatch to the gRPC interface? Will it segment the input internally to fit the size limit? A quick googling suggests that Tonic does not have this type of feature. So I guess it would be a problem if the underlying query engine pops a RecordBatch larger than Tonic's size limit: the request would simply fail.

That said, limiting the size of a RecordBatch (or segmenting it) is hard. I think a more proper place to do it is in FlightRecordBatchStream: when the encoded flight data exceeds the gRPC size limit, segment it. This way we avoid calculating the size of RecordBatch, and since FlightData is a plain protobuf message, segmenting it should be easier.
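The core of the segmentation idea is just slicing an already-encoded payload into pieces that each fit under the limit, as sketched below with std types only. For the real thing, the arrow-flight crate's FlightDataEncoderBuilder exposes a max-flight-data-size knob intended for exactly this purpose (worth verifying against the crate version in use); this sketch only shows the slicing logic.

```rust
/// Split an encoded payload into chunks that each fit under `max_chunk`
/// bytes, so every chunk can be sent as its own gRPC message.
fn segment(payload: &[u8], max_chunk: usize) -> Vec<Vec<u8>> {
    assert!(max_chunk > 0, "chunk size must be positive");
    payload.chunks(max_chunk).map(|c| c.to_vec()).collect()
}

fn main() {
    // A 10-byte payload with a 4-byte limit yields chunks of 4, 4, and 2.
    let payload = vec![0u8; 10];
    let chunks = segment(&payload, 4);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].len(), 4);
    assert_eq!(chunks[2].len(), 2);
}
```

Segmenting at the FlightData layer also sidesteps the RecordBatch size-estimation question entirely, since the limit is checked against bytes that are already encoded.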
