perf: spawn sync parquet write on blocking runtime #2806

Closed


alexwilcoxson-rel
Contributor

Description

In our service we have tried to work around the fact that the underlying Delta/PartitionWriter runs the synchronous ArrowWriter write inside an async method, which blocks a runtime thread. This PR lets you opt in to supplying a runtime to the DeltaWriter, on which it will spawn blocking tasks.

I intend to clean up some of the code by implementing some of the methods on the WriterState enum and to write some tests, but I want to get initial feedback first.

Using this in our service with its own runtime has simplified our code and kept the main runtime free for IO and incoming requests.
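
To illustrate the idea outside of delta-rs, here is a minimal sketch using only the standard library. A dedicated thread stands in for the separate runtime, and the hypothetical `encode` function stands in for the synchronous ArrowWriter write. `OffloadedWriter` and its methods are made up for this example and are not part of the deltalake API or of this PR.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a synchronous, CPU-heavy encoder such as ArrowWriter.
fn encode(batch: &[u32], out: &mut Vec<u8>) {
    for v in batch {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// Owns a dedicated thread that performs all synchronous encoding,
/// so the caller's thread never runs the blocking work itself.
pub struct OffloadedWriter {
    tx: Sender<Vec<u32>>,
    worker: JoinHandle<Vec<u8>>,
}

impl OffloadedWriter {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel::<Vec<u32>>();
        let worker = thread::spawn(move || {
            let mut out = Vec::new();
            // The loop ends once the sender has been dropped.
            for batch in rx {
                encode(&batch, &mut out);
            }
            out
        });
        Self { tx, worker }
    }

    /// Queue a batch and return without waiting for it to be encoded.
    pub fn write(&self, batch: Vec<u32>) -> Result<(), String> {
        self.tx.send(batch).map_err(|e| e.to_string())
    }

    /// Signal end of input, then wait for the worker and collect the encoded bytes.
    pub fn close(self) -> Result<Vec<u8>, String> {
        drop(self.tx);
        self.worker
            .join()
            .map_err(|_| "encoder thread panicked".to_string())
    }
}
```

In an async service the same shape would typically be expressed with the runtime's blocking-task facility rather than a raw thread; the point of the sketch is only that `write` hands the batch off instead of encoding it on the calling thread.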

Related Issue(s)

Documentation

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Aug 20, 2024
@ion-elgreco
Collaborator

Do you still need this now that you can wrap all IO tasks in their own runtime?

@rtyler
Member

rtyler commented Aug 20, 2024

In case you had missed it, @ion-elgreco is referring to the work in #2789

@alexwilcoxson-rel
Contributor Author

> Do you still need this now that you can wrap all IO tasks in their own runtime?

I think so because of the following:

  • We have a stream of incoming batches that we pipe into multiple calls to DeltaWriter::write, and then we finish with DeltaWriter::close.
  • We potentially have many of those streams running at once.
  • I could spawn the write and close calls as blocking tasks on my end, but then they would occupy threads in the blocking pool while they await any IO.
  • I could spawn them (not blocking) on a runtime, which is better, but the blocking code eats up a lot of time and causes task scheduling to back up under load.
  • With this solution (along with feat: configurable IO runtime #2789), pushing the blocking down to where it actually happens makes it pretty flexible for Rust users to structure their application around scheduling CPU tasks and IO.
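
The trade-off in the list above can be sketched with the standard library alone. In this hypothetical example a small fixed-size `CpuPool` is reserved for the CPU-bound encoding step, and many streams share it; only the encoding closure occupies a pool thread, so nothing sits in the pool waiting on IO. `CpuPool`, `run`, and `encode_streams` are names invented for the sketch, and the `batch.len() * 4` computation is a placeholder for real Parquet encoding.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Small fixed-size pool reserved for CPU-bound work (hypothetical).
pub struct CpuPool {
    tx: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl CpuPool {
    pub fn new(threads: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..threads)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // The lock is held only while receiving, not while the job runs.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // sender dropped: shut down
                    }
                })
            })
            .collect();
        Self { tx: Some(tx), workers }
    }

    /// Run `f` on the pool; its result arrives on the returned channel.
    pub fn run<T, F>(&self, f: F) -> Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (res_tx, res_rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            let _ = res_tx.send(f());
        });
        self.tx
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("workers are alive");
        res_rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Closing the channel lets every worker exit its loop.
        drop(self.tx.take());
        for w in self.workers.drain(..) {
            let _ = w.join();
        }
    }
}

/// Submit one encoding job per stream, then collect the encoded sizes in order.
pub fn encode_streams(pool: &CpuPool, streams: Vec<Vec<u32>>) -> Vec<usize> {
    let pending: Vec<Receiver<usize>> = streams
        .into_iter()
        .map(|batch| pool.run(move || batch.len() * 4)) // placeholder for encoding
        .collect();
    pending.into_iter().map(|rx| rx.recv().unwrap()).collect()
}
```

Because all jobs are submitted before any result is awaited, the streams are encoded concurrently up to the pool size, while the submitting thread stays free until it chooses to collect results.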

I do think this should perhaps be a temporary measure until the writer can be refactored to use the upcoming Parquet writer improvements for AsyncWrites.

@alexwilcoxson-rel
Contributor Author

Resolved using DeltaIOStorageBackend and a separate runtime for the write and close calls.
