-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal to donate Ray SQL to the DataFusion Project (not into the Python subproject) #872
Comments
Ray SQL project: https://github.com/datafusion-contrib/ray-sql |
@franklsf95 fyi |
Things I'm thinking of that should be showcased or discussed:
|
Is the complexity of Ballista coming from cluster management or query planning? |
Ballista and Ray SQL use the same query planning code. The complexity comes from cluster management/communication. |
Since upgrading to a more recent DataFusion version, Ballista is no longer capable of running the TPC-H suite. It works fine in the 0.12.0 release. |
I am curious about the right boundary here. If we go through with this, should the code become a part of datafusion-python, or should the former stay as a single node thing focusing on excellent bindings to core, and we should have a "datafusion-ray" that uses datafusion-python as a library? |
Thanks @ozankabak I'm curious on the boundary, too. |
Indeed. Do you have an idea of how the subset would look like? |
To expand on this, there are several options for adopting the Ray SQL code as part of the DataFusion project.
|
100% yes. If Ray SQL can handle all TPC-H queries with just 1.7k lines of code, it’s sounds like a no-brainer. This actually makes a production-ready distributed DataFusion sound achievable and it should attract more contributors. |
One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray |
Docs/Readme.md probably needs a quick update for anyone joining the catchup. Distributed shuffle using rays object store was implemented in #33.
Get Outlook for Android<https://aka.ms/AAb9ysg>
…________________________________
From: Andrew Lamb ***@***.***>
Sent: Monday, September 16, 2024 6:32:22 PM
To: apache/datafusion-python ***@***.***>
Cc: Jordan Fox ***@***.***>; Comment ***@***.***>
Subject: Re: [apache/datafusion-python] Proposal to Introduce Ray SQL into DataFusion Python (Issue #872)
Caution: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.
One question I have is who would maintain this new code? Ballista I think suffers from a slow decline due to lack of active maintenance and community. We should try to avoid the same thing happening to Ray
—
Reply to this email directly, view it on GitHub<#872 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AHXVCFKTRUINKJ527GAANVDZW52BNAVCNFSM6AAAAABOIBWHDCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJUGI3DMMJUGI>.
You are receiving this because you commented.Message ID: ***@***.***>
|
Here is the PR in ray-sql that added support for distributed shuffle using the Ray object store: |
That's a great point. If the project were to become part of the Apache DataFusion project then I would certainly put time into maintaining it and helping build community around the project. I am not able to contribute in its current location. I have recently been attempting to maintain Ballista by upgrading to more recent versions of DataFusion, but the project is large and complex and the original contributors of much of this code are no longer available to help, so it is challenging. I believe that DataFusion + Ray is an opportunity to start fresh on a solution for distributed DataFusion as a much lighter weight alternative to Ballista and the project is small enough (~40 commits) that it will be easier for new contributors to follow along. These are the initial tasks that I would plan on working on (with the community, hopefully) if we were to move forward with this proposal.
Another possibility is that interested contributors could start maintaining the project in its current location, but I am not sure who would be able to approve the PRs. |
Given that I am +1 on this proposal 🚀 |
I think I have admin rights to the https://github.com/datafusion-contrib org and so can add / remove people from the project as necessary. Just let me know |
Really excited to see this happening! I contributed some code to Ray SQL last year (most notably, making distributed shuffle work using Ray's distributed object store), and can help answer any question regarding Ray (I work with people who build Ray, Ray Data, etc.) On a high level, by building on top of Ray, you get a distributed execution substrate for free. Ray handles managing the cluster, scheduling tasks, managing distributed memory, fault tolerance, to name a few. This would mean basically to use DataFusion as a single-node query execution engine and build all the distributed stuff in Python on top of Ray. If this is in line with the goal of this project (or the DataFusion project), then I think it would be a good way to go. |
I've been thinking some more about where this code should live. I am now leaning towards putting the code into a new standalone My reasoning for this:
|
I have been thinking a lot about this and how it would fit into or use
Regarding the question of where it should sit, I would recommend a separate repo |
Seems like everyone is on the same page, preferring Based on @andygrove's reasonings, sounds a good idea starting a separate Following this way, this proposal is not relevant in datafusion-python anymore. What's the next actionable items? Would love to help~ |
It does sound like we have some concensus. The next steps are to start the IP clearance process. I will make a start on that this weekend. It basically involves filling out a form and starting a vote. Once the vote passes, I can create the new |
@franklsf95 Would you be able to file an ICLA (unless you already have one)? The instructions are at https://www.apache.org/licenses/contributor-agreements.html @austin362667 Before we can start the IP clearance process, we need to update the existing source files to use the ASF header. Is that something you would be able to take on? |
@andygrove Thanks for updating the title and yes, I can help.
|
The PR Add ASF license header |
Just submitted one to [email protected]. |
@austin362667 I created the new repository: https://github.com/apache/datafusion-ray Could you open a draft PR against this repository to add the code including ASF headers? Once this is done, the next steps will be:
Thanks. |
I ran some benchmarks from @timsaucer's repo where he has upgraded to DataFusion 41. Performance is looking good. |
Thanks, @andygrove The result looks super promising! And here is the draft PR apache/datafusion-ray#1 for the Datafusion Ray donation. |
I created an issue in the new repo to track the donation process and for the work that I think needs to happen after that so that we can get to a 0.1.0 release. |
As a formality, as part of the ASF IP clearance process, I must remind active committers that they are responsible for ensuring that a Corporate CLA is recorded if such is required to authorize their contributions under their individual CLA. cc @franklsf95 |
The donation has now been accepted, so I will close this issue. Thank you to everyone who participated in this discussion. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In light of the complexities associated with maintaining and debugging Ballista, the community would like to propose exploring the adoption of Ray SQL within the DataFusion Python project.
Ray SQL, with only 1.7k lines of Rust, is significantly simpler compared to Ballista’s 27k lines. Despite its smaller codebase, Ray SQL has demonstrated the ability to run all TPC-H queries, showcasing its robustness and simplicity. This reduction in complexity could ease maintenance, lower the learning curve for contributors, and provide a functional distributed SQL solution that is much simpler than Ballista because Ray provides so much – there is no need for us to build scheduler and executor processes – we can simply execute Python tasks in the Ray cluster.
If there is enough interest and support, we could start working on bringing the Ray SQL code into the DataFusion Python project.
Describe the solution you'd like
Bringing the Ray SQL prototype into the DataFusion Python project
Describe alternatives you've considered
Building a new version inspired by the Ray SQL code
Additional context
We might need to go through the Apache IP clearance process for importing the external Ray SQL codebase in
datafusion-contrib
which is not part of Apache.The text was updated successfully, but these errors were encountered: