discuss: Move into the Apache ORC PMC and develop as apache/orc-rust #120

Open
Xuanwo opened this issue Aug 14, 2024 · 53 comments

Comments

@Xuanwo
Collaborator

Xuanwo commented Aug 14, 2024

Hello, everyone. I am initiating this discussion to explore the possibility of moving into the Apache ORC PMC and developing apache/orc-rust.

By developing apache/orc-rust, we will establish this implementation as the official Rust version of ORC, thereby creating a larger and more cohesive community for those interested in a Rust ORC implementation. This will make it much easier for us to build a community around this project.

What are your thoughts? I plan to discuss this with the ORC community if contributors are satisfied with it.


cc @Jefffrey @WenyXu @progval @waynexia @klangner @alamb @v0y4g3r @youngsofun @harveyyue

@waynexia
Collaborator

As for where to move, we can also consider Arrow or Datafusion, given this repo is deeply involved with Apache Arrow and Datafusion at the API level. Like the fact that parquet-rs has been maintained in arrow-rs for a long time.

@progval
Contributor

progval commented Aug 14, 2024

given this repo is deeply involved with Apache Arrow and Datafusion at the API level

To be honest I find it a bit surprising to have integration with a query engine in an ORC library. Would it make sense to split the Datafusion-related bits into either their own crate (with all the fun of keeping versions in sync), or move them to Datafusion (like it already does with Parquet)?

@waynexia
Collaborator

@Xuanwo
Collaborator Author

Xuanwo commented Aug 14, 2024

Hi, I wasn't involved in the datafusion side of development, so I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby, datafusion-orc seems more like Datafusion integration rather than focusing solely on the ORC format. I'm a bit worried that this might reduce the likelihood of potential users or contributors finding us.

I agree with @progval that it would be better to separate the ORC support component directly into DataFusion, similar to how parquet is handled.

@Xuanwo
Collaborator Author

Xuanwo commented Aug 14, 2024

We have three possible options:

  • apache/orc-rust
  • apache/arrow-rs/orc
  • apache/datafusion-orc

@wgtmac

wgtmac commented Aug 14, 2024

Chiming in from the Apache ORC community. I'm very excited about the discussion! Sorry that I'm not familiar with Rust. For the apache/orc-rust approach, I'd like to know what's the current dependency and what's the estimated amount of work to split the repository to remove the datafusion integration part?

@Xuanwo
Collaborator Author

Xuanwo commented Aug 14, 2024

I'd like to know what's the current dependency

I created a dependency list, and I believe it meets the requirements of the ASF.

Please check out the details here:

Details

0BSD (1)
Apache-2.0 (114)
Apache-2.0 WITH LLVM-exception (1)
BSD-2-Clause (1)
BSD-3-Clause (1)
BSL-1.0 (1)
CC0-1.0 (1)
MIT (115)
Unicode-DFS-2016 (1)
Unlicense (5)
Zlib (1)

what's the estimated amount of work to split the repository to remove the datafusion integration part?

I believe it should be simple since it's just a mod of orc-rust. I'm willing to take on this part of the work.

@waynexia
Collaborator

I'm not familiar with the ORC and Datafusion integration. From my perspective as a passerby, datafusion-orc seems more like Datafusion integration rather than focusing solely on the ORC format. I'm a bit worried that this might reduce the likelihood

I found a similar question about the relationship between arrow-rs and parquet-rs in apache/arrow-rs#1715. I believe this repo was developed and maintained for the same purpose.

However, if we are going to implement features that are not in strong demand on the Datafusion side (like the ORC writer, apache/orc#1507) or integrate with other consumers (like Databend, databendlabs/databend#8016), having a dedicated repo would both reduce the maintenance burden on Datafusion and make the library itself easier to use.

I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like apache/orc-rust with maintenance from both current contributors and the ORC community. And Datafusion uses it as a downstream user to implement ORC data source. I would like to help with both code work like splitting this code base and non-code work like IP clearance.

@alamb
Contributor

alamb commented Aug 14, 2024

I agree with the opinion of separating this into two parts. The ORC format resides in a dedicated repo like apache/orc-rust with maintenance from both current contributors and the ORC community. And Datafusion uses it as a downstream user to implement ORC data source. I would like to help with both code work like splitting this code base and non-code work like IP clearance.

I agree with @waynexia and @progval that the following split makes a lot of sense to me

  1. something like apache/orc-rs (similar to parquet in parquet-rs) that has no datafusion dependencies
  2. this crate, datafusion-contrib/datafusion-orc, which has the DataFusion table provider, depends on apache/orc-rs as well as DataFusion, and does the DataFusion integration

@alamb
Contributor

alamb commented Aug 14, 2024

Like the fact that parquet-rs has been maintained in arrow-rs for a long time.

FWIW I think this was partly an artifact of history:

  1. there was a time when the parquet PMC was largely focused on java
  2. arrow needed a persistence format and parquet was an obvious choice
  3. so parquet-cpp got made in the arrow repo
  4. we basically followed the same pattern with arrow-rs / parquet-rs (in the arrow repo)

Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there

@wgtmac

wgtmac commented Aug 15, 2024

cc @dongjoon-hyun @guiyanakuang @williamhyun @omalley from Apache ORC PMC

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

Given the current state of the code, I think it would be plausible to split parquet out of arrow-rs, but I also think unless there is some substantially larger group of maintainers that aren't also maintainers of arrow-rs it is likely easier to leave it there

Agreed. I have thought about this before but haven't taken any action yet. I mean, it looks appealing to have apache/parquet-rs, but we need to consider the current project status.

I'm starting this thread because I believe it's beneficial for orc-rs to build a community by developing at upstream, but it doesn't seem applicable to parquet-rs at the moment.

@mapleFU

mapleFU commented Aug 15, 2024

Previously we discussed splitting parquet-cpp out of arrow-c++. However, the dependency chain would be awkward, since we have:

arrow-dataset -> parquet-arrow -> parquet-core -> some arrow core libs

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

arrow-dataset -> parquet-arrow -> parquet-core -> some arrow core libs

I believe the situation is different in parquet-rs, since it depends on arrow-rs but not the reverse. However, this is not the focus of our discussion. We can start another thread for this if anyone is interested.

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

Thank you for the discussion. It looks like we can move forward! I think we can:

  • Set up datafusion-contrib/orc-rs first and split the ORC-related code into it (it's better to retain all the history).
  • Move all issues to datafusion-contrib/orc-rs.
  • Send the IP clearance to the ORC PMC.
  • Transfer datafusion-contrib/orc-rs to apache/orc-rs.

cc @alamb @waynexia @wgtmac for comments.

@waynexia
Collaborator

I'll start preparing a PR to split the current repo.

Do you have something like guidance for IP clearance? I have attended it before but have not prepared one.

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

I'll start preparing a PR to split the current repo.

Thanks!

Do you have something like guidance for IP clearance? I have attended it before but have not prepared one.

I think we can follow https://incubator.apache.org/ip-clearance/

Here's an example from apache/arrow-rs#2096. We can reach out to @alamb if we encounter any problems.

@waynexia
Collaborator

Hi @progval @klangner, as part of the IP clearance process, could you please submit an ICLA (Individual Contributor License Agreement) following the instructions at https://www.apache.org/licenses/contributor-agreements.html if you do not already have one on file? Thanks in advance for helping with this! If you have already filed one, please let me know the email address associated with your account.

@Jefffrey
Collaborator

I would like to chime in with my thoughts. I do apologize for being inactive, and have been meaning to pick up the work I left off on this repository (specifically the basic write functionality).

The way I see it, the primary focus of this repository is to serve as an integration with DataFusion to allow querying ORC files. Naturally this required first implementing a layer to read ORC files to Arrow, before then being able to integrate into DataFusion itself (similar to how there is parquet-rs, then the actual parquet integration code in DataFusion).

I can see the merit to splitting up this repository, but perhaps it is still too early to do so? One benefit of having both the integration with Arrow and integration with DataFusion in a single repository is that it allows easier development, as these interfaces will be interacting with each other. Splitting across different repositories might make it harder to experiment with the interface for each respective integration, which can slow down development.

Furthermore, I don't think there were any immediate plans to develop a native ORC interface; that is, being able to read ORC in Rust without reading it to Arrow (similar to how parquet-rs has a low level column reader/writer API). From my point of view then, it might seem odd to donate a primarily Arrow <-> ORC interface library to ORC.

@klangner
Contributor

I think I have already signed it some time ago while doing some other work.

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

I would like to chime in with my thoughts. I do apologize for being inactive, and have been meaning to pick up the work I left off on this repository (specifically the basic write functionality).

Thank you very much for your contribution!

I can see the merit to splitting up this repository, but perhaps it is still too early to do so?

From my perspective (as a committer on some Apache projects), it's already late for us to do so.

Developing at upstream can create a solid foundation for our entire community to build upon, making it easier for those interested in using ORC in Rust to find this project. Additionally, we can garner more support from the ORC community. Building a strong community is the key to our success. For example, we started iceberg-rust as a very basic project that could only read tables, but it has now grown to 53 contributors with full catalog support. By donating this to ORC, I expect to build a community around it, similar to what we've done with iceberg-rust.

Therefore, instead of waiting for our project to mature and gain full support, I prefer to start and attract more people to join now. I believe it's fine for us to use the existing orc -> arrow code base as a starting point.

Furthermore, I don't think there were any immediate plans to develop a native ORC interface;

I agree, but it depends on the community's feature requests. I would be happy to work with the community if someone wants to collaborate on this.

@alamb
Contributor

alamb commented Aug 15, 2024

Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.

I do not know how likely that is at this point, though

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.

I have three such cases on my table:

  • I (of course!) want to build both orc-rs and datafusion-orc separately.
  • @youngsofun from databend wants native ORC support but not datafusion.
  • @wgtmac from the Apache ORC PMC is interested in a Rust ORC implementation.

@Aitozi

Aitozi commented Aug 15, 2024

Yes -- I think one potential benefit to splitting out orc-rs would be that others who are not using it in the context of DataFusion might be more willing to help with the development.

I have three such cases on my table:

  • I (of course!) want to build both orc-rs and datafusion-orc separately.
  • @youngsofun from databend wants native ORC support but not datafusion.
  • @wgtmac from the Apache ORC PMC is interested in a Rust ORC implementation.

paimon-rust may also need native ORC support but not datafusion

@Xuanwo
Collaborator Author

Xuanwo commented Aug 15, 2024

paimon-rust may also need native ORC support but not datafusion

This case is interesting since paimon-rust will need datafusion but not require orc with datafusion. Paimon requires orc to read data but provides datafusion integration on its own.

@XuQianJin-Stars

Is there still the parquet-rust project?

@alamb
Contributor

alamb commented Aug 15, 2024

Is there still the parquet-rust project?

I do not know what parquet-rust refers to

https://parquet.apache.org/docs/contribution-guidelines/sub-projects/ has a list of open source rust implementations

parquet-rs refers to https://github.com/apache/arrow-rs/tree/master/parquet

@Xuanwo
Collaborator Author

Xuanwo commented Aug 16, 2024

Hi, @alamb. This reminds me that we should establish the CLA for all projects in the datafusion-contrib organization. All contributors should agree that contributions to projects under datafusion-contrib will grant the license to the ASF. Please correct me if this isn't meant for datafusion-contrib.

@alamb
Contributor

alamb commented Aug 16, 2024

Hi, @alamb. This reminds me that we should establish the CLA for all projects in the datafusion-contrib organization. All contributors should agree that contributions to projects under datafusion-contrib will grant the license to the ASF. Please correct me if this isn't meant for datafusion-contrib.

I think the idea with datafusion-contrib is to minimize process overhead (such as apache CLAs) and mostly serve as a very disparate set of crates. As they mature, we can then apply more process (as we are doing in this case)

The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions)

@Xuanwo
Collaborator Author

Xuanwo commented Aug 16, 2024

The rationale is that many of the crates in datafusion-contrib will likely never get to the stage where they would be donated to the Apache foundation and thus any up-front cost to prepare for that is wasted effort (and thus reduces contributions)

Understood, thank you. This design makes sense to me.

@Jefffrey
Collaborator

I see mention of not needing the DataFusion integration code as motivation, but could this be addressed by splitting the current project to have two subcrates, one for pure Arrow-ORC and the other for DataFusion integration?

I wanted to do this initially but kept DataFusion as a feature to make it easier to develop with, especially since the DataFusion integration code is currently quite small (though I guess the dependency footprint isn't 😅 )
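A minimal sketch of the "DataFusion as a feature" layout described above, assuming a single crate with a Cargo feature named `datafusion`. Every module, type, and function name here is hypothetical and chosen for illustration; none of it is the actual API of this repository. The core reader is always compiled, while the integration module is compiled only when the feature is enabled.

```rust
// Hypothetical single-crate layout: the core reader is always built, while
// the query-engine glue is built only with `--features datafusion`.
// All names are illustrative assumptions, not the real crate's API.

/// Core layer: stands in for the ORC -> Arrow reading code.
pub mod core_reader {
    /// Stand-in for a decoded batch of rows.
    #[derive(Debug, PartialEq)]
    pub struct Batch(pub Vec<i64>);

    /// Splits `values` into batches of at most `batch_size` rows.
    pub fn read_batches(values: &[i64], batch_size: usize) -> Vec<Batch> {
        assert!(batch_size > 0, "batch_size must be positive");
        values
            .chunks(batch_size)
            .map(|chunk| Batch(chunk.to_vec()))
            .collect()
    }
}

/// Integration layer: compiled only when the (assumed) `datafusion` feature
/// is on, so users of the core reader avoid the extra dependency footprint.
#[cfg(feature = "datafusion")]
pub mod datafusion_integration {
    use super::core_reader::{read_batches, Batch};

    /// Stand-in for a table provider that wraps the core reader.
    pub struct OrcTable {
        pub batches: Vec<Batch>,
    }

    impl OrcTable {
        /// Builds the table by delegating to the core reader.
        pub fn scan(values: &[i64]) -> Self {
            OrcTable {
                batches: read_batches(values, 1024),
            }
        }
    }
}
```

Splitting into two subcrates would move `datafusion_integration` into its own crate that depends on the core one. The code stays the same; the difference is whether the boundary is a feature flag or a crate boundary.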

@Xuanwo
Collaborator Author

Xuanwo commented Sep 11, 2024

I believe I also need a CCLA from my previous employer, as well as the current one (since 2024-08-01). This may take a few weeks. I'll tell you when it's done.

Hi, @progval. Sorry for the interruption. I wanted to check if it works well.

@progval
Contributor

progval commented Sep 11, 2024

Hi, this project does not belong to your employer (please correct me if I'm wrong). This donation will be sent from datafusion-contrib to orc. I believe an ICLA is sufficient.

https://www.apache.org/licenses/contributor-agreements.html says CCLAs are "For a corporation that assigns employees to work on an Apache project", and I was an employee assigned to work on the project.

Either way, I need my ex-employer's permission for the ICLA

Hi, @progval. Sorry for the interruption. I wanted to check if it works well.

My ex-employer's staff came back from summer vacation this week, and they are about to start processing my request. Current employer won't be an issue.

Sorry for the delay.

@progval
Contributor

progval commented Oct 22, 2024

I just submitted my ICLA, and a CCLA from my ex-employer who owns all my past contributions.

I also started the process of getting a CCLA from my current employer, and will abstain from contributing until I can get it. (This is now resolved.)

@Xuanwo
Collaborator Author

Xuanwo commented Oct 22, 2024

I just submitted my ICLA, and a CCLA from my ex-employer who owns all my past contributions.

Wow, really great!


cc @waynexia, are you still interested in working on this? The action items are:

  • Set up datafusion-contrib/orc-rs first and split the ORC-related code into it (it's better to retain all the history).
  • Move all issues to datafusion-contrib/orc-rs.
  • Send the IP clearance to the ORC PMC.
  • Transfer datafusion-contrib/orc-rs to apache/orc-rs.

Please let me know if you need a hand.

@Xuanwo
Collaborator Author

Xuanwo commented Oct 22, 2024

cc @alamb, would you like to help create datafusion-contrib/orc-rs first? Or we can just rename this repo to orc-rs?

@waynexia
Collaborator

waynexia commented Oct 22, 2024

That's great news!

cc @waynexia, are you still interested in working on this?

I'm resuming the IP clearance procedure, and will post any further problems in this thread.

Edit: as well as the code split things

@alamb
Contributor

alamb commented Oct 22, 2024

https://github.com/datafusion-contrib/orc-rs is set up with @Xuanwo and @waynexia as admins

@waynexia
Collaborator

IP Clearance file is updated to https://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/orc-rs.xml

@waynexia
Collaborator

Cross-referencing the mailing list thread: https://lists.apache.org/thread/l6b0hsq29rr6to96tqmjpxt2mwz4nzbc

@Xuanwo
Collaborator Author

Xuanwo commented Oct 23, 2024

Hi @alamb, apologies for my mistake. The repository should be named orc-rust instead. orc-rs is a different Rust crate that has been abandoned.

I'm going to rename it now, just FYI.

@waynexia
Collaborator

Donation PR: datafusion-contrib/orc-rust#1

@Xuanwo
Collaborator Author

Xuanwo commented Oct 24, 2024

Donation PR: datafusion-contrib/orc-rust#1

Merged! We can remove duplicate code in this repo and transfer issues to new repositories now.

@waynexia
Collaborator

Great 🎉

There is a temporary branch datafusion-integration made by splitting the current repo. But I'm not sure which approach is preferred:

  • Send a PR to this repo's main branch to remove donated code, or
  • Make that branch the default new main of this repo

Context: from the above discussion this repo will only focus on DataFusion-ORC data source integration in the future.

@Xuanwo
Collaborator Author

Xuanwo commented Oct 25, 2024

I personally feel that it's better to send a PR to remove the donated code. Having a repo where the default branch is not main could be confusing.

@Xuanwo
Collaborator Author

Xuanwo commented Oct 29, 2024

Hi, @waynexia, I believe we are ok to implement this change.

@waynexia
Collaborator

Hi, @waynexia, I believe we are ok to implement this change.

Thanks for the reminder 🙈 Tonight I'll file a PR against the current main to remove the ORC implementation and use the released upstream instead.

@waynexia
Collaborator

Progress update: the entire process is almost done, unless I've missed anything (code split, IP clearance, transferring issues and tags, etc.). The one remaining step is waiting for the ORC PMC to accept https://github.com/datafusion-contrib/orc-rust

@Xuanwo
Collaborator Author

Xuanwo commented Oct 30, 2024

Progress update: the entire process is almost done if I don't miss anything (code split, ip clearance, transferring issue & tag etc). One last remaining thing is waiting for the ORC PMC to accept datafusion-contrib/orc-rust

Thank you! cc @wgtmac, would you like to start a VOTE for this?

@wgtmac

wgtmac commented Oct 30, 2024

Thanks for the heads up! @Xuanwo

Could you please provide the list of committers that will join the Apache ORC PMC? I will include this in the vote as well.

@Xuanwo
Collaborator Author

Xuanwo commented Oct 30, 2024

Thanks for the heads up! @Xuanwo

Could you please provide the list of committers that will join the Apache ORC PMC? I will include this in the vote as well.

I propose the top 5 contributors to this project.
