feat: Initial ORC file format support - WIP #4981

Closed
Jefffrey wants to merge 2 commits

@Jefffrey Jefffrey commented Oct 24, 2023

Which issue does this PR close?

Relates to #4980

Rationale for this change

Initial support for reading ORC files

What changes are included in this PR?

A new synchronous ORC file reader that implements a RecordBatch iterator

Supported datatypes:

  • Smallint
  • Int
  • Bigint
  • Float
  • Double
  • String
  • Char
  • Varchar
  • Boolean
  • Tinyint
  • Binary
  • Date

Unsupported:

  • Decimal
  • Timestamps
  • Compound types (nested struct, list, map, union)
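
As a rough illustration of the supported/unsupported split above, here is a minimal sketch of a type-mapping function. The `OrcType` and `ArrowType` enums and the `to_arrow` function are hypothetical stand-ins, not this PR's API, and the specific ORC-to-Arrow pairings are an assumption (one plausible choice); only the set of supported and unsupported types comes from the lists above.

```rust
/// ORC types named in the PR description (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OrcType {
    Boolean, Tinyint, Smallint, Int, Bigint, Float, Double,
    String, Char, Varchar, Binary, Date,
    // Listed as unsupported in this PR:
    Decimal, Timestamp, Struct, List, Map, Union,
}

/// Simplified stand-in for an Arrow data type (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArrowType {
    Boolean, Int8, Int16, Int32, Int64, Float32, Float64, Utf8, Binary, Date32,
}

/// Returns the assumed Arrow type for a supported ORC type,
/// or `None` for the types the PR lists as unsupported.
pub fn to_arrow(orc: OrcType) -> Option<ArrowType> {
    let mapped = match orc {
        OrcType::Boolean => ArrowType::Boolean,
        OrcType::Tinyint => ArrowType::Int8,
        OrcType::Smallint => ArrowType::Int16,
        OrcType::Int => ArrowType::Int32,
        OrcType::Bigint => ArrowType::Int64,
        OrcType::Float => ArrowType::Float32,
        OrcType::Double => ArrowType::Float64,
        // Assumption: all three ORC string-like types map to one UTF-8 type.
        OrcType::String | OrcType::Char | OrcType::Varchar => ArrowType::Utf8,
        OrcType::Binary => ArrowType::Binary,
        OrcType::Date => ArrowType::Date32,
        // Unsupported in this PR: decimal, timestamps, compound types.
        OrcType::Decimal
        | OrcType::Timestamp
        | OrcType::Struct
        | OrcType::List
        | OrcType::Map
        | OrcType::Union => return None,
    };
    Some(mapped)
}
```

Returning `Option` lets a caller reject a file with an unsupported column up front instead of failing partway through a read.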

Currently this does a simple read of ORC files into RecordBatches, with no filtering support, though it does support top-level projection.
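
To show what top-level projection means here, a minimal sketch follows. `Column` and `project` are hypothetical names for illustration only and are not taken from this PR's code; the idea is just that the caller picks a subset of the root struct's columns by index and the reader decodes only those.

```rust
/// A named top-level column (hypothetical stand-in for a schema field).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Column {
    pub name: String,
}

/// Selects top-level columns by index, in the order requested.
/// Returns an error if any index is out of range.
pub fn project(columns: &[Column], indices: &[usize]) -> Result<Vec<Column>, String> {
    indices
        .iter()
        .map(|&i| {
            columns.get(i).cloned().ok_or_else(|| {
                format!("column index {} out of range ({} columns)", i, columns.len())
            })
        })
        .collect()
}
```

For example, projecting `[2, 0]` over columns `a, b, c` yields `c, a`. Nested fields cannot be selected individually in this sketch, which matches the "top-level only" limitation described above.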

Supports all compression codecs: none, zlib, snappy, lzo, lz4, zstd (actual decompression is delegated to other Rust crates)
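
For context on what the reader has to do before handing bytes to a decompression crate, here is a minimal sketch based on the ORC specification rather than on this PR's code. The numeric values follow the spec's `CompressionKind` enum, and the 3-byte chunk header is a little-endian value equal to `length * 2 + is_original`. The Rust names (`CompressionKind::from_proto`, `ChunkHeader`, `decode_chunk_header`) are hypothetical.

```rust
/// Compression codecs listed above; discriminants follow the ORC spec's
/// `CompressionKind` protobuf enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompressionKind {
    None = 0,
    Zlib = 1,
    Snappy = 2,
    Lzo = 3,
    Lz4 = 4,
    Zstd = 5,
}

impl CompressionKind {
    /// Converts the raw protobuf value; unknown values yield `Option::None`.
    pub fn from_proto(value: i32) -> Option<Self> {
        match value {
            0 => Some(Self::None),
            1 => Some(Self::Zlib),
            2 => Some(Self::Snappy),
            3 => Some(Self::Lzo),
            4 => Some(Self::Lz4),
            5 => Some(Self::Zstd),
            _ => Option::None,
        }
    }
}

/// Decoded 3-byte header that precedes each chunk of a compressed stream.
/// (Files written with no compression have no chunk headers at all.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ChunkHeader {
    /// Number of bytes in the chunk body that follows the header.
    pub length: usize,
    /// True if the body is stored as-is and must not be decompressed.
    pub is_original: bool,
}

/// Decodes the little-endian header: low bit is the "original" flag,
/// the remaining 23 bits are the chunk length.
pub fn decode_chunk_header(bytes: [u8; 3]) -> ChunkHeader {
    let raw = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], 0]);
    ChunkHeader {
        length: (raw >> 1) as usize,
        is_original: raw & 1 == 1,
    }
}
```

Using the examples from the ORC specification, the header `[0x40, 0x0d, 0x03]` decodes to a compressed chunk of 100,000 bytes, and `[0x0b, 0x00, 0x00]` decodes to an original (uncompressed) chunk of 5 bytes.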

Are there any user-facing changes?

New crate

@github-actions github-actions bot added the "arrow" label (Changes to the arrow crate) on Oct 24, 2023
@Jefffrey
Contributor Author

I've marked it as WIP, but for now I mainly intend to clean up the code a bit, add more tests, and improve the docs.

@tustvold
Contributor

Could we perhaps split this into smaller pieces of less than 1000 lines a go?

@alamb
Contributor

alamb commented Oct 24, 2023

> Could we perhaps split this into smaller pieces of less than 1000 lines a go?

Maybe we could start with the basic scaffolding for the arrow-orc crate (and CI checks, etc) and move on from there

@Jefffrey Jefffrey force-pushed the orc_initial_support branch from 7acd3a3 to c4bbd56 on October 25, 2023 10:24
@Jefffrey
Contributor Author

I've tried cutting it down to a more minimal initial version. Note that the proto stuff (the .proto file and the generated Rust) makes up about 1.2k of the 2.4k total lines changed, but I can cut it down further if desirable.

@alamb @tustvold

@alamb
Contributor

alamb commented Oct 26, 2023

Hi @Jefffrey -- thanks for the ping. I don't think I am going to have time to review this PR for a while unfortunately

@alamb
Contributor

alamb commented Oct 31, 2023

I left a comment about review capacity here: #4980 (comment)

@Jefffrey
Contributor Author

Jefffrey commented Nov 3, 2023

Following the discussion in #4980, I will focus on implementing ORC file format support in https://github.com/datafusion-contrib/datafusion-orc, with the aim of eventually bringing it into arrow-rs once it's more feature complete. Closing this PR.

@Jefffrey Jefffrey closed this Nov 3, 2023
@Jefffrey Jefffrey deleted the orc_initial_support branch April 22, 2024 11:28