feat: support reading and writing StringView and BinaryView in parquet (part 1) #5618

Merged on Apr 9, 2024 (5 commits)


alamb
Contributor

@alamb alamb commented Apr 9, 2024

Which issue does this PR close?

First part of #5530

Rationale for this change

These are the first 3 commits from #5557 by @ariesdevil, which add initial support for reading StringViewArray and BinaryViewArray from parquet.

The performance is not ideal (it copies string data several times), but it does include benchmarks and tests.

Thus I would like to merge this in and then continue iterating on the design in #5557, to keep the PRs smaller and more manageable.

What changes are included in this PR?

  1. Basic support for reading/writing StringViewArray and BinaryViewArray from parquet
  2. Tests for same
  3. Benchmarks

Are there any user-facing changes?

The arrow reader/writer can now read/write StringViewArray and BinaryViewArray.
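
For readers unfamiliar with the view types: in the Arrow columnar format, each value of a StringView/BinaryView array is described by a 16-byte view. Values of up to 12 bytes are stored inline in the view, while longer values store a 4-byte prefix plus a reference (buffer index and offset) into a separate data buffer. The following is a minimal, std-only sketch of that layout. The function names (`make_view`, `view_len`, `resolve`) are hypothetical and chosen for illustration only; this is not the arrow-rs API and not the code added in this PR.

```rust
// Illustrative, std-only sketch of the 16-byte "view" layout described in the
// Arrow columnar format. All names here are hypothetical, not arrow-rs APIs.

/// Encode one value as a 16-byte view (little-endian fields).
/// `buffer_index` and `offset` say where a long value lives in the data
/// buffers; they are ignored for values of 12 bytes or fewer.
fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> [u8; 16] {
    let mut view = [0u8; 16];
    let len = value.len() as u32;
    view[0..4].copy_from_slice(&len.to_le_bytes());
    if value.len() <= 12 {
        // Short values are stored inline; remaining bytes stay zero.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values keep a 4-byte prefix plus a (buffer, offset) reference.
        view[4..8].copy_from_slice(&value[0..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Read the length field stored in the first 4 bytes of a view.
fn view_len(view: &[u8; 16]) -> usize {
    u32::from_le_bytes([view[0], view[1], view[2], view[3]]) as usize
}

/// Resolve a view back to its bytes, borrowing either from the view itself
/// (inline case) or from the referenced data buffer (long case).
fn resolve<'a>(view: &'a [u8; 16], buffers: &'a [Vec<u8>]) -> &'a [u8] {
    let len = view_len(view);
    if len <= 12 {
        &view[4..4 + len]
    } else {
        let buf = u32::from_le_bytes([view[8], view[9], view[10], view[11]]) as usize;
        let off = u32::from_le_bytes([view[12], view[13], view[14], view[15]]) as usize;
        &buffers[buf][off..off + len]
    }
}
```

In this sketch, resolving a long value only borrows from the data buffer, so a view can be built by pointing at bytes that already exist in a buffer instead of copying them.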

@github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Apr 9, 2024
@mapleFU
Member

mapleFU commented Apr 9, 2024

Just a minor question: when writing parquet, arrow serializes its own schema. Would BinaryView/StringView keep its schema, or would Binary/String be used here?

@ariesdevil
Contributor

Thanks for splitting this PR into small parts, looks good to me.

@ariesdevil
Contributor

> Just a minor question: when writing parquet, arrow serializes its own schema. Would BinaryView/StringView keep its schema, or would Binary/String be used here?

The view types will keep their schema.

@alamb
Contributor Author

alamb commented Apr 9, 2024

Merging this one in, as it doesn't expose any public APIs and was already reviewed as part of the prior PR #5557 (review).

Let's continue the conversation on #5557.

@alamb alamb merged commit 91f0b17 into apache:master Apr 9, 2024
27 checks passed
@alamb alamb deleted the alamb/parquet_slow branch April 9, 2024 19:32
@alamb
Contributor Author

alamb commented Apr 9, 2024

Thanks again @ariesdevil and @mapleFU
