Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support reading and writingStringView and BinaryView in parquet (part 2) #5557

Closed
wants to merge 5 commits into from

Conversation

ariesdevil
Copy link
Contributor

@ariesdevil ariesdevil commented Mar 26, 2024

Which issue does this PR close?

Closes #5530

Rationale for this change

Use view_buffer instead of offset_buffer for view type parquet reader

What changes are included in this PR?

Use view_buffer instead of offset_buffer for view type parquet reader

Are there any user-facing changes?

Yes

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Mar 26, 2024
@ariesdevil ariesdevil force-pushed the parquet branch 5 times, most recently from 17695d3 to 1c7260e Compare March 28, 2024 16:23
@ariesdevil
Copy link
Contributor Author

The parquet build for wasm32 check always failed, but it works on my machine...

@ariesdevil ariesdevil marked this pull request as ready for review March 28, 2024 16:33
@ariesdevil
Copy link
Contributor Author

Hi @alamb @tustvold , this PR is not really ready for review, however, I may need some advice, and then I'll continue.

@alamb
Copy link
Contributor

alamb commented Mar 28, 2024

Thanks @ariesdevil -- I think @tustvold is away for a few days. I will try and give this PR a look over the next day or two. Very exciting!

cc @XiangpengHao

@ariesdevil ariesdevil force-pushed the parquet branch 2 times, most recently from d2f1a30 to fdd2e48 Compare March 31, 2024 15:18
@alamb alamb changed the title feat: support string and binary view read write parquet feat: support reading and writingStringView and BinaryView in parquet Apr 1, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ariesdevil -- I think this is looking close to something we can merge 🙏

To merge this, I think we need:

  1. Benchmarks for reading/writing StringView / BinaryViews
  2. Cover a few more cases for round tripping

I think there is quite a bit more performance to be had as well by avoiding string copies. This PR will copy the data I think twice (once out of parquet buffer and once out of the offset buffer).

It would be nice to avoid at least one of those copies prior to merge, but I also think once we have the test coverage it would be ok to merge this PR and then do additional optimizations as follow on PRs

We can add benchmarks here:

  • reading
    // string benchmarks
    //==============================
  • writing:
    let batch = create_string_bench_batch(4096, 0.25, 0.75).unwrap();
    group.throughput(Throughput::Bytes(
    batch
    .columns()
    .iter()
    .map(|f| f.get_array_memory_size() as u64)
    .sum(),
    ));
    group.bench_function("4096 values string", |b| {
    b.iter(|| write_batch(&batch).unwrap())
    });

End to end round trip tests:

parquet/src/arrow/buffer/offset_buffer.rs Outdated Show resolved Hide resolved
parquet/src/arrow/arrow_writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/arrow/arrow_writer/mod.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Apr 1, 2024

This is really nice step forward -- thank you @ariesdevil

@ariesdevil ariesdevil force-pushed the parquet branch 2 times, most recently from f37c3dd to f65807f Compare April 2, 2024 15:17
alamb
alamb previously approved these changes Apr 2, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ariesdevil -- I think this PR looks like a really nice incremental step.

As I think we have identified, it can be significantly improved performance wise but having the basic tests and benchmarks in place we are in a good space to optimize it.

I left some other smaller style / documentation comments that might be nice to cleanup, but I don't think they are strictly required

Let's wait another day for @tustvold to see if he wants to review (i think he is still not back yet) but otherwise we'll merge this PR and add additional features as a follow on PR

arrow-array/src/array/byte_view_array.rs Outdated Show resolved Hide resolved
arrow/src/util/bench_util.rs Outdated Show resolved Hide resolved
parquet/benches/arrow_writer.rs Outdated Show resolved Hide resolved
parquet/src/arrow/buffer/offset_buffer.rs Outdated Show resolved Hide resolved
@ariesdevil ariesdevil force-pushed the parquet branch 3 times, most recently from a22ee14 to d286521 Compare April 4, 2024 10:56
@alamb
Copy link
Contributor

alamb commented Apr 4, 2024

I plan to give this another review today

@alamb
Copy link
Contributor

alamb commented Apr 5, 2024

The integration test https://github.com/apache/arrow-rs/actions/runs/8553472637/job/23436722309?pr=5557 seems to have failed for reasons unrelated to this PR. I have restarted it

################# FAILURES #################
FAILED TEST: union Rust producing,  Rust consuming
<class 'RuntimeError'>: Command failed: /build/rust/debug/flight-test-integration-client --host localhost --port=34799 --path /tmp/arrow-integration-92iwbbih/generated_union.json
With output:
--------------
Error: tonic::transport::Error(Transport, hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })))

@alamb
Copy link
Contributor

alamb commented Apr 5, 2024

I am sorry @ariesdevil -- I need more time and a fresh set of eyes to review the new implementation in this PR. I will find time over the next few days.

@ariesdevil
Copy link
Contributor Author

Hi @alamb ,the new implementation has indeed changed a lot. It should have been implemented in another PR, but the original performance is really poor.

Thanks for your patience and have a nice weekend.

@alamb
Copy link
Contributor

alamb commented Apr 12, 2024

  • Validate utf8 data (I left comments). I think we need a test for that too -- I'll make a PR to make that easier to add.

I wrote new tests in #5639

@ariesdevil ariesdevil force-pushed the parquet branch 3 times, most recently from fef3c9f to cbf08f4 Compare April 15, 2024 13:13
@alamb
Copy link
Contributor

alamb commented Apr 16, 2024

I am on vacation this week but I hope to find some time later today or tomorrow to give this another look. Thanks @ariesdevil

@ariesdevil
Copy link
Contributor Author

No need to rush, enjoy your vacation, just so I can have time to rethink how to continue optimization 😃

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

I see some new updates -- I plan to look at this PR later this week (probably won't be until Wednesday)

@XiangpengHao
Copy link
Contributor

Hi @ariesdevil, thanks for the great work! I'm looking at this PR these days and trying to push it forward. However, I do feel the code change is too large for a single PR that it is quite difficult to understand every detail of the implementation.

I suggest we break the PR into even smaller pieces and incrementally add the features/optimizations so it poses less mental burdens on the reviewers, what do you think?

I'll try to consolidate a few low-hanging fruit changes and try to get them merged first; let me know your thoughts!

@ariesdevil
Copy link
Contributor Author

Hi @XiangpengHao, thanks for your reply. I don't think this PR is very large, the previous reviewers are already very familiar with this PR, this PR has been submitted for almost three months, and the reviewer has had enough time to get familiar with it.

@alamb
Copy link
Contributor

alamb commented Jun 13, 2024

Hi @XiangpengHao, thanks for your reply. I don't think this PR is very large, the previous reviewers are already very familiar with this PR, this PR has been submitted for almost three months, and the reviewer has had enough time to get familiar with it.

Indeed -- I am sorry this PR is stalled @ariesdevil -- it has been very hard for me to find enough contiguous time to get familar enough with this PR and really understand it (and ensure the test coverage is adequate, etc).

@ariesdevil do you think we should attempt to merge this entire PR at once? I do think it would help speed up review if we could break it into some smaller PRs

@ariesdevil
Copy link
Contributor Author

Hi @alamb, I understand how busy you must be maintaining multiple projects and writing an important paper simultaneously.

As we now have try_append_view implemented, we no need ViewBuffer any more, I'll rebase this PR after #5877 merged. After that, this PR should be smaller.

@alamb
Copy link
Contributor

alamb commented Jun 13, 2024

Hi @alamb, I understand how busy you must be maintaining multiple projects and writing an important paper simultaneously.

Thank you for understanding -- I feel bad whenever PRs get stuck.

As we now have try_append_view implemented, we no need ViewBuffer any more, I'll rebase this PR after #5877 merged. After that, this PR should be smaller.

That is awesome. THank you. I will make every effort to help get this over the line and now that @XiangpengHao is helping I am optimistic we can get it in shortly

@alamb
Copy link
Contributor

alamb commented Jun 13, 2024

#5877 is merged now!

@ariesdevil ariesdevil force-pushed the parquet branch 2 times, most recently from 5730aa0 to 7fe2847 Compare June 15, 2024 02:46
@ariesdevil
Copy link
Contributor Author

closed by #5877

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement StringViewArray and BinaryViewArray reading/writing in parquet
6 participants