GH-41664: [C++][Python] PrettyPrint non-cpu data by copying to default CPU device #42010
Conversation
@pitrou I started looking into this on the C++ side, but with the current approach of slice+copy+concat, I could also easily do this just on the Python side before passing the data to
Yes, it's certainly useful on the CPU side as well.
def make_chunked_array(n_elements_per_chunk, n_chunks):
    arrs = []
    carrs = []
    for _ in range(n_chunks):
        batch = make_recordbatch(n_elements_per_chunk)
        cbuf = cuda.serialize_record_batch(batch, global_context)
        cbatch = cuda.read_record_batch(cbuf, batch.schema)
        arrs.append(batch["f0"])
        carrs.append(cbatch["f0"])

    return pa.chunked_array(arrs), pa.chunked_array(carrs)
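For context, a sketch of how a test might exercise this helper (assuming the surrounding test module provides make_recordbatch and global_context; the exact assertions in the PR's test suite may differ):

def test_print_cuda_chunked_array():
    arr, carr = make_chunked_array(n_elements_per_chunk=10, n_chunks=3)
    # Should not segfault; with this PR the device-backed chunked array is
    # copied (chunk by chunk) to the CPU before printing, so its repr
    # matches the CPU-backed equivalent.
    assert repr(carr) == repr(arr)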
This is still a bit verbose, but can be simplified once the "copy to device" functionality is exposed (should maybe do that first now)
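A hypothetical sketch of that simplification, assuming a "copy to device" method (here spelled copy_to, taking a device memory manager obtained from the CUDA context) gets exposed on Array; neither the name nor the signature is an existing API at the time of this comment:

def make_chunked_array(n_elements_per_chunk, n_chunks):
    # Build the CPU chunks first, then copy each chunk to the device in one
    # call instead of serializing/deserializing whole record batches.
    arrs = [make_recordbatch(n_elements_per_chunk)["f0"] for _ in range(n_chunks)]
    carrs = [arr.copy_to(global_context.memory_manager) for arr in arrs]  # assumed API
    return pa.chunked_array(arrs), pa.chunked_array(carrs)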
I still need to add C++ tests, but this is all working from testing it with pyarrow.
@github-actions crossbow submit test-cuda-python
Revision: 9ab291d Submitted crossbow builds: ursacomputing/crossbow @ actions-b0c4fb38d4
cpp/src/arrow/pretty_print.cc
Outdated
Result<std::shared_ptr<Array>> CopyStartEndToCPU(const Array& arr, int window) {
  std::shared_ptr<Array> arr_sliced;
  if (arr.length() > (2 * window + 1)) {
    ARROW_ASSIGN_OR_RAISE(auto arr_start,
                          arr.Slice(0, window + 1)->CopyTo(default_cpu_memory_manager()));
    ARROW_ASSIGN_OR_RAISE(
        auto arr_end,
        arr.Slice(arr.length() - window - 1)->CopyTo(default_cpu_memory_manager()));
    ARROW_ASSIGN_OR_RAISE(arr_sliced, Concatenate({arr_start, arr_end}));
  } else {
    ARROW_ASSIGN_OR_RAISE(arr_sliced, arr.CopyTo(default_cpu_memory_manager()));
@pitrou I am using the slice+copy approach above with the idea of only copying a small part of the full array to the CPU device. However, on second thought, I assume that this actually doesn't work as intended? The Slice is zero-copy (just adding offsets), and I assume the CopyTo just naively copies all buffers of the array in their entirety, not actually truncating and copying only the part of the buffer that is needed?
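A quick CPU-side illustration of the concern: slicing only adjusts offset/length, so the sliced array still references the parent's full buffers, and a buffer-by-buffer copy would move everything rather than just the requested values.

import pyarrow as pa

arr = pa.array(range(1_000_000))   # int64 -> ~8 MB values buffer
sliced = arr.slice(0, 5)

# The slice shares the parent's buffers; the values buffer is still the
# full allocation, not 5 values' worth of bytes.
assert sliced.buffers()[1].size == arr.buffers()[1].size
print(sliced.buffers()[1].size)    # 8_000_000 bytes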
Hmm, perhaps @felipecrv can answer these questions.
> CopyTo just naively copies all buffers of the array in its entirety, not actually truncating and copying only the part of the buffer that is needed?

Yes, because MemoryManagers work in terms of Buffers, not ranges. We recently added a function that can copy slices to CPU, but it falls back to a full buffer copy if the source MemoryManager is not a CPU memory manager.

The protocols for memory movement between devices might not allow for range-based copies (@zeroshade might know better than me), so instead of expanding the MemoryManager interface we should probably have a way to slice the array where it is (device-specific) and then copy the smaller array. (?)
Writing a slicer in CUDA might be a ton of work though.
@felipecrv why does it fall back to a full buffer copy if the source MemoryManager isn't a CPU memory manager? Couldn't it just use cuMemCpy with the appropriate offset into the buffer?
Oh, you're referring to slicing the array without copying the entirety of the buffers. Yea, that's a trickier proposition. The functions to do this already exist in libcudf, but we don't link against libcudf in libarrow_cuda (nor do I think we should); in theory we could write the necessary code, or at least figure it out by looking at how libcudf does it. For now, I think that the naive approach is likely fine until someone actively states they have an issue.
> For now, I think that the naive approach is likely fine until someone actively states they have an issue

So for the repr, there are basically two options for now:
- Don't print any actual data (e.g. in the Array repr, instead of including the data, add a general message like "xx values on xx device")
- Just do what I am doing here, acknowledging that the repr can therefore be costly (if this is a problem, the user can still avoid printing the object). In practice you will very often have chunked data, so for example with a table consisting of many smaller chunks, we will still only copy a couple of chunks (see the sketch below).

@zeroshade I take your comment as meaning you're fine with the second option?
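A small CPU-only sketch of why the second option stays cheap for chunked data: the pretty-printer only renders a limited window of values, so with the approach in this PR only the chunks falling inside that window would need to be copied off the device.

import pyarrow as pa

# 100 chunks of 10 values each; the repr truncates the output, so only the
# chunks covering the displayed window are actually rendered (and would
# need copying with this PR).
chunked = pa.chunked_array([pa.array(range(i * 10, (i + 1) * 10)) for i in range(100)])
print(chunked)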
And I'm fine with either.
I'm fine with either approach. As long as we're able to indicate the situation without crashing, we're good. Acknowledging that the repr can be costly is a good tradeoff
@github-actions crossbow submit test-cuda-python
Revision: 595ad89 Submitted crossbow builds: ursacomputing/crossbow @ actions-8eb46fb0e7
cpp/src/arrow/pretty_print.cc
Outdated
@@ -476,8 +479,7 @@ Status PrettyPrint(const ChunkedArray& chunked_arr, const PrettyPrintOptions& op
   } else {
     PrettyPrintOptions chunk_options = options;
     chunk_options.indent += options.indent_size;
-    ArrayPrinter printer(chunk_options, sink);
-    RETURN_NOT_OK(printer.Print(*chunked_arr.chunk(i)));
+    RETURN_NOT_OK(PrettyPrint(*chunked_arr.chunk(i), chunk_options, sink));
A cleaner (and more general) fix here would be changing Status Print(const Array& array) { ... } to handle this right before RETURN_NOT_OK(VisitArrayInline(array, this));
Where does this happen? Judging by the PR diff, it seems it is copying the entire contents, doesn't it?
Sorry, I should have updated the top description of the PR. I initially thought I was doing that (copying only the necessary parts), but that's not actually what copying a sliced array does; see the discussion in #42010 (comment)
I mean that the array isn't sliced at the point where you're copying it, is it?
I was doing that in the initial version (which the now-updated top post described), but since slicing doesn't reduce the amount of data being copied, I removed that (595ad89) (again, see the discussion at #42010 (comment))
Ahah, I see. We should open follow-up issues then, if not already done!
LGTM
@github-actions crossbow submit test-cuda-python
Opened #43055 as a follow-up and linked that from the top post (and will add a TODO mentioning that issue in a comment)
Revision: 1b64ccf Submitted crossbow builds: ursacomputing/crossbow @ actions-1a38c72745
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit e2b0de2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 38 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-41664: [C++][Python] PrettyPrint non-cpu data by copying to default CPU device (apache#42010)
Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Rationale for this change
The various Python reprs or the C++ PrettyPrint functions will currently just segfault when passed an object that has its data on a non-CPU device. In Python, getting a segfault while displaying an object is very annoying, so we should at least make this not crash.
What changes are included in this PR?
When we detect data on a non-CPU device passed to PrettyPrint, we copy the necessary part (the full Arrays for Array/RecordBatch, or only the chunks being printed for ChunkedArray/Table) to the default CPU device, and then use the existing print utilities as-is on this copied subset.
For large data this can potentially be costly (a lot of data gets copied, though you can always avoid that by not printing the object), but for chunked data we will still only copy those chunks of the full dataset needed to print the object.
Longer term, we should investigate if we can actually copy sliced arrays to a different device (with actual pruning of the buffers while copying): #43055
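As a usage illustration, a sketch modeled on the pyarrow CUDA test utilities used in this PR (it assumes a CUDA-enabled build; the serialize/read round-trip is just a convenient way to get data onto the device):

import pyarrow as pa
from pyarrow import cuda

ctx = cuda.Context(0)

# Put a record batch on the GPU via a serialize/read round-trip.
batch = pa.RecordBatch.from_pydict({"f0": list(range(10))})
cbuf = cuda.serialize_record_batch(batch, ctx)
cbatch = cuda.read_record_batch(cbuf, batch.schema)

# Previously these reprs segfaulted; with this change the data is copied to
# the default CPU device before pretty-printing.
print(cbatch)
print(cbatch["f0"])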
Are these changes tested?
Yes
Are there any user-facing changes?
No