
GH-41664: [C++][Python] PrettyPrint non-cpu data by copying to default CPU device #42010

Merged

Conversation


@jorisvandenbossche jorisvandenbossche commented Jun 6, 2024

Rationale for this change

The various Python reprs and the C++ PrettyPrint functions currently just segfault when passed an object that has its data on a non-CPU device. In Python, getting a segfault while displaying an object is especially annoying, so we should at least make this not crash.

What changes are included in this PR?

When we detect data on a non-CPU device passed to PrettyPrint, we copy the necessary part (the full Arrays for Array/RecordBatch, or the full chunks that are being printed for ChunkedArray/Table) to the default CPU device, and then use the existing print utilities as is on this copied subset.

For large data this can be costly, since a lot of data may be copied (but you can always avoid that by not printing the data); for chunked data we will still only copy the chunks of the full dataset that are needed to print the object.
Longer term, we should investigate whether we can actually copy sliced arrays to a different device (with actual pruning of the buffers while copying): #43055
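
A minimal sketch of the idea in C++ (not the code in this PR): route non-CPU data through a copy to the default CPU device before calling the existing PrettyPrint. Array::CopyTo and default_cpu_memory_manager appear in the PR's own excerpts below; the device_type() check and the helper name PrintAnyDevice are assumptions for illustration.

```cpp
#include <iostream>
#include <memory>

#include "arrow/api.h"
#include "arrow/device.h"
#include "arrow/pretty_print.h"

// Sketch only: print an array regardless of which device its buffers live on.
arrow::Status PrintAnyDevice(const std::shared_ptr<arrow::Array>& arr) {
  std::shared_ptr<arrow::Array> cpu_arr = arr;
  if (arr->device_type() != arrow::DeviceAllocationType::kCPU) {
    // Copies every buffer of the array to CPU memory; for large arrays this is
    // the potentially costly step discussed above.
    ARROW_ASSIGN_OR_RAISE(cpu_arr,
                          arr->CopyTo(arrow::default_cpu_memory_manager()));
  }
  // Reuse the existing CPU-only printing path on the copied data.
  return arrow::PrettyPrint(*cpu_arr, arrow::PrettyPrintOptions::Defaults(),
                            &std::cout);
}
```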

Are these changes tested?

Yes

Are there any user-facing changes?

No


github-actions bot commented Jun 6, 2024

⚠️ GitHub issue #41664 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche (Member Author)

@pitrou I started looking into this on the C++ side, but with the current approach of slice+copy+concat, I could also easily do this just on the Python side before passing the data to PrettyPrint. But do you think this is useful to keep on the C++ side?


pitrou commented Jun 6, 2024

Yes, it's certainly useful on the CPU side as well.

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review June 12, 2024 14:28
Comment on lines +813 to +823
def make_chunked_array(n_elements_per_chunk, n_chunks):
    arrs = []
    carrs = []
    for _ in range(n_chunks):
        batch = make_recordbatch(n_elements_per_chunk)
        cbuf = cuda.serialize_record_batch(batch, global_context)
        cbatch = cuda.read_record_batch(cbuf, batch.schema)
        arrs.append(batch["f0"])
        carrs.append(cbatch["f0"])

    return pa.chunked_array(arrs), pa.chunked_array(carrs)
Member Author

This is still a bit verbose, but can be simplified once the "copy to device" functionality is exposed (should maybe do that first now)

@jorisvandenbossche (Member Author)

I still need to add C++ tests, but this is all working from testing it with pyarrow.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 12, 2024
@jorisvandenbossche (Member Author)

@github-actions crossbow submit test-cuda-python

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 12, 2024

Revision: 9ab291d

Submitted crossbow builds: ursacomputing/crossbow @ actions-b0c4fb38d4

Task Status
test-cuda-python GitHub Actions

Comment on lines 417 to 427
Result<std::shared_ptr<Array>> CopyStartEndToCPU(const Array& arr, int window) {
  std::shared_ptr<Array> arr_sliced;
  if (arr.length() > (2 * window + 1)) {
    ARROW_ASSIGN_OR_RAISE(auto arr_start,
                          arr.Slice(0, window + 1)->CopyTo(default_cpu_memory_manager()));
    ARROW_ASSIGN_OR_RAISE(
        auto arr_end,
        arr.Slice(arr.length() - window - 1)->CopyTo(default_cpu_memory_manager()));
    ARROW_ASSIGN_OR_RAISE(arr_sliced, Concatenate({arr_start, arr_end}));
  } else {
    ARROW_ASSIGN_OR_RAISE(arr_sliced, arr.CopyTo(default_cpu_memory_manager()));
Member Author

@pitrou I am using the slice+copy approach above with the idea of only copying a small part of the full array to the CPU device. However, on second thought, I assume this doesn't actually work as intended? The Slice is zero-copy (just adjusting offsets), and I assume CopyTo just naively copies all buffers of the array in their entirety, without truncating and copying only the part of each buffer that is needed?

Member

Hmm, perhaps @felipecrv can answer these questions.

Contributor

CopyTo just naively copies all buffers of the array in its entirety, not actually truncating and copying only the part of the buffer that is needed?

Yes, because MemoryManagers work in terms of Buffers, not ranges. We recently added a function that can copy slices to CPU, but it falls back to full buffer copy if the source MemoryManager is not a CPU memory manager.

The protocols for memory movement between devices might not allow for range-based copies (@zeroshade might know better than me), so instead of expanding the MemoryManager interface we should probably have a way to slice the array where it is (device-specific) and then copy the smaller array. (?)
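
To illustrate the point being discussed, a small sketch (not code from this PR; the helper name and the fixed window argument are invented): Slice only adjusts offset and length, so copying the slice still transfers the full buffers.

```cpp
#include <memory>

#include "arrow/api.h"
#include "arrow/device.h"

// Sketch: copy the first `window` elements of a non-CPU array to CPU memory.
arrow::Result<std::shared_ptr<arrow::Array>> CopyHeadToCpu(
    const std::shared_ptr<arrow::Array>& device_arr, int64_t window) {
  // Slice() is zero-copy: it only adjusts offset/length and keeps referencing
  // the same device buffers as device_arr.
  std::shared_ptr<arrow::Array> head = device_arr->Slice(0, window);
  // CopyTo() works buffer by buffer, so each buffer referenced by the slice is
  // copied in its entirety; the transfer is as large as copying device_arr itself.
  return head->CopyTo(arrow::default_cpu_memory_manager());
}
```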

Contributor

Writing a slicer in CUDA might be a ton of work though.

Member

@felipecrv why does it fall back to a full buffer copy if the source MemoryManager isn't a CPU memory manager? Couldn't it just use cuMemcpy with the appropriate offset into the buffer?

Member

Oh, you're referring to slicing the array without copying the entirety of the buffers. Yeah, that's a trickier proposition. The functions to do this already exist in libcudf, but we don't link against libcudf in libarrow_cuda (nor do I think we should). In theory we could write the necessary code, or at least figure it out by looking at how libcudf does it. For now, I think the naive approach is likely fine until someone actively reports that they have an issue.

@jorisvandenbossche (Member Author) Jun 19, 2024

For now, I think that the naive approach is likely fine until someone actively states they have an issue

So for the repr, there are basically two options for now:

  • Don't print any actual data (e.g. in the Array repr, instead of including the data, add a generic message like "xx values on xx device")
  • Just do what I am doing here, acknowledging that the repr can therefore be costly (if this is a problem, the user can still avoid printing the object). In practice you will very often have chunked data, so for example with a table consisting of many smaller chunks, we will still only copy a couple of chunks.

@zeroshade I take your comment as being fine with the second option?

Contributor

And I'm fine with either.

Member

I'm fine with either approach. As long as we're able to indicate the situation without crashing, we're good. Acknowledging that the repr can be costly is a good tradeoff

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 13, 2024
@jorisvandenbossche jorisvandenbossche changed the title GH-41664: [C++][Python] PrettyPrint non-cpu data by copying slice to default CPU device GH-41664: [C++][Python] PrettyPrint non-cpu data by copying to default CPU device Jun 20, 2024
@jorisvandenbossche (Member Author)

@github-actions crossbow submit test-cuda-python

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 20, 2024

Revision: 595ad89

Submitted crossbow builds: ursacomputing/crossbow @ actions-8eb46fb0e7

Task Status
test-cuda-python GitHub Actions

@@ -476,8 +479,7 @@ Status PrettyPrint(const ChunkedArray& chunked_arr, const PrettyPrintOptions& op
   } else {
     PrettyPrintOptions chunk_options = options;
     chunk_options.indent += options.indent_size;
-    ArrayPrinter printer(chunk_options, sink);
-    RETURN_NOT_OK(printer.Print(*chunked_arr.chunk(i)));
+    RETURN_NOT_OK(PrettyPrint(*chunked_arr.chunk(i), chunk_options, sink));
@felipecrv (Contributor) Jun 21, 2024

A cleaner (and more general) fix here would be to change Status Print(const Array& array) { ... }, right before the RETURN_NOT_OK(VisitArrayInline(array, this)); call.
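
A rough sketch of what that suggestion could look like (hypothetical: the real ArrayPrinter::Print does more than this, and the device_type() check is assumed rather than taken from this PR):

```cpp
// Hypothetical shape of the suggested fix inside ArrayPrinter (not the code
// that was merged): copy to CPU right before visiting the array.
Status Print(const Array& array) {
  if (array.device_type() != DeviceAllocationType::kCPU) {
    ARROW_ASSIGN_OR_RAISE(auto cpu_array,
                          array.CopyTo(default_cpu_memory_manager()));
    RETURN_NOT_OK(VisitArrayInline(*cpu_array, this));
  } else {
    RETURN_NOT_OK(VisitArrayInline(array, this));
  }
  // ... rest of the original Print() unchanged ...
  return Status::OK();
}
```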


pitrou commented Jun 25, 2024

we copy the necessary part (window size of start and end of array) to the default CPU device

Where does this happen? Judging by the PR diff, it seems it is copying the entire contents, doesn't it?

@jorisvandenbossche (Member Author)

Judging by the PR diff, it seems it is copying the entire contents, doesn't it?

Sorry, I should have updated the top description of the PR. I initially thought I was doing that (copying only the necessary parts), but that's not actually what copying a sliced array does, see the discussion in #42010 (comment)
(will update the top post now)


pitrou commented Jun 25, 2024

I mean that the array isn't sliced at the point where you're copying it, is it?

@jorisvandenbossche (Member Author)

I was doing that in the initial version (which the now-updated top post described), but since slicing doesn't reduce the amount of data being copied, I removed it (595ad89) (again, see the discussion at #42010 (comment)).


pitrou commented Jun 25, 2024

Ahah, I see. We should open followup issues then, if not already done!

@felipecrv (Contributor) left a comment

LGTM

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 25, 2024
@jorisvandenbossche (Member Author)

@github-actions crossbow submit test-cuda-python

@jorisvandenbossche (Member Author)

Opened #43055 as a follow-up and linked that from the top post (and will add a TODO mentioning that issue in a comment)


Revision: 1b64ccf

Submitted crossbow builds: ursacomputing/crossbow @ actions-1a38c72745

Task Status
test-cuda-python GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 26, 2024
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 26, 2024
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 26, 2024
@jorisvandenbossche jorisvandenbossche merged commit e2b0de2 into apache:main Jun 26, 2024
35 of 39 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Jun 26, 2024
@jorisvandenbossche jorisvandenbossche deleted the gh-41664-pretty-print-non-cpu branch June 26, 2024 15:33

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit e2b0de2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 38 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Jul 9, 2024
…default CPU device (apache#42010)

* GitHub Issue: apache#41664

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>