GH-41664: [C++][Python] PrettyPrint non-cpu data by copying to default CPU device #42010
Conversation
@pitrou I started looking into this on the C++ side, but with the current approach of slice+copy+concat, I could also easily do this just on the Python side before passing the data to
Yes, it's certainly useful on the CPU side as well.
def make_chunked_array(n_elements_per_chunk, n_chunks):
    arrs = []
    carrs = []
    for _ in range(n_chunks):
        batch = make_recordbatch(n_elements_per_chunk)
        cbuf = cuda.serialize_record_batch(batch, global_context)
        cbatch = cuda.read_record_batch(cbuf, batch.schema)
        arrs.append(batch["f0"])
        carrs.append(cbatch["f0"])

    return pa.chunked_array(arrs), pa.chunked_array(carrs)
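For context, a sketch of how a test might exercise this helper (assuming the surrounding test module provides make_recordbatch and global_context; the exact assertions in the PR's test suite may differ):

def test_print_cuda_chunked_array():
    arr, carr = make_chunked_array(n_elements_per_chunk=10, n_chunks=3)
    # Should not segfault; with this PR the device-backed chunked array is
    # copied (chunk by chunk) to the CPU before printing, so its repr
    # matches the CPU-backed equivalent.
    assert repr(carr) == repr(arr)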
This is still a bit verbose, but can be simplified once the "copy to device" functionality is exposed (should maybe do that first now)
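A hypothetical sketch of that simplification, assuming a "copy to device" method (here spelled copy_to, taking a device memory manager obtained from the CUDA context) gets exposed on Array; neither the name nor the signature is an existing API at the time of this comment:

def make_chunked_array(n_elements_per_chunk, n_chunks):
    # Build the CPU chunks first, then copy each chunk to the device in one
    # call instead of serializing/deserializing whole record batches.
    arrs = [make_recordbatch(n_elements_per_chunk)["f0"] for _ in range(n_chunks)]
    carrs = [arr.copy_to(global_context.memory_manager) for arr in arrs]  # assumed API
    return pa.chunked_array(arrs), pa.chunked_array(carrs)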
I still need to add C++ tests, but this is all working from testing it with pyarrow.
@github-actions crossbow submit test-cuda-python
Revision: 9ab291d Submitted crossbow builds: ursacomputing/crossbow @ actions-b0c4fb38d4
cpp/src/arrow/pretty_print.cc
Outdated
Result<std::shared_ptr<Array>> CopyStartEndToCPU(const Array& arr, int window) {
  std::shared_ptr<Array> arr_sliced;
  if (arr.length() > (2 * window + 1)) {
    ARROW_ASSIGN_OR_RAISE(auto arr_start,
                          arr.Slice(0, window + 1)->CopyTo(default_cpu_memory_manager()));
    ARROW_ASSIGN_OR_RAISE(
        auto arr_end,
        arr.Slice(arr.length() - window - 1)->CopyTo(default_cpu_memory_manager()));
    ARROW_ASSIGN_OR_RAISE(arr_sliced, Concatenate({arr_start, arr_end}));
  } else {
    ARROW_ASSIGN_OR_RAISE(arr_sliced, arr.CopyTo(default_cpu_memory_manager()));
@pitrou I am using the slice+copy approach above with the idea of only copying a small part of the full array to the CPU device. However, on second thought, I assume that this actually doesn't work as intended? The Slice is zero-copy (just adding offsets), and I assume the CopyTo just naively copies all buffers of the array in their entirety, not actually truncating and copying only the part of the buffer that is needed?
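A quick CPU-side illustration of the concern: slicing only adjusts offset/length, so the sliced array still references the parent's full buffers, and a buffer-by-buffer copy would move everything rather than just the requested values.

import pyarrow as pa

arr = pa.array(range(1_000_000))   # int64 -> ~8 MB values buffer
sliced = arr.slice(0, 5)

# The slice shares the parent's buffers; the values buffer is still the
# full allocation, not 5 values' worth of bytes.
assert sliced.buffers()[1].size == arr.buffers()[1].size
print(sliced.buffers()[1].size)    # 8_000_000 bytes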
Hmm, perhaps @felipecrv can answer these questions.
> CopyTo just naively copies all buffers of the array in its entirety, not actually truncating and copying only the part of the buffer that is needed?

Yes, because MemoryManagers work in terms of Buffers, not ranges. We recently added a function that can copy slices to CPU, but it falls back to a full buffer copy if the source MemoryManager is not a CPU memory manager.

The protocols for memory movement between devices might not allow for range-based copies (@zeroshade might know better than me), so instead of expanding the MemoryManager interface we should probably have a way to slice the array where it is (device-specific) and then copy the smaller array. (?)
Writing a slicer in CUDA might be a ton of work though.
@felipecrv why does it fall back to a full buffer copy if the source MemoryManager isn't a CPU memory manager? Couldn't it just use cuMemCpy with the appropriate offset into the buffer?
Oh, you're referring to slicing the array without copying the entirety of the buffers. Yea, that's a trickier proposition. The functions to do this already exist in libcudf, but we don't link against libcudf in libarrow_cuda (nor do I think we should); in theory we could write the necessary code, or at least figure it out by looking at how libcudf does it. For now, I think that the naive approach is likely fine until someone actively states they have an issue.
> For now, I think that the naive approach is likely fine until someone actively states they have an issue

So for the repr, there are basically two options for now:
- Don't print any actual data (e.g. in the Array repr, instead of including the data, add a general message like "xx values on xx device")
- Just do what I am doing here, acknowledging that the repr can therefore be costly (if this is a problem, the user can still avoid printing the object). In practice you will very often have chunked data, so for example with a table consisting of many smaller chunks, we will still only copy a couple of chunks (see the sketch below).

@zeroshade I take your comment as meaning you're fine with the second option?
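A small CPU-only sketch of why the second option stays cheap for chunked data: the pretty-printer only renders a limited window of values, so with the approach in this PR only the chunks falling inside that window would need to be copied off the device.

import pyarrow as pa

# 100 chunks of 10 values each; the repr truncates the output, so only the
# chunks covering the displayed window are actually rendered (and would
# need copying with this PR).
chunked = pa.chunked_array([pa.array(range(i * 10, (i + 1) * 10)) for i in range(100)])
print(chunked)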
And I'm fine with either.
I'm fine with either approach. As long as we're able to indicate the situation without crashing, we're good. Acknowledging that the repr can be costly is a good tradeoff
@github-actions crossbow submit test-cuda-python
Revision: 595ad89 Submitted crossbow builds: ursacomputing/crossbow @ actions-8eb46fb0e7
cpp/src/arrow/pretty_print.cc
Outdated
@@ -476,8 +479,7 @@ Status PrettyPrint(const ChunkedArray& chunked_arr, const PrettyPrintOptions& op
   } else {
     PrettyPrintOptions chunk_options = options;
     chunk_options.indent += options.indent_size;
-    ArrayPrinter printer(chunk_options, sink);
-    RETURN_NOT_OK(printer.Print(*chunked_arr.chunk(i)));
+    RETURN_NOT_OK(PrettyPrint(*chunked_arr.chunk(i), chunk_options, sink));
A cleaner (and more general) fix here would be changing Status Print(const Array& array) { ... } to handle this right before RETURN_NOT_OK(VisitArrayInline(array, this));
Where does this happen? Judging by the PR diff, it seems it is copying the entire contents, doesn't it?
Sorry, I should have updated the top description of the PR. I initially thought I was doing that (copying only the necessary parts), but that's not actually what copying a sliced array does; see the discussion in #42010 (comment)
I mean that the array isn't sliced at the point where you're copying it, is it?
I was doing that in the initial version (which the now-updated top post described), but since slicing doesn't reduce the amount of data being copied, I removed that (595ad89) (again, see the discussion at #42010 (comment))
Ahah, I see. We should open follow-up issues then, if not already done!
LGTM
@github-actions crossbow submit test-cuda-python
Opened #43055 as a follow-up and linked that from the top post (and will add a TODO mentioning that issue in a comment)
Revision: 1b64ccf Submitted crossbow builds: ursacomputing/crossbow @ actions-1a38c72745
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit e2b0de2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 38 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-41664: [C++][Python] PrettyPrint non-cpu data by copying to default CPU device (apache#42010)
Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Rationale for this change
The various Python reprs or the C++ PrettyPrint functions will currently just segfault when passed an object that has its data on a non-CPU device. In Python, getting a segfault while displaying an object is very annoying, so we should at least make this not crash.
What changes are included in this PR?
When we detect data on a non-CPU device passed to PrettyPrint, we copy the necessary part (the full Arrays for Array/RecordBatch, or only the chunks being printed for ChunkedArray/Table) to the default CPU device, and then use the existing print utilities as-is on this copied subset.
For large data this can potentially be costly (a lot of data gets copied, though you can always avoid that by not printing the object), but for chunked data we will still only copy those chunks of the full dataset needed to print the object.
Longer term, we should investigate if we can actually copy sliced arrays to a different device (with actual pruning of the buffers while copying): #43055
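As a usage illustration, a sketch modeled on the pyarrow CUDA test utilities used in this PR (it assumes a CUDA-enabled build; the serialize/read round-trip is just a convenient way to get data onto the device):

import pyarrow as pa
from pyarrow import cuda

ctx = cuda.Context(0)

# Put a record batch on the GPU via a serialize/read round-trip.
batch = pa.RecordBatch.from_pydict({"f0": list(range(10))})
cbuf = cuda.serialize_record_batch(batch, ctx)
cbatch = cuda.read_record_batch(cbuf, batch.schema)

# Previously these reprs segfaulted; with this change the data is copied to
# the default CPU device before pretty-printing.
print(cbatch)
print(cbatch["f0"])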
Are these changes tested?
Yes
Are there any user-facing changes?
No