Add retry with SpillableHostBuffer to GPU scans #12233
base: branch-25.04
Conversation
Signed-off-by: Firestarman <[email protected]>
related to #8890
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SpillableColumnarBatch.scala
@@ -1134,7 +1139,10 @@ abstract class MultiFileCoalescingPartitionReaderBase(
   logDebug(s"$getFileFormatShortName Coalescing reading estimates the initTotalSize:" +
     s" $initTotalSize, and the true size: $finalBufferSize")
   withResource(finalBuffer) { _ =>
-    finalBuffer.slice(0, finalBufferSize)
+    SpillableHostBuffer(
Not directly related to this PR, but it reminds me that, generally speaking, a SpillableHostBuffer over a sliced HostMemoryBuffer could be buggy in some cases, IMHO @abellina @revans2. Say I originally have a HostMemoryBuffer x, and then another buffer y is sliced from x. When y is spilled and read back from disk, y no longer shares the underlying buffer with x, right? If the code assumes that any content written to x is visible through y, then the spill framework would introduce a bug. Should we make sliced HostMemoryBuffers non-spillable at all?
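To illustrate the aliasing concern in isolation, here is a minimal sketch using java.nio.ByteBuffer as a stand-in for HostMemoryBuffer (the names SliceAliasingDemo and spillAndRestore are hypothetical, not part of spark-rapids): a slice shares storage with its parent, but once its contents are copied out and "restored" from a copy, writes to the parent are no longer visible through it.

```scala
import java.nio.ByteBuffer

object SliceAliasingDemo {
  // Simulate "spill to disk and read back": the restored buffer is a fresh
  // copy, so it no longer shares storage with the original parent buffer.
  def spillAndRestore(buf: ByteBuffer): ByteBuffer = {
    val copy = new Array[Byte](buf.remaining())
    buf.duplicate().get(copy)
    ByteBuffer.wrap(copy)
  }

  def main(args: Array[String]): Unit = {
    val x = ByteBuffer.allocate(8)
    val y = x.slice() // y aliases x, like a sliced HostMemoryBuffer

    x.put(0, 42.toByte)
    println(y.get(0)) // 42: writes to x are visible through the live slice y

    val yRestored = spillAndRestore(y)
    x.put(0, 7.toByte)
    println(yRestored.get(0)) // still 42: the restored copy no longer aliases x
  }
}
```

This is exactly the behavior change the comment worries about: code that relied on the live-slice semantics in the first println would silently break after a spill/restore cycle.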
But marking sliced HostMemoryBuffers non-spillable has side effects, too. It would essentially make the original HostMemoryBuffer x unspillable as well, because x always has an outstanding reference coming from the sliced HostMemoryBuffer.
Also, SpillableHostBuffer on a sliced HostMemoryBuffer would mistakenly increase the totalSize of SpillableHostStore:
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/spill/SpillFramework.scala (line 1363 in b1e0fd6):
def trackNoSpill(handle: HostSpillableHandle[_]): Unit = {
In this case, we are slicing only to "right size" the buffer. The alternatives are to memcpy it out to another right-sized buffer, to create a spillable buffer that knows about the slice, or to track the "right size" alongside the buffer for later slicing.
In terms of tracking the memory, we'll be undercounting it, because we track 0->finalBufferSize rather than the whole length. We will let it go and spill it, because the spill framework holds the only reference. When we bring it back after a spill it will be of size finalBufferSize, not the original, but I think this is OK because we didn't mean to keep the original.
So I think the main issue is the undercounting.
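To make the undercounting concrete, here is a tiny sketch with hypothetical sizes (UndercountDemo and its numbers are illustrative, not from the PR): the store tracks only the sliced length, while the whole parent allocation stays resident until the slice is actually spilled.

```scala
object UndercountDemo {
  // Bytes the store fails to account for while the parent allocation
  // is still alive: it tracks the slice, but holds the whole buffer.
  def undercount(parentLen: Long, slicedLen: Long): Long =
    parentLen - slicedLen

  def main(args: Array[String]): Unit = {
    val parentLen = 1024L * 1024      // full HostMemoryBuffer allocation
    val finalBufferSize = 700L * 1024 // "right sized" slice: 0 -> finalBufferSize
    // 1 MiB held, only 700 KiB tracked: 331776 bytes are unaccounted for
    println(undercount(parentLen, finalBufferSize)) // prints 331776
  }
}
```

Once the handle is spilled and restored, the restored buffer really is finalBufferSize long, so the undercount only exists for the window between allocation and spill.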
Thx for the discussion. So, should I make any update to fix this "undercounting"?
@abellina For this case I think it's okay to keep the code as it is, because we can assume finalBufferSize is close to totalBufferSize. But I brought up this question to discuss the general case.
OK, I fully get what is going on here. I will resolve this by tracking the "right size" alongside the buffer for later slicing.
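A minimal sketch of that approach, under stated assumptions (RightSizedBuffer and its fields are hypothetical names, not the actual PR code): instead of slicing up front, keep the whole spillable buffer plus the valid length, and slice only when the data is materialized.

```scala
// Hypothetical holder: the whole (over-allocated) buffer stays spillable,
// and validLen records how many leading bytes actually contain data.
final case class RightSizedBuffer(data: Array[Byte], validLen: Int) {
  require(validLen >= 0 && validLen <= data.length)

  // Materialize just the valid prefix. In the real fix this would be
  // something like hmb.slice(0, validLen) on the buffer handed back by
  // the spill framework, so no long-lived slice ever needs to be tracked.
  def materialize: Array[Byte] = data.take(validLen)
}

object RightSizedDemo {
  def main(args: Array[String]): Unit = {
    val whole = Array.tabulate[Byte](10)(_.toByte) // over-allocated buffer
    val rs = RightSizedBuffer(whole, validLen = 6)
    println(rs.materialize.length) // prints 6
  }
}
```

The design benefit is that the spill framework only ever tracks the one whole allocation, so neither the aliasing nor the undercounting issue from the discussion above can arise.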
Updated
Signed-off-by: Firestarman <[email protected]>
0eea80e
build
Contributes to #8890.
This PR adds retry support with SpillableHostBuffer to GPU scans when reading data from files into host memory, to try to resolve some CPU OOMs in GPU scans reported from customer queries. Currently these host memory allocations run without the protection of the retry framework. The OOM error looks like the one below.
This change can be covered by existing tests.
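To sketch the retry idea generically (this is not the spark-rapids retry API; RetrySketch and withRetry are hypothetical names): attempt the host allocation, and if it fails with an OOM, give the framework a chance to free spillable host memory before trying again.

```scala
object RetrySketch {
  // Generic retry loop: re-run `body` up to maxAttempts times when it
  // throws an OOM. In the real framework, each retry is preceded by
  // spilling SpillableHostBuffers to free host memory.
  def withRetry[T](maxAttempts: Int)(body: => T): T = {
    var attempt = 0
    while (true) {
      attempt += 1
      try {
        return body
      } catch {
        case _: OutOfMemoryError if attempt < maxAttempts =>
          // Placeholder for "spill host buffers, then retry".
      }
    }
    throw new IllegalStateException("unreachable")
  }

  def main(args: Array[String]): Unit = {
    var failuresLeft = 2 // simulate two transient host OOMs
    val result = withRetry(5) {
      if (failuresLeft > 0) {
        failuresLeft -= 1
        throw new OutOfMemoryError("Could not allocate native memory")
      }
      42
    }
    println(result) // prints 42 after two simulated OOM retries
  }
}
```

The point of wrapping the scan-side host allocations this way is that an OOM becomes a recoverable event (spill, then retry) instead of a task failure.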