We recently saw what looks like a CUDA OOM (not the classic RMM OOM) with the stack below.
It looks like the exclusive_scan_by_key kernel failed to launch. This suggests we have a job where our reserve memory (640MB by default, configured by spark.rapids.memory.gpu.reserve) is either not being respected or is not large enough for all of the kernel launches CUDA needs. We think we could have synchronized to force the ASYNC allocator back within its thresholds, but we are not sure what the guarantees are. We also need to reproduce this independently so we can handle and document it properly.
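As a rough sketch of the synchronize-and-retry idea (a hypothetical helper, not something the plugin does today): after a cudaErrorMemoryAllocation surfaces from the ASYNC allocator, a device synchronize should let the driver process pending frees on all streams, which may return the pool to within its release threshold before the allocation is retried. Whether that is guaranteed to reclaim enough memory is exactly the open question above.

```scala
import ai.rapids.cudf.{Cuda, CudfException}

object AsyncOomRetry {
  /**
   * Hypothetical sketch: run `op`, and on a CUDA OOM surfaced as a
   * CudfException, synchronize the device so the cudaMallocAsync pool can
   * process any pending frees, then retry once. Note there is no documented
   * guarantee that the synchronize returns the pool to within its thresholds.
   */
  def withAsyncOomRetry[T](op: => T): T = {
    try {
      op
    } catch {
      case e: CudfException if e.getMessage.contains("cudaErrorMemoryAllocation") =>
        // Let outstanding work and deferred frees complete on all streams.
        Cuda.deviceSynchronize()
        op // retry once; rethrows if it still fails
    }
  }
}
```

The logs and stack from the failure: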
24/08/01 23:01:20 INFO DeviceMemoryEventHandler: Device allocation of 60625520 bytes failed, device store has 5003256884 total and 1560204129 spillable bytes. First attempt. Total RMM allocated is 9108818176 bytes.
24/08/01 23:01:20 WARN RapidsDeviceMemoryStore: Targeting a device memory size of 1499578609. Current total 5003256884. Current spillable 1560204129
24/08/01 23:01:20 WARN RapidsDeviceMemoryStore: device memory store spilling to reduce usage from 5003256884 total (1560204129 spillable) to 1499578609 bytes
24/08/01 23:01:20 ERROR DeviceMemoryEventHandler: Error handling allocation failure
ai.rapids.cudf.CudfException: after dispatching exclusive_scan_by_key kernel: cudaErrorMemoryAllocation: out of memory
at ai.rapids.cudf.Table.makeChunkedPack(Native Method)
at ai.rapids.cudf.Table.makeChunkedPack(Table.java:2672)
at com.nvidia.spark.rapids.ChunkedPacker.$anonfun$chunkedPack$1(RapidsBuffer.scala:97)
at scala.Option.flatMap(Option.scala:271)
at com.nvidia.spark.rapids.ChunkedPacker.liftedTree1$1(RapidsBuffer.scala:96)
at com.nvidia.spark.rapids.ChunkedPacker.<init>(RapidsBuffer.scala:95)
at com.nvidia.spark.rapids.RapidsDeviceMemoryStore$RapidsTable.makeChunkedPacker(RapidsDeviceMemoryStore.scala:272)
at com.nvidia.spark.rapids.RapidsBufferCopyIterator.<init>(RapidsBuffer.scala:180)
at com.nvidia.spark.rapids.RapidsBuffer.getCopyIterator(RapidsBuffer.scala:248)
at com.nvidia.spark.rapids.RapidsBuffer.getCopyIterator$(RapidsBuffer.scala:247)
at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.getCopyIterator(RapidsBufferStore.scala:405)
at com.nvidia.spark.rapids.RapidsHostMemoryStore.createBuffer(RapidsHostMemoryStore.scala:133)
at com.nvidia.spark.rapids.RapidsBufferStore.copyBuffer(RapidsBufferStore.scala:224)
at com.nvidia.spark.rapids.RapidsBufferStore.spillBuffer(RapidsBufferStore.scala:374)
at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$4(RapidsBufferStore.scala:311)
at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$4$adapted(RapidsBufferStore.scala:304)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$2(RapidsBufferStore.scala:304)
at com.nvidia.spark.rapids.RapidsBufferStore.$anonfun$synchronousSpill$2$adapted(RapidsBufferStore.scala:290)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.RapidsBufferStore.synchronousSpill(RapidsBufferStore.scala:290)
at com.nvidia.spark.rapids.RapidsBufferCatalog.synchronousSpill(RapidsBufferCatalog.scala:614)
at com.nvidia.spark.rapids.DeviceMemoryEventHandler.onAllocFailure(DeviceMemoryEventHandler.scala:154)
at ai.rapids.cudf.Table.groupByAggregate(Native Method)
at ai.rapids.cudf.Table.access$3300(Table.java:41)
at ai.rapids.cudf.Table$GroupByOperation.aggregate(Table.java:4099)
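Until this is understood, one possible mitigation to experiment with is raising the reserve so more device memory stays outside the RMM pool for CUDA's own needs (kernel launches, library workspaces, etc.). The 1 GiB value below is only an illustrative guess, not a recommendation:

```scala
import org.apache.spark.SparkConf

// Example only: leave more headroom outside the RMM pool for CUDA's own
// allocations. 1 GiB here is an arbitrary illustrative value, up from the
// 640MB default noted above.
val conf = new SparkConf()
  .set("spark.rapids.memory.gpu.reserve", (1024L * 1024 * 1024).toString)
```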