an extension that adds a device side abort function #808

Open
wants to merge 3 commits into
base: main

pjaaskel
Contributor

@pjaaskel pjaaskel commented Jun 21, 2022

With the device-side abort extension, a work-item (WI) can return from kernel execution at any point and cause abnormal, unrecoverable termination of the host process.

This extension differs from the cl_arm_controlled_kernel_termination extension in that a device-side abort is expected to behave like a call to POSIX abort() on the host side, immediately terminating the host process as well.

The extension is meant to easily support POSIX abort()-like functionality on the device side, as well as serve as a basis for CUDA/HIP-style assertions.

An example CPU implementation in PoCL: parmance/pocl@d5e88c7

State explicitly that a compliant CPU device implementation can call
abort() directly. Clarified the control flow behavior.
@Kerilk
Contributor

Kerilk commented Jul 12, 2022

Here is the accompanying SPIR-V extension:
KhronosGroup/SPIRV-Registry#149

@bashbaug
Contributor

Discussed in the September 20th teleconference:

  • Use-case: similar to __trap() in CUDA.
  • There are several similar proposals and it would be nice to consolidate into an EXT extension (this PR).
  • There is a related SPIR-V extension (linked above) but it has not been implemented (yet).
  • As written, the extension will terminate the entire process when a device kernel calls the abort function.
  • There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?
  • For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

@pjaaskel
Contributor Author

> There could be use-cases that aren't as catastrophic. Do we want to support recovery, and if so, how recoverable can we be?

As I suggested in the call, I think these are two different use cases that could call for separate extensions, so that supporting only one of them does not become too difficult. The primary use case for this simple one is to serve as part of an assert() implementation: to allow easier porting (even automated migration) of host functions that contain assert() or abort() calls to device-side code.
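To make the assert() use case concrete, here is a minimal host-side sketch in plain C (not OpenCL C, and not the extension's actual API). It assumes that a ported assert reduces to "report, then call abort()". The names `PORTED_ASSERT`, `work_item`, and `run_and_check_abort` are hypothetical and exist only for illustration. The "kernel" runs in a forked child so that the parent can observe the abnormal termination instead of dying with it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* for callers that check the result */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical assert-style macro: report the failed condition, then
 * call abort(). In device code, abort() would be whatever function the
 * extension ends up providing (an assumption, not confirmed here). */
#define PORTED_ASSERT(cond)                                      \
    do {                                                         \
        if (!(cond)) {                                           \
            fprintf(stderr, "%s:%d: assertion failed: %s\n",     \
                    __FILE__, __LINE__, #cond);                  \
            abort();                                             \
        }                                                        \
    } while (0)

/* Stand-in for a kernel body: one "work-item" validates its input. */
static void work_item(int global_id, const int *data)
{
    PORTED_ASSERT(data[global_id] >= 0);
}

/* Run all work-items in a child process.
 * Returns 1 if the child was killed by SIGABRT, 0 if it exited
 * normally with status 0, and -1 on any other outcome. */
static int run_and_check_abort(const int *data, int n)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (int i = 0; i < n; ++i)
            work_item(i, data);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

With input `{1, 2, 3}` the child exits normally; with `{1, -2, 3}` the second work-item trips the macro and the child dies from SIGABRT, which is the all-or-nothing behavior this simple extension aims for.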

> For example, how does calling abort affect other work-items or work-groups that may be executing? How does calling abort affect other commands in the command-queue that may be dependent on the aborted command?

In this extension, the expected behavior for commands is the same as with any multithreaded program where one thread calls the standard abort(). Other parallel threads (commands) might have proceeded further or not, but the end result is either catching SIGABRT or brutally killing the process along with its threads (in this case also device/GPU threads).
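The "catching SIGABRT or killing the process" outcome can be illustrated with a small host-only C sketch. It relies on the POSIX rule that abort() still terminates the process when a SIGABRT handler returns normally. The names `on_sigabrt`, `notify_fd`, and `abort_terminates_despite_handler` are hypothetical; nothing here touches OpenCL or a device.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* for callers that check the result */
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int notify_fd = -1;  /* write end of the pipe, used by the handler */

/* Async-signal-safe handler: it only calls write(). */
static void on_sigabrt(int signo)
{
    (void)signo;
    const char mark = 'A';
    ssize_t r = write(notify_fd, &mark, 1);
    (void)r;
}

/* Returns 1 if the child ran its SIGABRT handler and was still
 * terminated by SIGABRT, 0 if not, and -1 on setup failure. */
static int abort_terminates_despite_handler(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: install the handler, then abort. */
        close(fds[0]);
        notify_fd = fds[1];
        struct sigaction sa;
        sa.sa_handler = on_sigabrt;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGABRT, &sa, NULL);
        abort();  /* the handler runs first, then the process still dies */
    }

    /* Parent: wait for the handler's marker, then reap the child. */
    close(fds[1]);
    char mark = 0;
    ssize_t got = read(fds[0], &mark, 1);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    int handler_ran = (got == 1 && mark == 'A');
    int killed = WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
    return handler_ran && killed;
}
```

A handler therefore gives the host a last chance to log or flush state, but it does not turn the abort into a recoverable event. That matches the non-recoverable semantics described for this extension.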

I tried to describe the WI semantics in the last paragraph: https://github.com/KhronosGroup/OpenCL-Docs/pull/808/files#diff-149e893d23663ca01188af6d03a0ebd77bae5776abefe7ec6b063e4bbd88212fR103


3 participants