[Issue]: Assertion Error: "Can't claim the queue is finished with the active batch!" #148

Open
Line-fr opened this issue Mar 20, 2025 · 2 comments

Line-fr commented Mar 20, 2025

Problem Description

I am here to report an error that @mikesulsenti has been hitting consistently.
I had already seen this error occur randomly for another user. Here is the error:

Image

Here is some context. The error happens while using

https://codeberg.org/Kosaka/ssimulacrapy

with backend Vship

https://github.com/Line-fr/Vship

which itself uses VapourSynth. The issue clearly arises in Vship, in the part that computes SSIMU2 (check src/ssimu2/main.hpp).

As the dev of Vship, I can provide a bit more detail about its inner workings to help solve the issue:

VapourSynth launches multiple threads, and each thread gets a hipStream_t associated with it.
Vship launches every command asynchronously, except hipMalloc, hipFree and an event synchronization at the end to retrieve the score for a given frame.
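
For readers who want to picture this submission pattern, here is a rough CPU-only sketch in plain C++. It is not Vship code, it makes no HIP calls, and it is not a reproducer for the assertion. `FakeStream`, `scoreFrame` and `scoreAllFrames` are hypothetical names made up for illustration, and the "score" is just a sum over a dummy buffer. The comments only indicate which HIP call each step is meant to stand in for.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical CPU stand-in for a hipStream_t: tasks run in submission order
// on a private worker thread, and enqueue() returns immediately.
class FakeStream {
public:
    FakeStream() : worker_([this] { run(); }) {}
    ~FakeStream() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // drains any remaining tasks, then stops
    }
    // Asynchronous submission (stands in for a kernel launch or hipMemcpyAsync).
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
    // Stands in for hipEventRecord: the future is ready once all earlier work ran.
    std::future<void> recordEvent() {
        auto p = std::make_shared<std::promise<void>>();
        std::future<void> f = p->get_future();
        enqueue([p] { p->set_value(); });
        return f;
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // shutdown requested and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Work done by one host thread for one frame.
double scoreFrame(FakeStream& stream, int frameIndex, std::size_t pixels) {
    std::vector<double> buffer(pixels);      // synchronous allocation, like hipMalloc
    double score = 0.0;
    stream.enqueue([&buffer, frameIndex] {   // async "upload" of the frame
        for (double& v : buffer) v = frameIndex;
    });
    stream.enqueue([&buffer, &score] {       // async "kernel" that reduces to a score
        for (double v : buffer) score += v;
    });
    std::future<void> event = stream.recordEvent();
    event.wait();                            // the only blocking point, like hipEventSynchronize
    return score;                            // buffer is released on return, like hipFree
}

// One host thread per frame, each with its own stream.
std::vector<double> scoreAllFrames(int threads, std::size_t pixels) {
    assert(threads > 0 && "need at least one host thread");
    std::vector<double> scores(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&scores, t, pixels] {
            FakeStream stream;
            scores[static_cast<std::size_t>(t)] = scoreFrame(stream, t, pixels);
        });
    }
    for (std::thread& th : pool) th.join();
    return scores;
}
```

Calling `scoreAllFrames(4, 8)` runs four host threads, each with its own queue, and returns one score per thread. The point is only that each thread blocks once per frame, on the recorded event, and never on the whole queue.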

I believe this issue is related to stream management, but I am not really knowledgeable about the internals of ROCm.

I am afraid I cannot do much in my code to fix this issue, all the more since it has never happened to me and this is the first time we have been able to trigger the error consistently.

I hope that this issue will be useful.
Thank you for your time and the nice job!

Operating System

CachyOS Linux

CPU

AMD Ryzen 5 7600X 6-Core Processor

GPU

AMD Radeon RX 7900 XTX

ROCm Version

6.3.2-2

ROCm Component

clr

Steps to Reproduce

Sadly, I do not know how to reproduce it.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @Line-fr. Internal ticket has been created to investigate this issue. Thanks!


mikesulsenti commented Mar 24, 2025

As an update, I now have ROCm version 6.3.3-1.1 installed and have re-compiled, and the same error still occurs:

python: /usr/src/debug/hip-runtime/hip-runtime-clr/rocclr/platform/commandqueue.cpp:139: void amd::HostQueue::finish(bool): Assertion `GetSubmissionBatch() == nullptr && "Can't claim the queue is finished with the active batch!"' failed.

If any additional system info is needed, let me know.
I'm also including a paste of the output of /opt/rocm/bin/rocminfo --support: https://rentry.co/82woyqdn

And lspci -v:

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] (rev c8) (prog-if 00 [VGA controller])
        Subsystem: XFX Limited RX-79XMERCB9 [SPEEDSTER MERC 310 RX 7900 XTX]
        Flags: bus master, fast devsel, latency 0, IRQ 120
        Memory at f000000000 (64-bit, prefetchable) [size=32G]
        Memory at f800000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=256]
        Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
        Expansion ROM at fcb00000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150] Advanced Error Reporting
        Capabilities: [200] Physical Resizable BAR
        Capabilities: [240] Power Budgeting <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [2a0] Access Control Services
        Capabilities: [2d0] Process Address Space ID (PASID)
        Capabilities: [320] Latency Tolerance Reporting
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [450] Lane Margining at the Receiver
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

rahulc1984 pushed a commit that referenced this issue Apr 12, 2025
* SWDEV-517078 - Maintain the trap handler ABI version in CLR

The trap handler ABI version is communicated to the debugger using
the r_version field in the r_debug structure.  This structure is
an external dependency, which makes it complicated to keep the trap
handler source (in CLR) and the ABI version number (external dependency)
in sync.

This patch proposes to patch the trap handler ABI version number in
_amdgpu_r_debug before communicating it to the debugger.

We can't directly include sc's executable.hpp file in CLR as it relies
on conflicting definitions of ELF related types, so instead we need to
rely on a-priori knowledge of the r_debug structure.  Fortunately, this
structure is part of a stable ABI, so its layout is guaranteed to be
kept stable.

Update the 2nd level trap handler to follow updates from the
ROCr-runtime.  The trap handlers are stripped from parts dedicated to
architectures unsupported by CLR.

Bump the r_debug.r_version to track the ABI changes in the trap handler.