[Bug]: HEVC decoding fails on DG1 when using upstream kernel instead of Intel DKMS #1415
Comments
I've also tested Sysman functionality and simple OpenCL programs. Those work fine, so in general the drm-tip kernel seems to work. Btw. the media-driver README still states the following:
Although AFAIK that has not been true for DG1 for over half a year, since this media-driver commit: db5a870. DG2 / ATS-M support is also already in the public kernel (for some of their variants, and requiring force-probing for now). |
from 22'Q1 release, media_driver does not support DG1 with ENABLE_PRODUCTION_KMD=ON anymore |
The note about outdated DG1 info in the README was just FYI. I am using ENABLE_PRODUCTION_KMD=OFF with the public kernel, and that fails for me on DG1 with HEVC. (There may have been a 3D + AVC transcode running at the same time in the background while I was running this test-case, but that should not have broken HEVC, as dmesg does not show any errors.) |
Hi @eero-t, I noticed i915.enable_guc=3 in your kernel parameters, could you try i915.enable_guc=2? Seems GuC submission doesn't work on DG1. |
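For readers unsure what the two values in the suggestion above mean: `i915.enable_guc` is a bitmask module parameter, where bit 0 selects GuC submission and bit 1 selects HuC loading, so `3` requests both and `2` requests HuC only. A minimal sketch for reading the value from a kernel command line follows. The function names are made up for illustration, and only decimal values are handled.

```python
import re

# Bit meanings of the i915 "enable_guc" module parameter bitmask.
GUC_SUBMISSION = 0x1
HUC_LOAD = 0x2


def parse_enable_guc(cmdline):
    """Return the i915.enable_guc value from a kernel command line, or None if absent.

    Hypothetical helper; only decimal values are recognised.
    """
    match = re.search(r"(?:^|\s)i915\.enable_guc=(-?\d+)(?=\s|$)", cmdline)
    return int(match.group(1)) if match else None


def describe_enable_guc(value):
    """Translate the bitmask into a human-readable description."""
    if value is None or value < 0:
        return "auto (driver default)"
    if value == 0:
        return "disabled"
    parts = []
    if value & GUC_SUBMISSION:
        parts.append("GuC submission")
    if value & HUC_LOAD:
        parts.append("HuC load")
    return " + ".join(parts) or "unknown bits"
```

On a live system the input would typically come from `/proc/cmdline`. Note that later comments in this thread report drm-tip enabling GuC scheduling by default, so an absent parameter does not imply that GuC submission is off.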
Hi @eero-t, media still has some issues on the drm-tip KMD for DG1. |
At the moment there are 2 possible ways to setup DG1:
That's up to the user to decide which kernel to use. However, in both cases, user is NOT supposed to adjust @eero-t : I strongly suggest to drop |
Good catch. I'll test upstream kernel without the GuC option tomorrow, and report back. (I've intended to clean that out, but had forgotten to do it for all kernel configs on all machines.) |
@dvrogozh GuC scheduling is enabled by public (yesterday) "drm-tip" kernel, even when it's not forced:
And the same issue persists. Note: I forgot to mention earlier, but in case it matters, all of these nodes have 2 0x4905 DG1 GPUs. Limiting media-driver devfs visibility just to first one (with Docker) did not change anything though. |
@Xiaogangli-intel even with GuC scheduling explicitly disabled for drm-tip:
Media driver fails:
|
When GuC scheduling is explicitly disabled, there's also a GPU hang:
See: gpu-hang.txt
Could you give pointer to more info?
Sorry, but I'm not interested in the public media-driver on a backport kernel, only in what's going upstream. |
Btw. 1-2 months ago when I was testing internal KMD + UMD versions, I was seeing some instances failing with OneVPL HEVC transcode, when trying to do many parallel transcodes on DG2. I did not debug it further (container instances were changing too fast), but I'm now wondering whether it's related to HEVC issues here with public KMD+UMD versions on DG1. Are there known HEVC issues for DG2 too? |
Hi @eero-t, I noticed the hang issue of HEVC decode. If you really mind using backport kernel, maybe we have to sync with KMD to check the progress of DG1 patches upstreaming. |
DG1 has been enabled in upstream kernel (not just drm-tip) for a long time: https://github.com/torvalds/linux/blob/master/include/drm/i915_pciids.h#L630 But kernel docs RFC section still mentions several items: https://www.kernel.org/doc/html/latest/gpu/rfc/index.html I've asked whether they've landed already upstream (in Linus' tree i.e. should docs have been moved out of RFC section), not just in public drm-tip that I was testing (and with which I was seeing the issues). |
According to kernel side, status specified in RFC docs applies both to public upstream and drm-tip. I.e. there are still significant gaps in kernel i915 dGPU support, although GuC scheduling has already been enabled by default. PS. I just tested latest media driver stack releases, and e.g. FFmpeg still gives 2 FPS with drm-tip (instead of the expected hundreds of FPS). I haven't updated the kernel side though (will probably do that late summer, when 5.19 nears release). |
I tested yesterday's drm-tip 5.19-rc7 (and few days earlier 5.19-rc6) on DG1, and things have gone downhill. Instead of 2 FPS HEVC decode, there are lots of failures now with FFmpeg / VA-API:
(No errors in dmesgs though.) When using FFmpeg with QSV instead of VA-API, it fails immediately:
However, exactly the same drm-tip kernel, user-space [1] and test-case still work fine on TGL (with perf in hundreds of FPS). [1] User-space components:
TGL dmesg content:
DG1 dmesg content:
I.e. the main differences are there being 2x DG1 devices, with GuC scheduling being enabled (by default), and THP being enabled only on TGL for some reason, although both have VT-d active. |
Tested media stack components built from the latest release tags (Ubuntu 22.04 based container):
And both the FFmpeg VA-API and OneVPL decoding failures are still there, both with slightly older drm-tip v6.0-rc3 kernel, and v6.0-rc5 from yesterday. OneVPL / MFX error message has changed to match what FFmpeg / VA-API was reporting:
|
drm-tip kernel dmesg shows this on startup, but I guess this is related just to error reporting, not media:
|
Things still fail with latest "drm-tip" (6.0-rc6) from today, and latest media-driver release:
EDIT: VAAPI init failure was due to kernel FW loading issue: https://gitlab.freedesktop.org/drm/intel/-/issues/6895 The error is now:
|
Any update on this? I've got a DG1 80EU and it fails decoding any video with VAAPI/QSV through ffmpeg cli. But everything works just fine on Windows. |
Things still fail with latest "drm-tip" (6.0-rc7) from yesterday, with a matching FW (GuC: 70.5.1, HuC: 7.9.3), and latest media stack releases:
Output from OneVPL tool:
Output from FFmpeg / VA-API:
OneVPL does not show anything in
E.g. running simple OpenCL program with latest public compute stack releases does not show any problems. |
could you help to have a try with #1500 |
Sure, but I'd like to see it first pass at least one of the CI tests... Currently they all fail for it? |
Tried it anyway. "Disable object capture for recoverable context" commit did not help, things fail like before. |
Coming here after my issue report on onevpl-intel-gpu: latest media-driver is completely unusable for me (I'm only interested in encoding). AVC and HEVC encoding both fail when using sample_encode program. Up till commit 60001c6 (bisected) HEVC encoding works. |
I tried DG1 on Windows 10 host last year, where it worked fine. |
@eero-t are you using the Mesa from https://github.com/intel-gpu/Mesa/tags or it's upstream Mesa? |
@Sherry-Lin I was using upstream Mesa i.e. version that distros will (eventually) include. |
Are you using Ubuntu 22.04? 2.1.2. Client Intel package repository configuration wget -qO - https://repositories.intel.com/graphics/intel-graphics.key | 2.1.3. Install Compute, Media, and Display runtimes Note: Intel’s version of Mesa includes support for the out-of-tree driver. Standard Mesa can be used for the easiest case where the default Ubuntu 23.04 driver is installed. sudo apt-get install -y |
@Jianshui Any ETA about fixing the DG1 media-driver support in non-OOT/upstream kernel? The OOT kernel maintainer even says they do not support DG1. It is difficult to understand why DG1 is still not fully supported in upstream as a product of the same period as TGL. |
sorry, it's out of my scope. I hope the OOT kernel can fulfill your requirements. |
I really hope you can fix DG1 media-driver support on the non-OOT/upstream kernel. It is difficult to build and install the OOT kernel on niche Linux distributions like Unraid OS, so DG1 support in the upstream kernel would really mean a lot. |
The media driver still needs the PROD kernel to work on DG1, and loses almost all features on the upstream kernel (decode, encode and VP all fail). |
Unfortunately the maintainer of PROD kernel says DG1 is not supported by them.
Also, no one seems to want to submit the DG1 media fixes from PROD i915 to upstream i915. At this point, DG1 has nowhere to go. |
Yes, that's why PROD i915 is the suggested kernel version for DG1 from KMD team. I'm going to close this issue. Please feel free to re-open if any concerns. |
So Intel is just giving up on driver support for the DG1 GPU? |
@Sherry-Lin Didn't you see that DG1 is NOT supported or tested with the backported i915 PROD/OOT driver? The Intel devs on the kernel side recommend that users use the upstream kernel. |
Judging from your comments, Intel does not plan to provide any support for DG1 in either the PROD/OOT or the upstream kernel. |
FYI, I tried the latest drm-tip and media driver today, and everything seems to work fine after applying this draft fix in the media driver: #1787. The upstream kernel doesn't support GEM buffer object capture, so the fix simply disables it in the media driver. This is the drm-tip version I tried: So, please feel free to give this change a try. Note: ENABLE_PRODUCTION_KMD=OFF is required (off by default). |
Hi @Jexu, thank you very much for the patch. With decode: vpp: copy:
encode: kernel:
dmesg i915:
lspci:
Please let me know if you have more patches that can fix the above issues on DG1. |
Regarding the VP9 and HEVC decoding hang, I compared the batch buffer dumps from the PROD and upstream kernels, but unfortunately didn't find any difference. Luckily, draft PR #1787 brings back most media features. If you want, I can prepare a formal fix and get it merged into the media driver. By the way, setting 'export INTEL_I915_CTX_CONTROL=1' in the environment to disable recoverable contexts also solves it. |
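As a rough illustration of the environment-variable workaround mentioned above, the sketch below builds an environment with `INTEL_I915_CTX_CONTROL=1` and pairs it with the FFmpeg VA-API test command from the issue description. The helper names are hypothetical, and the effect of the variable is taken from the comment above rather than verified independently.

```python
import os
import subprocess

# FFmpeg VA-API transcode command from the issue description.
FFMPEG_CMD = [
    "ffmpeg", "-hwaccel", "vaapi", "-hwaccel_output_format", "vaapi",
    "-i", "GTAV_1920x1080_60_yuv420p.h265",
    "-c:v", "h264_vaapi", "-f", "null", "-",
]


def media_env(base_env=None):
    """Return a copy of the environment with the workaround variable set.

    Per the comment above, INTEL_I915_CTX_CONTROL=1 disables recoverable
    contexts in the media driver; that behaviour is assumed here.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["INTEL_I915_CTX_CONTROL"] = "1"
    return env


def run_transcode(cmd=FFMPEG_CMD):
    """Run the transcode with the workaround applied and return the exit code."""
    return subprocess.run(cmd, env=media_env(), check=False).returncode
```

Setting the variable per process, rather than exporting it for the whole shell session, keeps the workaround scoped to the media workload under test.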
More experiment: |
I never tried it on DG1, but the Xe KMD has experimental support for DG1, and everything needed for Xe KMD support is already in the latest media driver. |
@Jexu With the Xe KMD mainlined, I revisited DG1 with it (Linux 6.9.5 + xe.force_probe=4905). Media now works fine on DG1, including end-to-end transcoding. No more segfaults or extreme slowness with certain codecs. I'd call it very usable, though it still needs time to prove its stability. |
I tried replacing the Unraid kernel with Linux 6.9.5 and added xe.force_probe=4908 to the startup script. It can only drive the DG1 but cannot decode properly. Are there any additional steps required? |
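The two comments above use different device IDs with `xe.force_probe` (4905 and 4908). The ID passed to the parameter should be the one the installed card actually reports. A minimal sketch for deriving it from `lspci -nn` output follows. The function names are illustrative, and the sample line in the usage note is an assumed example, not output captured from this thread.

```python
import re

# Matches display-class devices (PCI class 03xx) from Intel (vendor 8086)
# in `lspci -nn` output and captures the 4-digit device ID.
INTEL_GPU_RE = re.compile(r"\[03[0-9a-f]{2}\]:.*\[8086:([0-9a-f]{4})\]", re.IGNORECASE)


def intel_gpu_ids(lspci_nn_output):
    """Return the unique Intel GPU device IDs found, in order of appearance."""
    ids = []
    for line in lspci_nn_output.splitlines():
        match = INTEL_GPU_RE.search(line)
        if match:
            device_id = match.group(1).lower()
            if device_id not in ids:
                ids.append(device_id)
    return ids


def force_probe_param(ids, driver="xe"):
    """Build a kernel parameter such as 'xe.force_probe=4905' from device IDs."""
    return f"{driver}.force_probe={','.join(ids)}"
```

For an assumed input line such as `03:00.0 VGA compatible controller [0300]: Intel Corporation DG1 [Iris Xe MAX Graphics] [8086:4905] (rev 01)`, `intel_gpu_ids` yields `["4905"]`, which `force_probe_param` turns into `xe.force_probe=4905`.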
Possibilities:
|
Which component impacted?
Decode
Is it regression? Good in old configuration?
No response
What happened?
Use-cases
ffmpeg -hwaccel vaapi -hwaccel_output_format vaapi -i GTAV_1920x1080_60_yuv420p.h265 -c:v h264_vaapi -f null -
sample_multi_transcode -i::h265 GTAV_1920x1080_60_yuv420p.h265 -o::h264 /dev/null
Expected outcome
Both of the above transcode at hundreds of FPS, as is the case with the TGL iGPU using exactly the same setup, or when I change the input to an H.264 one.
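To compare a run against this expectation automatically, one option is to read the `fps=` field from FFmpeg's progress output. The sketch below is a hypothetical helper. The threshold of 100 FPS is an assumed stand-in for "hundreds of FPS", not a figure from the report.

```python
import re

# FFmpeg progress lines contain a field like "fps= 45" or "fps=2.0".
FPS_RE = re.compile(r"fps=\s*([0-9]+(?:\.[0-9]+)?)")


def last_fps(ffmpeg_stderr):
    """Return the last fps value reported in FFmpeg's stderr, or None if none found."""
    values = FPS_RE.findall(ffmpeg_stderr)
    return float(values[-1]) if values else None


def meets_expectation(fps, minimum=100.0):
    """True if a measured fps reaches the (assumed) minimum throughput."""
    return fps is not None and fps >= minimum
```

With this helper, a run reporting 2 FPS, as described in the comments for DG1 on drm-tip, would be flagged as not meeting the expectation.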
Actual outcome
I do not know whether this is a regression. There have been too many issues to say for sure whether it's ever worked on 0x4905 device.
What's the usage scenario when you are seeing the problem?
Transcode for media delivery
What impacted?
No response
Debug Information
Setup
VA-info
Notes
There are no GPU hangs. Kernel driver output / settings:
Do you want to contribute a patch to fix the issue?
No.