zesDeviceProcessesGetState is returning 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE) #809

Open
jketreno opened this issue Feb 8, 2025 · 4 comments
Labels
bug in queue L0 Sysman Issue related to L0 Sysman

Comments

@jketreno

jketreno commented Feb 8, 2025

I'm writing a small ze-top-like utility to monitor the B580. It looks like zesDeviceProcessesGetState should be able to give me information about the processes using the GPU. However, it always returns ZE_RESULT_ERROR_UNSUPPORTED_FEATURE. That return code is documented for other APIs, but it doesn't seem to be in the list of valid return codes for zesDeviceProcessesGetState.

I have a valid device handle, which I'm using to call zesDeviceEnumEngineGroups to get usage info from the engines, and that's working well.

I've tried running as sudo in case there was a permissions issue, but that didn't help.

#define _MAX_PROCESSES 2048
/* In: capacity of allProcesses. Out: number of entries the driver reports. */
uint32_t processCount = _MAX_PROCESSES;
zes_process_state_t allProcesses[_MAX_PROCESSES];
ze_result_t ret = zesDeviceProcessesGetState(hSysmanHandle, &processCount, allProcesses);
/* INVALID_SIZE only means the array was too small, so it is not reported here. */
if (ret != ZE_RESULT_SUCCESS && ret != ZE_RESULT_ERROR_INVALID_SIZE) {
    fprintf(stderr, "Unable to get process information (ret count %u): %08X (%s)\n", processCount, ret, ze_error_to_str(ret));
}
...

The above outputs:

Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
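
For readers decoding the hex value in that message, here is a minimal sketch of a formatter along the lines of the `ze_error_to_str` helper used above. That helper's implementation is not shown in the thread, so `result_name` and `format_result` are hypothetical names. Only 0x78000003 is confirmed by the output above; the other numeric values are recalled from `ze_api.h` and are assumptions to check against your copy of the header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Maps a few ze_result_t values to their names. Values other than
   0x78000003 are assumptions taken from memory of ze_api.h. */
static const char *result_name(uint32_t code)
{
    switch (code) {
    case 0x00000000u: return "ZE_RESULT_SUCCESS";
    case 0x70010000u: return "ZE_RESULT_ERROR_INSUFFICIENT_PERMISSIONS";
    case 0x78000003u: return "ZE_RESULT_ERROR_UNSUPPORTED_FEATURE";
    case 0x78000008u: return "ZE_RESULT_ERROR_INVALID_SIZE";
    default:          return "unlisted result code";
    }
}

/* Writes "<8 hex digits> (<name>)" into buf, matching the message format
   used in the snippet above. */
static void format_result(uint32_t code, char *buf, size_t n)
{
    assert(buf != NULL && n > 0);
    snprintf(buf, n, "%08X (%s)", (unsigned)code, result_name(code));
}
```

In a real build the `ZE_RESULT_*` enumerators from the header would replace the literal numbers; they are spelled out here only so the sketch compiles on its own.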

I've also tried setting processCount to 0 so that the call tells me how many process entries to allocate, but that returns the same error code.
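
The count-then-fetch pattern described here can be sketched as follows. To keep the sketch buildable without the Level Zero headers, it goes through a function pointer with the same call shape as zesDeviceProcessesGetState. `proc_state`, `get_state_fn`, `query_processes`, and the two fake drivers are all hypothetical names, and the result-code values other than 0x78000003 are assumed from `ze_api.h`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain numbers so this builds without the Level Zero headers. 0x78000003 is
   confirmed by the output above; the others are assumptions from ze_api.h. */
#define RES_SUCCESS             0x00000000u
#define RES_OUT_OF_HOST_MEMORY  0x70000002u
#define RES_UNSUPPORTED_FEATURE 0x78000003u
#define RES_INVALID_SIZE        0x78000008u

/* Hypothetical stand-in for zes_process_state_t (payload fields only). */
typedef struct {
    uint32_t processId;
    uint64_t memSize;
    uint64_t sharedSize;
    uint32_t engines;
} proc_state;

/* Same call shape as zesDeviceProcessesGetState(hDevice, &count, array). */
typedef uint32_t (*get_state_fn)(void *handle, uint32_t *count, proc_state *out);

/* Ask for the count, allocate, then fetch; retry if the list grew meanwhile.
   On success the caller owns *out and must free() it. */
static uint32_t query_processes(get_state_fn get, void *handle,
                                proc_state **out, uint32_t *out_count)
{
    assert(get != NULL && out != NULL && out_count != NULL);
    *out = NULL;
    *out_count = 0;
    for (int attempt = 0; attempt < 4; ++attempt) {
        uint32_t count = 0;
        uint32_t ret = get(handle, &count, NULL);   /* call 1: count only */
        if (ret != RES_SUCCESS) return ret;         /* e.g. UNSUPPORTED_FEATURE */
        if (count == 0) return RES_SUCCESS;         /* nothing attached */

        proc_state *buf = calloc(count, sizeof *buf);
        if (buf == NULL) return RES_OUT_OF_HOST_MEMORY;

        ret = get(handle, &count, buf);             /* call 2: fill the buffer */
        if (ret == RES_SUCCESS) {
            *out = buf;
            *out_count = count;
            return RES_SUCCESS;
        }
        free(buf);
        if (ret != RES_INVALID_SIZE) return ret;
        /* INVALID_SIZE: the process list grew between the calls, so retry. */
    }
    return RES_INVALID_SIZE;
}

/* Fake drivers for demonstration only; they are not part of Level Zero. */
static uint32_t fake_unsupported(void *handle, uint32_t *count, proc_state *out)
{
    (void)handle; (void)count; (void)out;
    return RES_UNSUPPORTED_FEATURE;
}

static uint32_t fake_two_procs(void *handle, uint32_t *count, proc_state *out)
{
    (void)handle;
    if (out == NULL || *count == 0) { *count = 2; return RES_SUCCESS; }
    if (*count < 2) return RES_INVALID_SIZE;
    out[0] = (proc_state){ 26537u, 5556486144ull, 0u, 1u << 1 };
    out[1] = (proc_state){ 23724u, 3420160ull, 0u, 0u };
    *count = 2;
    return RES_SUCCESS;
}
```

With `fake_unsupported` the helper returns on the very first call, which matches what is reported here: a driver that does not implement the query fails the same way whether the count is 0 or 2048, so the buffer size is not the cause.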

I'm using libze-intel-gpu1 version 24.52.32224.5-124.10ppa2, and libze1 version 1.19.2.0-1076~24.10.

Thanks,
James

JablonskiMateusz added the "L0 Sysman" (Issue related to L0 Sysman) label on Feb 10, 2025
@jketreno
Author

Adding additional context; it looks like the device handle I was using was for the integrated Intel UHD 770:

Output while UHD 770 is running a workload, and I monitor the UHD 770:

Device 0: 868080A7-0400-0000-0002-000000000000
 BDF: 0000:0000:0002:0000
 PCI ID: 8086:A780
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) UHD Graphics 770
 Vendor Name: Intel(R) Corporation
 Driver Version: 7209A40C3CFCD5142354A9F
 Type: GPU
 Is integrated with host: Yes
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: No
Device 0: 7 engines found.
 Engine 0:
  Type: ZES_ENGINE_GROUP_RENDER_SINGLE
  Sub-device: No
 Engine 1:
  Type: ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE
  Sub-device: No
 Engine 2:
  Type: ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE
  Sub-device: No
 Engine 3:
  Type: ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE
  Sub-device: No
 Engine 4:
  Type: ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE
  Sub-device: No
 Engine 5:
  Type: ZES_ENGINE_GROUP_COPY_SINGLE
  Sub-device: No
 Engine 6:
  Type: ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE
  Sub-device: No
INFO: No temperature sensors to monitor.
Monitoring 7 engines.
ZES_ENGINE_GROUP_RENDER_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: N/A
ZES_ENGINE_GROUP_COPY_SINGLE: N/A
ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: N/A
Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
ZES_ENGINE_GROUP_RENDER_SINGLE: 98%
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: 0%
ZES_ENGINE_GROUP_COPY_SINGLE: 0%
ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: 0%
Unable to get process information (ret count 2048): 78000003 (ZE_RESULT_ERROR_UNSUPPORTED_FEATURE)
...
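
The pattern in this output (N/A on the first pass, percentages afterwards) is what a delta-based calculation produces. A minimal sketch follows, assuming the utility samples zesEngineGetActivity and derives utilization from two consecutive samples; the thread does not show that code, so `engine_sample` and `engine_utilization` are hypothetical names standing in for `zes_engine_stats_t` and whatever the utility actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two zes_engine_stats_t fields used here. */
typedef struct {
    uint64_t activeTime;  /* monotonic busy-time counter, microseconds */
    uint64_t timestamp;   /* time at which activeTime was sampled, microseconds */
} engine_sample;

/* Returns utilization in percent over the interval between two samples,
   or -1.0 when it cannot be computed (shown as "N/A" in the output above). */
static double engine_utilization(const engine_sample *prev, const engine_sample *cur)
{
    assert(cur != NULL);
    if (prev == NULL) return -1.0;                        /* first sample */
    if (cur->timestamp <= prev->timestamp) return -1.0;   /* no elapsed time */
    if (cur->activeTime < prev->activeTime) return -1.0;  /* counter went backwards */

    double busy = (double)(cur->activeTime - prev->activeTime);
    double wall = (double)(cur->timestamp - prev->timestamp);
    double pct = 100.0 * busy / wall;
    return pct > 100.0 ? 100.0 : pct;                     /* clamp sampling jitter */
}
```

For example, 980000 us of busy time over a 1000000 us interval gives 98%, the figure reported for ZES_ENGINE_GROUP_RENDER_SINGLE.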

I had mistakenly thought the B580 would have engine groups, so I took the existence of engine groups to mean I was running on the B580. So while zesDeviceProcessesGetState is working correctly on the B580, it is failing on the UHD 770.

When I run the workload on the B580 and monitor it, zesDeviceProcessesGetState shows activity on engine type ZES_ENGINE_TYPE_FLAG_COMPUTE, but zesDeviceEnumEngineGroups does not return any engine groups for the B580. Is there another way to track compute utilization with the B580, or is there a kernel parameter required to turn that on in the Xe driver?

Output while running workload on B580 and monitor its usage:

Device 0: 86800BE2-0000-0000-0300-000000000000
 BDF: 0000:0003:0000:0000
 PCI ID: 8086:E20B
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) Graphics [0xe20b]
 Vendor Name: Intel(R) Corporation
 Driver Version: 977D4CB66F62C239FD56D33
 Type: GPU
 Is integrated with host: No
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: Yes
Device 0: 0 engines found.
INFO: No temperature sensors to monitor.
INFO: No engines to monitor.
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
       26537 python chat.py                 MEM: 5556486144           SHR: 0                    FLAGS: COMPUTE
...

One oddity: when the workload runs on the integrated GPU (i915), the process query against the B580 still shows the process that is using the i915 driver, but with no engine flags set:

Output while UHD 770 is running a workload, and I monitor the B580:

Device 0: 86800BE2-0000-0000-0300-000000000000
 BDF: 0000:0003:0000:0000
 PCI ID: 8086:E20B
 Subdevices: 0
 Serial Number: unknown
 Board Number: unknown
 Brand Name: unknown
 Model Name: Intel(R) Graphics [0xe20b]
 Vendor Name: Intel(R) Corporation
 Driver Version: 977D4CB66F62C239FD56D33
 Type: GPU
 Is integrated with host: No
 Is a sub-device: No
 Supports error correcting memory: No
 Supports on-demand pauge-faulting: Yes
Device 0: 0 engines found.
INFO: No temperature sensors to monitor.
INFO: No engines to monitor.
       23724 python chat.py                 MEM: 3420160              SHR: 0                    FLAGS:
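
For context on the FLAGS column in these listings, here is a minimal sketch of turning the `engines` bitfield of a process entry into text. The bit positions are recalled from the `zes_engine_type_flag_t` definition in `zes_api.h` and should be treated as assumptions. `format_engine_flags` is a hypothetical name, not the code the utility above uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Writes the names of the set engine-type bits into buf, space separated.
   An empty string means no bit was set, as in the last listing above. */
static void format_engine_flags(uint32_t engines, char *buf, size_t n)
{
    /* Bit positions assumed from zes_api.h (ZES_ENGINE_TYPE_FLAG_*). */
    static const struct { uint32_t bit; const char *name; } table[] = {
        { 1u << 0, "OTHER" },  { 1u << 1, "COMPUTE" }, { 1u << 2, "3D" },
        { 1u << 3, "MEDIA" },  { 1u << 4, "DMA" },     { 1u << 5, "RENDER" },
    };

    assert(buf != NULL && n > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if ((engines & table[i].bit) == 0) continue;
        size_t used = strlen(buf);
        /* snprintf truncates safely if the buffer is too small. */
        snprintf(buf + used, n - used, "%s%s", used ? " " : "", table[i].name);
    }
}
```

Under these assumptions, an entry with only the COMPUTE bit set prints "COMPUTE", and an entry whose `engines` field is 0 prints nothing, which is what the empty FLAGS column would correspond to.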

@saik-intel
Contributor

@jketreno we will look into this internally and update you.

@saik-intel
Contributor

When I run the workload on the B580 and monitor it, zesDeviceProcessesGetState shows activity on engine type ZES_ENGINE_TYPE_FLAG_COMPUTE, but zesDeviceEnumEngineGroups does not return any engine groups for the B580. Is there another way to track compute utilization with the B580, or is there a kernel parameter required to turn that on in the Xe driver?

[Sai] The Xe driver upstream patch is in review and waiting to be merged; it will be merged once it is ready. Regarding the other issue you raised for the UHD 770, we see it working, as shown in the log below:

root@DUT6051BMGSVC:/home/gta/level_zero/bin# export ZELLO_SYSMAN_USE_ZESINIT=1; export ZES_ENABLE_SYSMAN=1; export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/gta/level_zero/libs/:/home/gta/level_zero/latest_loader/:/home/gta/level_zero/bin/;
root@DUT6051BMGSVC:/home/gta/level_zero/bin# ./zello_sysman -g
ZES_ENABLE_SYSMAN environment variable Set
Sysman Initialization done via zesInit
---- Global Operations tests ----
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = Intel(R) Corporation
properties.modelName = Intel(R) UHD Graphics 770
properties.vendorName = Intel(R) Corporation
properties.driverVersion = BABE9C47939376BE4C71D06
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 42880
properties.core.flags = 1
properties.core.coreClockRate = 1650
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 7
properties.core.numEUsPerSubslice = 16
properties.core.numSubslicesPerSlice = 2
properties.core.numSlices = 1
properties.core.timerResolution = 52
properties.core.timestampValidBits = 36
properties.core.kernelTimestampValidBits = 32
properties.core.uuid =
134 128 128 167 4 0 0 0 0 2 0 0 0 0 0 0
properties.core.name = Intel(R) UHD Graphics 770
reset status: 0
repair0
---- Global Operations tests ----
properties.numSubdevices = 0
properties.serialNumber = unknown
properties.boardNumber = unknown
properties.brandName = Intel(R) Corporation
properties.modelName = Intel(R) Arc(TM) B580 Graphics
properties.vendorName = Intel(R) Corporation
properties.driverVersion = BABE9C47939376BE4C71D06
properties.core.type = 1
properties.core.vendorId = 32902
properties.core.deviceId = 57867
properties.core.flags = 8
properties.core.coreClockRate = 2850
properties.core.maxHardwareContexts = 65536
properties.core.maxCommandQueuePriority = 0
properties.core.numThreadsPerEU = 8
properties.core.numEUsPerSubslice = 8
properties.core.numSubslicesPerSlice = 4
properties.core.numSlices = 5
properties.core.timerResolution = 52
properties.core.timestampValidBits = 64
properties.core.kernelTimestampValidBits = 64
properties.core.uuid =
134 128 11 226 0 0 0 0 3 0 0 0 0 0 0 0
properties.core.name = Intel(R) Arc(TM) B580 Graphics
reset status: 0
repair0

@eero-t

eero-t commented Feb 12, 2025

This looks like the relevant kernel patch series, but it targets the Xe KMD tree, not upstream: https://patchwork.freedesktop.org/series/144408/
