
Conversation

@redhatHameed (Collaborator) commented Nov 5, 2025

Changes

This PR adds multi-vendor GPU/accelerator support to the observability stack, enabling monitoring of both NVIDIA GPUs (via DCGM) and Intel Gaudi accelerators (via Habana Labs exporter). The changes allow the system to automatically detect and use vendor-specific metrics while maintaining backward compatibility.

Key changes:

  • Extended GPU metric categorization to support Intel Gaudi metrics alongside NVIDIA DCGM metrics
  • Added automatic vendor detection and fallback queries for cross-vendor compatibility (see the sketch after this list)
  • Introduced a comprehensive Intel Gaudi metric discovery function
  • Updated documentation with the multi-vendor architecture and metric mapping
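
Below is a minimal sketch of the vendor-fallback query pattern mentioned above, assuming a Prometheus-compatible HTTP API. The endpoint URL and helper function are hypothetical; only the PromQL strings come from this PR.

```python
import requests

# Hypothetical Prometheus/Thanos endpoint -- not part of this PR.
PROM_URL = "http://prometheus.example:9090"

# PromQL `or` returns the NVIDIA DCGM series when it exists and falls back to
# the Intel Gaudi (habanalabs) series otherwise; queries taken from this PR.
FALLBACK_QUERIES = {
    "GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)",
    "GPU Power Usage (Watts)": "avg(DCGM_FI_DEV_POWER_USAGE) or avg(habanalabs_power_mW) / 1000",
}

def query_metric(friendly_name: str) -> float | None:
    """Run the vendor-agnostic query and return the first sample value, if any."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": FALLBACK_QUERIES[friendly_name]},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json().get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None
```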

Checklist

  • Verify on the cluster
  • Update tests if applicable and run pytest
  • Add screenshots (if applicable)
  • Update readme (if applicable)

Copilot AI review requested due to automatic review settings November 5, 2025 17:38
Copilot AI (Contributor) left a comment


Pull Request Overview


Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/core/promql_service.py | Extended GPU metric categorization to recognize and process both NVIDIA DCGM and Intel Gaudi metrics |
| src/core/metrics.py | Added Intel Gaudi metric discovery, updated GPU info detection for multi-vendor support, and modified vLLM/OpenShift metrics to use vendor-agnostic queries |
| docs/OBSERVABILITY_OVERVIEW.md | Added comprehensive multi-vendor GPU support documentation including architecture, vendor detection, and query examples |
| docs/INTEL_GAUDI_METRICS.md | New documentation detailing Intel Gaudi metrics, mapping to NVIDIA equivalents, and integration details |


@redhatHameed force-pushed the intel-metrics branch 2 times, most recently from 1ddc562 to 0190157, on November 5, 2025 18:09
@redhatHameed changed the title from "added document proposal Intel Gaudi accelerator metrics integration" to "added implementation proposal Intel Gaudi accelerator metrics integration" on Nov 6, 2025
Copilot AI review requested due to automatic review settings November 10, 2025 18:23
Copilot AI (Contributor) left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.



"GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)",
"GPU Memory Usage (GB)": "avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)",
"GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)",
"GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP) or avg(habanalabs_temperature_threshold_memory)",

Copilot AI Nov 10, 2025


The metric habanalabs_temperature_threshold_memory is a temperature threshold, not an actual temperature reading. This is being used as a fallback for DCGM_FI_DEV_MEMORY_TEMP which measures actual memory temperature. This creates a semantic mismatch where a threshold value would be displayed as if it were a current temperature measurement. Consider either finding the actual memory temperature metric for Intel Gaudi or removing this fallback.

Suggested change:
- "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP) or avg(habanalabs_temperature_threshold_memory)",
+ "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP)",

"habanalabs_energy": {
"type": "counter",
"description": "Intel Gaudi energy consumption",
"aggregation": "avg",

Copilot AI Nov 10, 2025


Counter metrics like habanalabs_energy, habanalabs_pcie_rx, and habanalabs_pcie_tx should use rate() or increase() functions in queries, not avg(). Using avg() on a counter will give you the average of cumulative values, which doesn't provide meaningful insights. The aggregation should be "rate" or "increase" to properly handle counter semantics.

Suggested change:
- "aggregation": "avg",
+ "aggregation": "rate",

Comment on lines +517 to +520
metric_name = metric_expr.split()[0] if " " not in metric_expr and "/" not in metric_expr else None
if metric_name and metric_name in all_metrics:
metric_mapping[friendly_name] = f"avg({metric_name})"


Copilot AI Nov 10, 2025


The logic for parsing metric_expr is fragile and may fail for expressions with multiple operators. For example, habanalabs_temperature_threshold_memory doesn't contain / or multiple spaces, so metric_expr.split()[0] would just return the full metric name. However, the condition " " not in metric_expr and "/" not in metric_expr would be True, causing the split to happen unnecessarily. Consider simplifying this logic to check if the expression contains operators like / to determine if it's an expression or a simple metric name.

Suggested change:
- metric_name = metric_expr.split()[0] if " " not in metric_expr and "/" not in metric_expr else None
- if metric_name and metric_name in all_metrics:
- metric_mapping[friendly_name] = f"avg({metric_name})"
+ if re.match(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$", metric_expr) and metric_expr in all_metrics:
+ metric_mapping[friendly_name] = f"avg({metric_expr})"

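A self-contained sketch of this suggestion, using a hypothetical helper name; the regex is the standard Prometheus metric-name grammar.

```python
import re

# Prometheus metric-name grammar; anything else is treated as an expression.
_METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def bare_metric_name(metric_expr: str) -> str | None:
    """Return the expression if it is a plain metric name, otherwise None."""
    return metric_expr if _METRIC_NAME_RE.match(metric_expr) else None

# Plain names pass; arithmetic expressions such as the mW-to-W conversion do not.
assert bare_metric_name("habanalabs_temperature_threshold_memory") is not None
assert bare_metric_name("habanalabs_power_mW / 1000") is None
```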
| **Temperature** | `habanalabs_temperature_onchip` | `DCGM_FI_DEV_GPU_TEMP` | Celsius | On-chip/GPU temperature |
| | `habanalabs_temperature_onboard` | N/A | Celsius | Board temperature |
| | `habanalabs_temperature_threshold_gpu` | N/A | Celsius | GPU temperature threshold |
| | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature threshold |

Copilot AI Nov 10, 2025


The Intel Gaudi metric habanalabs_temperature_threshold_memory represents a temperature threshold, not the actual memory temperature. Using this in a comparison with DCGM_FI_DEV_MEMORY_TEMP (which is actual temperature) creates a semantic mismatch. Consider either documenting this difference clearly or using a different metric pairing that represents the same type of measurement.

Suggested change:
- | | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature threshold |
+ | | `habanalabs_temperature_threshold_memory` | N/A | Celsius | Memory temperature threshold (threshold value, not real-time) |
+ | | N/A | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature (real-time, no Gaudi equivalent) |

"GPU Power Usage (Watts)": "avg(DCGM_FI_DEV_POWER_USAGE) or avg(habanalabs_power_mW) / 1000",
"GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)",
"GPU Memory Usage (GB)": "avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)",
"GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)",

Copilot AI Nov 10, 2025


Using avg() on the counter metric habanalabs_energy is incorrect. Counter metrics should be queried with rate() or increase() to get meaningful rate-of-change values. Using avg() on a monotonically increasing counter will just give you the average of cumulative values. Consider using rate(habanalabs_energy[5m]) or documenting that this represents total accumulated energy rather than consumption rate.

Suggested change:
- "GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)",
+ "GPU Energy Consumption (Joules)": "avg(rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m])) or avg(rate(habanalabs_energy[5m]))",

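If the "Joules" label is kept, one option (a sketch, not part of the PR, and assuming the counter is reported in Joules as the label suggests) is increase() over a window, which yields energy used in that window; rate() instead yields Joules per second, i.e. Watts.

```python
# Sketch only: two ways to query the monotonically increasing energy counter.
GAUDI_ENERGY_QUERIES = {
    # Energy consumed during the last 5 minutes (keeps the Joules label honest).
    "GPU Energy Used over 5m (Joules)": "sum(increase(habanalabs_energy[5m]))",
    # Per-second rate of the same counter, i.e. average power in Watts.
    "GPU Power from Energy Counter (Watts)": "avg(rate(habanalabs_energy[5m]))",
}
```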
info["models"] = ["GPU"]
nvidia_count = len(result)
if nvidia_count > 0:
temps = [float(series.get("value", [None, None])[1]) for series in result if series.get("value")]

Copilot AI Nov 10, 2025


The list comprehension can raise ValueError or TypeError if the value at index 1 cannot be converted to float (e.g., if it's None, a string, or missing). Consider adding error handling or a validation check: temps = [float(v[1]) for series in result if (v := series.get("value")) and len(v) > 1 and v[1] is not None] or wrapping in a try-except within the comprehension.

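One way to expand this suggestion into a reusable helper (hypothetical name, not from the PR): skip samples whose value is missing or not parseable instead of letting the comprehension raise.

```python
def safe_temps(result: list[dict]) -> list[float]:
    """Collect parseable sample values, skipping missing or non-numeric ones."""
    temps: list[float] = []
    for series in result:
        value = series.get("value")
        if not value or len(value) < 2 or value[1] is None:
            continue
        try:
            temps.append(float(value[1]))
        except (TypeError, ValueError):
            continue
    return temps

# A well-formed sample is kept; a missing value and a non-numeric value are skipped.
assert safe_temps([{"value": [0, "42.5"]}, {}, {"value": [0, "n/a"]}]) == [42.5]
```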
result = resp.json().get("data", {}).get("result", [])
intel_count = len(result)
if intel_count > 0:
temps = [float(series.get("value", [None, None])[1]) for series in result if series.get("value")]

Copilot AI Nov 10, 2025


The list comprehension can raise ValueError or TypeError if the value at index 1 cannot be converted to float (e.g., if it's None, a string, or missing). Consider adding error handling or a validation check: temps = [float(v[1]) for series in result if (v := series.get("value")) and len(v) > 1 and v[1] is not None] or wrapping in a try-except within the comprehension.

Comment on lines +777 to +783
gaudi_mapping["GPU Energy Consumption (Joules)"] = f"avg({metric})"

# PCIe metrics
elif metric == "habanalabs_pcie_rx":
gaudi_mapping["PCIe RX Traffic (bytes)"] = f"avg({metric})"
elif metric == "habanalabs_pcie_tx":
gaudi_mapping["PCIe TX Traffic (bytes)"] = f"avg({metric})"

Copilot AI Nov 10, 2025


Using avg() on counter metrics habanalabs_energy, habanalabs_pcie_rx, and habanalabs_pcie_tx is incorrect. Counter metrics represent monotonically increasing values and should be queried with rate() or increase() to get meaningful rate-of-change information. For example, use rate(habanalabs_energy[5m]) for energy consumption rate, and rate(habanalabs_pcie_rx[5m]) for PCIe receive rate.

Suggested change:
- gaudi_mapping["GPU Energy Consumption (Joules)"] = f"avg({metric})"
- # PCIe metrics
- elif metric == "habanalabs_pcie_rx":
- gaudi_mapping["PCIe RX Traffic (bytes)"] = f"avg({metric})"
- elif metric == "habanalabs_pcie_tx":
- gaudi_mapping["PCIe TX Traffic (bytes)"] = f"avg({metric})"
+ gaudi_mapping["GPU Energy Consumption (Joules)"] = f"rate({metric}[5m])"
+ # PCIe metrics
+ elif metric == "habanalabs_pcie_rx":
+ gaudi_mapping["PCIe RX Traffic (bytes)"] = f"rate({metric}[5m])"
+ elif metric == "habanalabs_pcie_tx":
+ gaudi_mapping["PCIe TX Traffic (bytes)"] = f"rate({metric}[5m])"

Comment on lines +503 to +504
# If no NVIDIA metrics, try Intel Gaudi metrics
if not metric_mapping:

Copilot AI Nov 10, 2025


The fallback logic for Intel Gaudi metrics only activates if not metric_mapping, which means it will skip Intel Gaudi metrics entirely if any NVIDIA metrics were found. This could be problematic if a cluster has both metric types available but some NVIDIA metrics are present while corresponding Intel Gaudi metrics should be used for specific devices. Consider checking for specific metric presence rather than assuming all-or-nothing based on whether any NVIDIA metrics exist.

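A per-metric merge along the lines this comment describes might look like the sketch below (hypothetical helper, not from the PR): NVIDIA queries win where present and Gaudi queries fill only the gaps.

```python
def merge_vendor_mappings(nvidia: dict[str, str], gaudi: dict[str, str]) -> dict[str, str]:
    """Prefer NVIDIA queries, but fall back per friendly name to Gaudi queries."""
    merged = dict(nvidia)
    for friendly_name, query in gaudi.items():
        merged.setdefault(friendly_name, query)
    return merged

# Example: utilization comes from DCGM, power falls back to the Gaudi exporter.
merged = merge_vendor_mappings(
    {"GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL)"},
    {"GPU Utilization (%)": "avg(habanalabs_utilization)",
     "GPU Power Usage (Watts)": "avg(habanalabs_power_mW) / 1000"},
)
assert merged["GPU Power Usage (Watts)"] == "avg(habanalabs_power_mW) / 1000"
```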
"GPU Power Usage (Watts)": "habanalabs_power_mW / 1000",
"GPU Memory Usage (GB)": "habanalabs_memory_used_bytes / (1024*1024*1024)",
"GPU Energy Consumption (Joules)": "habanalabs_energy",
"GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory",

Copilot AI Nov 10, 2025


The metric habanalabs_temperature_threshold_memory is a temperature threshold, not an actual temperature reading. This is being used as a fallback for DCGM_FI_DEV_MEMORY_TEMP which measures actual memory temperature. This creates a semantic mismatch where a threshold value would be displayed as if it were a current temperature measurement. Consider either finding the actual memory temperature metric for Intel Gaudi or removing this fallback.

Suggested change:
- "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory",
+ # "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory", # Removed: threshold, not actual temp
