
Conversation

@redhatHameed (Collaborator) commented Nov 5, 2025

Changes

This PR adds multi-vendor GPU/accelerator support to the observability stack, enabling monitoring of both NVIDIA GPUs (via DCGM) and Intel Gaudi accelerators (via Habana Labs exporter). The changes allow the system to automatically detect and use vendor-specific metrics while maintaining backward compatibility.

Key changes:

  • Extended GPU metric categorization to support Intel Gaudi metrics alongside NVIDIA DCGM metrics
  • Added automatic vendor detection and fallback queries for cross-vendor compatibility (see the sketch after this list)
  • Introduced a comprehensive Intel Gaudi metric discovery function
  • Updated documentation with the multi-vendor architecture and metric mapping
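
Below is a minimal sketch of the vendor-fallback query pattern mentioned above, assuming a Prometheus-compatible HTTP API. The endpoint URL and helper function are hypothetical; only the PromQL strings come from this PR.

```python
import requests

# Hypothetical Prometheus/Thanos endpoint -- not part of this PR.
PROM_URL = "http://prometheus.example:9090"

# PromQL `or` returns the NVIDIA DCGM series when it exists and falls back to
# the Intel Gaudi (habanalabs) series otherwise; queries taken from this PR.
FALLBACK_QUERIES = {
    "GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)",
    "GPU Power Usage (Watts)": "avg(DCGM_FI_DEV_POWER_USAGE) or avg(habanalabs_power_mW) / 1000",
}

def query_metric(friendly_name: str) -> float | None:
    """Run the vendor-agnostic query and return the first sample value, if any."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": FALLBACK_QUERIES[friendly_name]},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json().get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None
```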

Checklist

  • Verify on the cluster
  • Update tests if applicable and run pytest
  • Add screenshots (if applicable)
  • Update readme (if applicable)

Copilot AI review requested due to automatic review settings November 5, 2025 17:38
Copilot AI (Contributor) left a comment


Pull Request Overview


Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/core/promql_service.py | Extended GPU metric categorization to recognize and process both NVIDIA DCGM and Intel Gaudi metrics |
| src/core/metrics.py | Added Intel Gaudi metric discovery, updated GPU info detection for multi-vendor support, and modified vLLM/OpenShift metrics to use vendor-agnostic queries |
| docs/OBSERVABILITY_OVERVIEW.md | Added comprehensive multi-vendor GPU support documentation including architecture, vendor detection, and query examples |
| docs/INTEL_GAUDI_METRICS.md | New documentation detailing Intel Gaudi metrics, mapping to NVIDIA equivalents, and integration details |


@redhatHameed force-pushed the intel-metrics branch 2 times, most recently from 1ddc562 to 0190157, on November 5, 2025 18:09
@redhatHameed changed the title from "added document proposal Intel Gaudi accelerator metrics integration" to "added implementation proposal Intel Gaudi accelerator metrics integration" on Nov 6, 2025
Copilot AI review requested due to automatic review settings November 10, 2025 18:23
Copilot AI (Contributor) left a comment


Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.



"GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)",
"GPU Memory Usage (GB)": "avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)",
"GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)",
"GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP) or avg(habanalabs_temperature_threshold_memory)",

Copilot AI Nov 10, 2025


The metric habanalabs_temperature_threshold_memory is a temperature threshold, not an actual temperature reading. This is being used as a fallback for DCGM_FI_DEV_MEMORY_TEMP which measures actual memory temperature. This creates a semantic mismatch where a threshold value would be displayed as if it were a current temperature measurement. Consider either finding the actual memory temperature metric for Intel Gaudi or removing this fallback.

Suggested change:
- "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP) or avg(habanalabs_temperature_threshold_memory)",
+ "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP)",

"habanalabs_energy": {
"type": "counter",
"description": "Intel Gaudi energy consumption",
"aggregation": "avg",

Copilot AI Nov 10, 2025


Counter metrics like habanalabs_energy, habanalabs_pcie_rx, and habanalabs_pcie_tx should use rate() or increase() functions in queries, not avg(). Using avg() on a counter will give you the average of cumulative values, which doesn't provide meaningful insights. The aggregation should be "rate" or "increase" to properly handle counter semantics.

Suggested change:
- "aggregation": "avg",
+ "aggregation": "rate",

Comment on lines +517 to +520
metric_name = metric_expr.split()[0] if " " not in metric_expr and "/" not in metric_expr else None
if metric_name and metric_name in all_metrics:
metric_mapping[friendly_name] = f"avg({metric_name})"


Copilot AI Nov 10, 2025


The logic for parsing metric_expr is fragile and may fail for expressions with multiple operators. For example, habanalabs_temperature_threshold_memory doesn't contain / or multiple spaces, so metric_expr.split()[0] would just return the full metric name. However, the condition " " not in metric_expr and "/" not in metric_expr would be True, causing the split to happen unnecessarily. Consider simplifying this logic to check if the expression contains operators like / to determine if it's an expression or a simple metric name.

Suggested change:
- metric_name = metric_expr.split()[0] if " " not in metric_expr and "/" not in metric_expr else None
- if metric_name and metric_name in all_metrics:
- metric_mapping[friendly_name] = f"avg({metric_name})"
+ if re.match(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$", metric_expr) and metric_expr in all_metrics:
+ metric_mapping[friendly_name] = f"avg({metric_expr})"

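A self-contained sketch of this suggestion, using a hypothetical helper name; the regex is the standard Prometheus metric-name grammar.

```python
import re

# Prometheus metric-name grammar; anything else is treated as an expression.
_METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

def bare_metric_name(metric_expr: str) -> str | None:
    """Return the expression if it is a plain metric name, otherwise None."""
    return metric_expr if _METRIC_NAME_RE.match(metric_expr) else None

# Plain names pass; arithmetic expressions such as the mW-to-W conversion do not.
assert bare_metric_name("habanalabs_temperature_threshold_memory") is not None
assert bare_metric_name("habanalabs_power_mW / 1000") is None
```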
| **Temperature** | `habanalabs_temperature_onchip` | `DCGM_FI_DEV_GPU_TEMP` | Celsius | On-chip/GPU temperature |
| | `habanalabs_temperature_onboard` | N/A | Celsius | Board temperature |
| | `habanalabs_temperature_threshold_gpu` | N/A | Celsius | GPU temperature threshold |
| | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature threshold |

Copilot AI Nov 10, 2025


The Intel Gaudi metric habanalabs_temperature_threshold_memory represents a temperature threshold, not the actual memory temperature. Using this in a comparison with DCGM_FI_DEV_MEMORY_TEMP (which is actual temperature) creates a semantic mismatch. Consider either documenting this difference clearly or using a different metric pairing that represents the same type of measurement.

Suggested change:
- | | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature threshold |
+ | | `habanalabs_temperature_threshold_memory` | N/A | Celsius | Memory temperature threshold (threshold value, not real-time) |
+ | | N/A | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature (real-time, no Gaudi equivalent) |

"GPU Power Usage (Watts)": "avg(DCGM_FI_DEV_POWER_USAGE) or avg(habanalabs_power_mW) / 1000",
"GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)",
"GPU Memory Usage (GB)": "avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)",
"GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)",

Copilot AI Nov 10, 2025


Using avg() on the counter metric habanalabs_energy is incorrect. Counter metrics should be queried with rate() or increase() to get meaningful rate-of-change values. Using avg() on a monotonically increasing counter will just give you the average of cumulative values. Consider using rate(habanalabs_energy[5m]) or documenting that this represents total accumulated energy rather than consumption rate.

Suggested change:
- "GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)",
+ "GPU Energy Consumption (Joules)": "avg(rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m])) or avg(rate(habanalabs_energy[5m]))",

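If the "Joules" label is kept, one option (a sketch, not part of the PR, and assuming the counter is reported in Joules as the label suggests) is increase() over a window, which yields energy used in that window; rate() instead yields Joules per second, i.e. Watts.

```python
# Sketch only: two ways to query the monotonically increasing energy counter.
GAUDI_ENERGY_QUERIES = {
    # Energy consumed during the last 5 minutes (keeps the Joules label honest).
    "GPU Energy Used over 5m (Joules)": "sum(increase(habanalabs_energy[5m]))",
    # Per-second rate of the same counter, i.e. average power in Watts.
    "GPU Power from Energy Counter (Watts)": "avg(rate(habanalabs_energy[5m]))",
}
```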
info["models"] = ["GPU"]
nvidia_count = len(result)
if nvidia_count > 0:
temps = [float(series.get("value", [None, None])[1]) for series in result if series.get("value")]

Copilot AI Nov 10, 2025


The list comprehension can raise ValueError or TypeError if the value at index 1 cannot be converted to float (e.g., if it's None, a string, or missing). Consider adding error handling or a validation check: temps = [float(v[1]) for series in result if (v := series.get("value")) and len(v) > 1 and v[1] is not None] or wrapping in a try-except within the comprehension.

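One way to expand this suggestion into a reusable helper (hypothetical name, not from the PR): skip samples whose value is missing or not parseable instead of letting the comprehension raise.

```python
def safe_temps(result: list[dict]) -> list[float]:
    """Collect parseable sample values, skipping missing or non-numeric ones."""
    temps: list[float] = []
    for series in result:
        value = series.get("value")
        if not value or len(value) < 2 or value[1] is None:
            continue
        try:
            temps.append(float(value[1]))
        except (TypeError, ValueError):
            continue
    return temps

# A well-formed sample is kept; a missing value and a non-numeric value are skipped.
assert safe_temps([{"value": [0, "42.5"]}, {}, {"value": [0, "n/a"]}]) == [42.5]
```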
result = resp.json().get("data", {}).get("result", [])
intel_count = len(result)
if intel_count > 0:
temps = [float(series.get("value", [None, None])[1]) for series in result if series.get("value")]

Copilot AI Nov 10, 2025


The list comprehension can raise ValueError or TypeError if the value at index 1 cannot be converted to float (e.g., if it's None, a string, or missing). Consider adding error handling or a validation check: temps = [float(v[1]) for series in result if (v := series.get("value")) and len(v) > 1 and v[1] is not None] or wrapping in a try-except within the comprehension.

Comment on lines +777 to +783
gaudi_mapping["GPU Energy Consumption (Joules)"] = f"avg({metric})"

# PCIe metrics
elif metric == "habanalabs_pcie_rx":
gaudi_mapping["PCIe RX Traffic (bytes)"] = f"avg({metric})"
elif metric == "habanalabs_pcie_tx":
gaudi_mapping["PCIe TX Traffic (bytes)"] = f"avg({metric})"

Copilot AI Nov 10, 2025


Using avg() on counter metrics habanalabs_energy, habanalabs_pcie_rx, and habanalabs_pcie_tx is incorrect. Counter metrics represent monotonically increasing values and should be queried with rate() or increase() to get meaningful rate-of-change information. For example, use rate(habanalabs_energy[5m]) for energy consumption rate, and rate(habanalabs_pcie_rx[5m]) for PCIe receive rate.

Suggested change:
- gaudi_mapping["GPU Energy Consumption (Joules)"] = f"avg({metric})"
- # PCIe metrics
- elif metric == "habanalabs_pcie_rx":
- gaudi_mapping["PCIe RX Traffic (bytes)"] = f"avg({metric})"
- elif metric == "habanalabs_pcie_tx":
- gaudi_mapping["PCIe TX Traffic (bytes)"] = f"avg({metric})"
+ gaudi_mapping["GPU Energy Consumption (Joules)"] = f"rate({metric}[5m])"
+ # PCIe metrics
+ elif metric == "habanalabs_pcie_rx":
+ gaudi_mapping["PCIe RX Traffic (bytes)"] = f"rate({metric}[5m])"
+ elif metric == "habanalabs_pcie_tx":
+ gaudi_mapping["PCIe TX Traffic (bytes)"] = f"rate({metric}[5m])"

Comment on lines +503 to +504
# If no NVIDIA metrics, try Intel Gaudi metrics
if not metric_mapping:

Copilot AI Nov 10, 2025


The fallback logic for Intel Gaudi metrics only activates if not metric_mapping, which means it will skip Intel Gaudi metrics entirely if any NVIDIA metrics were found. This could be problematic if a cluster has both metric types available but some NVIDIA metrics are present while corresponding Intel Gaudi metrics should be used for specific devices. Consider checking for specific metric presence rather than assuming all-or-nothing based on whether any NVIDIA metrics exist.

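A per-metric merge along the lines this comment describes might look like the sketch below (hypothetical helper, not from the PR): NVIDIA queries win where present and Gaudi queries fill only the gaps.

```python
def merge_vendor_mappings(nvidia: dict[str, str], gaudi: dict[str, str]) -> dict[str, str]:
    """Prefer NVIDIA queries, but fall back per friendly name to Gaudi queries."""
    merged = dict(nvidia)
    for friendly_name, query in gaudi.items():
        merged.setdefault(friendly_name, query)
    return merged

# Example: utilization comes from DCGM, power falls back to the Gaudi exporter.
merged = merge_vendor_mappings(
    {"GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL)"},
    {"GPU Utilization (%)": "avg(habanalabs_utilization)",
     "GPU Power Usage (Watts)": "avg(habanalabs_power_mW) / 1000"},
)
assert merged["GPU Power Usage (Watts)"] == "avg(habanalabs_power_mW) / 1000"
```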
"GPU Power Usage (Watts)": "habanalabs_power_mW / 1000",
"GPU Memory Usage (GB)": "habanalabs_memory_used_bytes / (1024*1024*1024)",
"GPU Energy Consumption (Joules)": "habanalabs_energy",
"GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory",

Copilot AI Nov 10, 2025


The metric habanalabs_temperature_threshold_memory is a temperature threshold, not an actual temperature reading. This is being used as a fallback for DCGM_FI_DEV_MEMORY_TEMP which measures actual memory temperature. This creates a semantic mismatch where a threshold value would be displayed as if it were a current temperature measurement. Consider either finding the actual memory temperature metric for Intel Gaudi or removing this fallback.

Suggested change:
- "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory",
+ # "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory", # Removed: threshold, not actual temp
