Added implementation proposal: Intel Gaudi accelerator metrics integration #156
base: dev
Conversation
Pull Request Overview
This PR adds multi-vendor GPU/accelerator support to the observability stack, enabling monitoring of both NVIDIA GPUs (via DCGM) and Intel Gaudi accelerators (via Habana Labs exporter). The changes allow the system to automatically detect and use vendor-specific metrics while maintaining backward compatibility.
Key changes:
- Extended GPU metric categorization to support Intel Gaudi metrics alongside NVIDIA DCGM metrics
- Added automatic vendor detection and fallback queries for cross-vendor compatibility (see the detection sketch after this list)
- Introduced comprehensive Intel Gaudi metric discovery function
- Updated documentation with multi-vendor architecture and metric mapping
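As a rough illustration of the vendor detection described above (not the PR's actual implementation), a minimal sketch that probes Prometheus for a DCGM series and falls back to the Habana Labs exporter; the `PROM_URL` endpoint and helper name are assumptions.

```python
# Minimal sketch only: assumes a Prometheus HTTP API reachable at PROM_URL.
import requests

PROM_URL = "http://prometheus.example:9090"  # assumed endpoint, not from the PR

def detect_gpu_vendor() -> str:
    """Return 'nvidia', 'intel-gaudi', or 'unknown' based on which exporter has series."""
    def has_series(metric: str) -> bool:
        resp = requests.get(
            f"{PROM_URL}/api/v1/query", params={"query": metric}, timeout=5
        )
        return bool(resp.json().get("data", {}).get("result", []))

    if has_series("DCGM_FI_DEV_GPU_UTIL"):
        return "nvidia"        # NVIDIA DCGM exporter is present
    if has_series("habanalabs_utilization"):
        return "intel-gaudi"   # Habana Labs exporter is present
    return "unknown"
```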
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/core/promql_service.py | Extended GPU metric categorization to recognize and process both NVIDIA DCGM and Intel Gaudi metrics |
| src/core/metrics.py | Added Intel Gaudi metric discovery, updated GPU info detection for multi-vendor support, and modified vLLM/OpenShift metrics to use vendor-agnostic queries |
| docs/OBSERVABILITY_OVERVIEW.md | Added comprehensive multi-vendor GPU support documentation including architecture, vendor detection, and query examples |
| docs/INTEL_GAUDI_METRICS.md | New documentation detailing Intel Gaudi metrics, mapping to NVIDIA equivalents, and integration details |
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.
| "GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)", | ||
| "GPU Memory Usage (GB)": "avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)", | ||
| "GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)", | ||
| "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP) or avg(habanalabs_temperature_threshold_memory)", |
Copilot AI · Nov 10, 2025
The metric habanalabs_temperature_threshold_memory is a temperature threshold, not an actual temperature reading. This is being used as a fallback for DCGM_FI_DEV_MEMORY_TEMP which measures actual memory temperature. This creates a semantic mismatch where a threshold value would be displayed as if it were a current temperature measurement. Consider either finding the actual memory temperature metric for Intel Gaudi or removing this fallback.
| "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP) or avg(habanalabs_temperature_threshold_memory)", | |
| "GPU Memory Temperature (°C)": "avg(DCGM_FI_DEV_MEMORY_TEMP)", |
| "habanalabs_energy": { | ||
| "type": "counter", | ||
| "description": "Intel Gaudi energy consumption", | ||
| "aggregation": "avg", |
Copilot AI · Nov 10, 2025
Counter metrics like habanalabs_energy, habanalabs_pcie_rx, and habanalabs_pcie_tx should use rate() or increase() functions in queries, not avg(). Using avg() on a counter will give you the average of cumulative values, which doesn't provide meaningful insights. The aggregation should be "rate" or "increase" to properly handle counter semantics.
| "aggregation": "avg", | |
| "aggregation": "rate", |
metric_name = metric_expr.split()[0] if " " not in metric_expr and "/" not in metric_expr else None
if metric_name and metric_name in all_metrics:
    metric_mapping[friendly_name] = f"avg({metric_name})"
Copilot AI · Nov 10, 2025
The logic for parsing metric_expr is fragile and may fail for expressions with multiple operators. For example, habanalabs_temperature_threshold_memory doesn't contain / or multiple spaces, so metric_expr.split()[0] would just return the full metric name. However, the condition " " not in metric_expr and "/" not in metric_expr would be True, causing the split to happen unnecessarily. Consider simplifying this logic to check if the expression contains operators like / to determine if it's an expression or a simple metric name.
Suggested change:
- metric_name = metric_expr.split()[0] if " " not in metric_expr and "/" not in metric_expr else None
- if metric_name and metric_name in all_metrics:
-     metric_mapping[friendly_name] = f"avg({metric_name})"
+ if re.match(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$", metric_expr) and metric_expr in all_metrics:
+     metric_mapping[friendly_name] = f"avg({metric_expr})"
| **Temperature** | `habanalabs_temperature_onchip` | `DCGM_FI_DEV_GPU_TEMP` | Celsius | On-chip/GPU temperature |
| | `habanalabs_temperature_onboard` | N/A | Celsius | Board temperature |
| | `habanalabs_temperature_threshold_gpu` | N/A | Celsius | GPU temperature threshold |
| | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature threshold |
Copilot AI · Nov 10, 2025
The Intel Gaudi metric habanalabs_temperature_threshold_memory represents a temperature threshold, not the actual memory temperature. Using this in a comparison with DCGM_FI_DEV_MEMORY_TEMP (which is actual temperature) creates a semantic mismatch. Consider either documenting this difference clearly or using a different metric pairing that represents the same type of measurement.
Suggested change:
- | | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature threshold |
+ | | `habanalabs_temperature_threshold_memory` | N/A | Celsius | Memory temperature threshold (threshold value, not real-time) |
+ | | N/A | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Memory temperature (real-time, no Gaudi equivalent) |
| "GPU Power Usage (Watts)": "avg(DCGM_FI_DEV_POWER_USAGE) or avg(habanalabs_power_mW) / 1000", | ||
| "GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)", | ||
| "GPU Memory Usage (GB)": "avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)", | ||
| "GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)", |
Copilot AI · Nov 10, 2025
Using avg() on the counter metric habanalabs_energy is incorrect. Counter metrics should be queried with rate() or increase() to get meaningful rate-of-change values. Using avg() on a monotonically increasing counter will just give you the average of cumulative values. Consider using rate(habanalabs_energy[5m]) or documenting that this represents total accumulated energy rather than consumption rate.
| "GPU Energy Consumption (Joules)": "avg(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) or avg(habanalabs_energy)", | |
| "GPU Energy Consumption (Joules)": "avg(rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION[5m])) or avg(rate(habanalabs_energy[5m]))", |
| info["models"] = ["GPU"] | ||
| nvidia_count = len(result) | ||
| if nvidia_count > 0: | ||
| temps = [float(series.get("value", [None, None])[1]) for series in result if series.get("value")] |
Copilot AI · Nov 10, 2025
The list comprehension can raise ValueError or TypeError if the value at index 1 cannot be converted to float (e.g., if it's None, a string, or missing). Consider adding error handling or a validation check: temps = [float(v[1]) for series in result if (v := series.get("value")) and len(v) > 1 and v[1] is not None] or wrapping in a try-except within the comprehension.
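A minimal sketch of the safer parsing described above; `result` mimics the shape of a Prometheus instant-query response and is illustrative only:

```python
# Illustrative Prometheus instant-query result, including malformed samples.
result = [
    {"metric": {"gpu": "0"}, "value": [1699999999.0, "41"]},
    {"metric": {"gpu": "1"}, "value": [1699999999.0, None]},  # null sample value
    {"metric": {"gpu": "2"}},                                 # missing "value" entirely
]

temps = []
for series in result:
    value = series.get("value")
    if not value or len(value) < 2 or value[1] is None:
        continue
    try:
        temps.append(float(value[1]))
    except (TypeError, ValueError):
        continue  # skip non-numeric samples instead of raising

print(temps)  # [41.0]
```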
result = resp.json().get("data", {}).get("result", [])
intel_count = len(result)
if intel_count > 0:
    temps = [float(series.get("value", [None, None])[1]) for series in result if series.get("value")]
Copilot AI · Nov 10, 2025
The list comprehension can raise ValueError or TypeError if the value at index 1 cannot be converted to float (e.g., if it's None, a string, or missing). Consider adding error handling or a validation check: temps = [float(v[1]) for series in result if (v := series.get("value")) and len(v) > 1 and v[1] is not None] or wrapping in a try-except within the comprehension.
| gaudi_mapping["GPU Energy Consumption (Joules)"] = f"avg({metric})" | ||
|
|
||
| # PCIe metrics | ||
| elif metric == "habanalabs_pcie_rx": | ||
| gaudi_mapping["PCIe RX Traffic (bytes)"] = f"avg({metric})" | ||
| elif metric == "habanalabs_pcie_tx": | ||
| gaudi_mapping["PCIe TX Traffic (bytes)"] = f"avg({metric})" |
Copilot AI · Nov 10, 2025
Using avg() on counter metrics habanalabs_energy, habanalabs_pcie_rx, and habanalabs_pcie_tx is incorrect. Counter metrics represent monotonically increasing values and should be queried with rate() or increase() to get meaningful rate-of-change information. For example, use rate(habanalabs_energy[5m]) for energy consumption rate, and rate(habanalabs_pcie_rx[5m]) for PCIe receive rate.
| gaudi_mapping["GPU Energy Consumption (Joules)"] = f"avg({metric})" | |
| # PCIe metrics | |
| elif metric == "habanalabs_pcie_rx": | |
| gaudi_mapping["PCIe RX Traffic (bytes)"] = f"avg({metric})" | |
| elif metric == "habanalabs_pcie_tx": | |
| gaudi_mapping["PCIe TX Traffic (bytes)"] = f"avg({metric})" | |
| gaudi_mapping["GPU Energy Consumption (Joules)"] = f"rate({metric}[5m])" | |
| # PCIe metrics | |
| elif metric == "habanalabs_pcie_rx": | |
| gaudi_mapping["PCIe RX Traffic (bytes)"] = f"rate({metric}[5m])" | |
| elif metric == "habanalabs_pcie_tx": | |
| gaudi_mapping["PCIe TX Traffic (bytes)"] = f"rate({metric}[5m])" |
# If no NVIDIA metrics, try Intel Gaudi metrics
if not metric_mapping:
Copilot AI · Nov 10, 2025
The fallback logic for Intel Gaudi metrics only activates if not metric_mapping, which means it will skip Intel Gaudi metrics entirely if any NVIDIA metrics were found. This could be problematic if a cluster has both metric types available but some NVIDIA metrics are present while corresponding Intel Gaudi metrics should be used for specific devices. Consider checking for specific metric presence rather than assuming all-or-nothing based on whether any NVIDIA metrics exist.
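One possible shape for a per-metric merge, reusing the `metric_mapping` and `gaudi_mapping` names from the diffs above; the merge itself is only a sketch, not the PR's implementation:

```python
# Sketch: per-metric merge instead of the all-or-nothing fallback.
# The example dicts mirror the shape used in the PR's diffs.
metric_mapping = {  # NVIDIA queries already discovered
    "GPU Utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL)",
}
gaudi_mapping = {   # Gaudi queries discovered from the exporter
    "GPU Utilization (%)": "avg(habanalabs_utilization)",
    "GPU Power Usage (Watts)": "avg(habanalabs_power_mW) / 1000",
}

# Keep existing NVIDIA queries; add Gaudi queries only where nothing was found,
# so mixed clusters get per-metric coverage rather than all-or-nothing.
for friendly_name, query in gaudi_mapping.items():
    metric_mapping.setdefault(friendly_name, query)

print(metric_mapping["GPU Power Usage (Watts)"])  # avg(habanalabs_power_mW) / 1000
```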
| "GPU Power Usage (Watts)": "habanalabs_power_mW / 1000", | ||
| "GPU Memory Usage (GB)": "habanalabs_memory_used_bytes / (1024*1024*1024)", | ||
| "GPU Energy Consumption (Joules)": "habanalabs_energy", | ||
| "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory", |
Copilot AI · Nov 10, 2025
The metric habanalabs_temperature_threshold_memory is a temperature threshold, not an actual temperature reading. This is being used as a fallback for DCGM_FI_DEV_MEMORY_TEMP which measures actual memory temperature. This creates a semantic mismatch where a threshold value would be displayed as if it were a current temperature measurement. Consider either finding the actual memory temperature metric for Intel Gaudi or removing this fallback.
| "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory", | |
| # "GPU Memory Temperature (°C)": "habanalabs_temperature_threshold_memory", # Removed: threshold, not actual temp |
Changes
This PR adds multi-vendor GPU/accelerator support to the observability stack: monitoring of NVIDIA GPUs (via DCGM) and Intel Gaudi accelerators (via the Habana Labs exporter), with automatic vendor detection, cross-vendor fallback queries, an Intel Gaudi metric discovery function, and updated multi-vendor documentation, while maintaining backward compatibility. See the Pull Request Overview above for the detailed list of changes.
Checklist
pytest