Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions deploy/helm/rag/Chart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ appVersion: "1.16.0"

dependencies:
- name: llm-service
version: 0.1.0
version: 0.5.4
repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
- name: llama-stack
version: 0.2.22
version: 0.6.0
repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
- name: pgvector
version: 0.1.0
version: 0.5.0
repository: https://rh-ai-quickstart.github.io/ai-architecture-charts
381 changes: 381 additions & 0 deletions docs/INTEL_GAUDI_METRICS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,381 @@
# Intel Gaudi Accelerator Metrics

## Overview

This document describes the Intel Gaudi accelerator metrics integration with the OpenShift AI Observability Summarizer. Intel Gaudi accelerators are monitored via the Habana Labs Prometheus metric exporter, which exposes metrics under the `habanalabs_` prefix.

**Metrics Implementation**: This observability stack implements **24 core Intel Gaudi metrics** that are essential for monitoring accelerator health, performance, and utilization. These metrics are automatically discovered, categorized, and integrated into dashboards alongside NVIDIA DCGM metrics for multi-vendor support.

The monitoring stack has been designed to support multi-vendor GPU/accelerator deployments, allowing different clusters to use either NVIDIA GPUs or Intel Gaudi accelerators (or potentially both, though different clusters typically use different accelerator types).

## Intel Gaudi Prometheus Exporter

### Deployment

The Intel Gaudi Prometheus exporter is deployed as a DaemonSet in Kubernetes/OpenShift clusters with Intel Gaudi accelerators. It exposes metrics on port **41611** using `hostNetwork: true`.

**Documentation**: [Intel Gaudi Prometheus Metric Exporter](https://docs.habana.ai/en/latest/Orchestration/Prometheus_Metric_Exporter.html)

### Enabling Monitoring for User-Defined Projects

In OpenShift Container Platform, Intel Gaudi metrics are collected through user workload monitoring. You must enable monitoring for user-defined projects to collect these metrics.

**Prerequisites:**
- You have access to the cluster as a user with the `cluster-admin` role
- Intel Gaudi Prometheus exporter is deployed in your namespace
- You have installed the OpenShift CLI (`oc`)

**Procedure:**

1. **Enable user workload monitoring** by editing the cluster monitoring ConfigMap:

```bash
oc -n openshift-monitoring edit configmap cluster-monitoring-config
```

2. **Add or update** the `enableUserWorkload` setting under `data/config.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
enableUserWorkload: true
```
3. **Save the file** to apply the changes. User workload monitoring is enabled automatically.
4. **Verify** that the monitoring components are running:
```bash
oc -n openshift-user-workload-monitoring get pod
```

You should see pods like `prometheus-user-workload`, `prometheus-operator`, and `thanos-ruler-user-workload`.

> **Note**: The Habana AI metric exporter (`habana-ai-metric-exporter-ds`) is deployed as a DaemonSet in the `habana-ai-operator` namespace. It includes its own ServiceMonitor (`metric-exporter`) for metric collection and uses standard Kubernetes labels (`app.kubernetes.io/name=habana-ai`).
**Reference**: [OpenShift Container Platform - Enabling monitoring for user-defined projects](https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/monitoring/configuring-user-workload-monitoring#enabling-monitoring-for-user-defined-projects-uwm_preparing-to-configure-the-monitoring-stack-uwm)

### Accessing Metrics

Once user workload monitoring is enabled, metrics are automatically collected and can be queried through:

1. **OpenShift Console**: Navigate to **Observe > Metrics** and use PromQL queries
2. **Thanos Querier API**: Query programmatically using the Thanos endpoint
3. **Grafana**: If deployed, configure Thanos as a data source

### Querying Intel Gaudi Metrics

**Port-Forward Thanos Querier to access metrics in your browser:**

```bash
# Port-forward Thanos Querier to localhost
oc port-forward -n openshift-monitoring svc/thanos-querier 9090:9091
```

Then open your browser and navigate to: **http://localhost:9090**

**To explore Intel Gaudi metrics in the browser:**
1. In the query box, type: `habanalabs_` and press Tab for autocomplete
2. Select a metric like `habanalabs_temperature_onchip`
3. Click **Execute** to see the current values
4. Click **Graph** tab to see trends over time
5. Use filters like `habanalabs_temperature_onchip{namespace="your-namespace"}` to scope results

**Example queries to try:**
- `habanalabs_temperature_onchip` - Current GPU temperature
- `habanalabs_utilization` - GPU utilization percentage
- `habanalabs_memory_used_bytes / habanalabs_memory_total_bytes * 100` - Memory usage %
- `rate(habanalabs_energy[5m])` - Energy consumption rate

## Intel Gaudi Metrics Catalog

### Metrics Implemented in This Observability Stack

The following Intel Gaudi metrics are actively collected and used by this observability stack. These metrics are automatically discovered and integrated into dashboards and queries.

| Category | Intel Gaudi Metric | NVIDIA DCGM Equivalent | Unit | Used In |
|----------|-------------------|----------------------|------|---------|
| **Temperature** | `habanalabs_temperature_onchip` | `DCGM_FI_DEV_GPU_TEMP` | Celsius | Dashboards, Fleet Overview |
| | `habanalabs_temperature_onboard` | N/A | Celsius | Discovery Function |
| | `habanalabs_temperature_threshold_gpu` | N/A | Celsius | Discovery Function |
| | `habanalabs_temperature_threshold_memory` | `DCGM_FI_DEV_MEMORY_TEMP` | Celsius | Dashboards |
| **Power** | `habanalabs_power_mW` | `DCGM_FI_DEV_POWER_USAGE` | mW (÷1000 for Watts) | Dashboards, Fleet Overview |
| | `habanalabs_power_default_limit_mW` | N/A | mW | Discovery Function |
| **Utilization** | `habanalabs_utilization` | `DCGM_FI_DEV_GPU_UTIL` | Percent | Dashboards, vLLM Monitoring |
| **Memory** | `habanalabs_memory_used_bytes` | `DCGM_FI_DEV_FB_USED` | Bytes | Dashboards (converted to GB) |
| | `habanalabs_memory_total_bytes` | `DCGM_FI_DEV_FB_TOTAL` | Bytes | Memory % calculation |
| | `habanalabs_memory_free_bytes` | N/A | Bytes | Discovery Function |
| **Clock Speed** | `habanalabs_clock_soc_mhz` | `DCGM_FI_DEV_SM_CLOCK` | MHz | Discovery Function |
| | `habanalabs_clock_soc_max_mhz` | N/A | MHz | Discovery Function |
| **Energy** | `habanalabs_energy` | `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` | Joules | Dashboards |
| **PCIe** | `habanalabs_pcie_rx` | N/A | Bytes | Discovery Function |
| | `habanalabs_pcie_tx` | N/A | Bytes | Discovery Function |
| | `habanalabs_pcie_receive_throughput` | N/A | Throughput | Discovery Function |
| | `habanalabs_pcie_transmit_throughput` | N/A | Throughput | Discovery Function |
| | `habanalabs_pcie_replay_count` | N/A | Count | Discovery Function |
| | `habanalabs_pci_link_speed` | N/A | Speed | Discovery Function |
| | `habanalabs_pci_link_width` | N/A | Width | Discovery Function |
| **ECC/Health** | `habanalabs_ecc_feature_mode` | N/A | Status | Discovery Function |
| | `habanalabs_pending_rows_with_single_bit_ecc_errors` | N/A | Count | Discovery Function |
| | `habanalabs_pending_rows_with_double_bit_ecc_errors` | N/A | Count | Discovery Function |
| **Network** | `habanalabs_nic_port_status` | N/A | Status | Discovery Function |

**Total: 24 metrics actively used** by the observability stack.

### Additional Available Metrics

The Intel Gaudi Prometheus exporter exposes additional metrics that are not currently used by this observability stack but are available for custom queries and dashboards:

**Temperature Thresholds (Not Currently Used):**
- `habanalabs_temperature_threshold_shutdown` - Temperature at which device shuts down
- `habanalabs_temperature_threshold_slowdown` - Temperature at which device slows down

**Memory Health (Not Currently Used):**
- `habanalabs_pending_rows_state` - Number of memory rows in pending state

**Device Information (Not Currently Used):**
- `habanalabs_device_config` - Device configuration metadata

> **Note**: These metrics are available from the Intel Gaudi exporter but are not actively integrated into the default dashboards or metric discovery functions. You can query them manually if needed for specific use cases.
## Query Examples

### Basic Monitoring Queries

#### Average GPU Temperature Across All Gaudi Accelerators
```promql
avg(habanalabs_temperature_onchip)
```

#### GPU Utilization Per Device
```promql
habanalabs_utilization
```

#### Power Usage in Watts (Convert from mW)
```promql
avg(habanalabs_power_mW) / 1000
```

#### Memory Usage Percentage
```promql
(avg(habanalabs_memory_used_bytes) / avg(habanalabs_memory_total_bytes)) * 100
```

#### Memory Usage in GB
```promql
avg(habanalabs_memory_used_bytes) / (1024*1024*1024)
```

### Advanced Monitoring Queries

#### PCIe Throughput Rate
```promql
rate(habanalabs_pcie_rx[5m]) + rate(habanalabs_pcie_tx[5m])
```

#### Energy Consumption Rate
```promql
rate(habanalabs_energy[1h])
```

#### ECC Errors Total
```promql
sum(habanalabs_pending_rows_with_single_bit_ecc_errors) + sum(habanalabs_pending_rows_with_double_bit_ecc_errors)
```

#### High Temperature Alert Query
```promql
habanalabs_temperature_onchip > 80
```

#### High Power Usage Alert Query
```promql
habanalabs_power_mW > 400000 # 400W
```

### Multi-Vendor Queries

The monitoring stack supports vendor-agnostic queries that work with both NVIDIA and Intel Gaudi:

#### GPU Temperature (Any Vendor)
```promql
avg(DCGM_FI_DEV_GPU_TEMP) or avg(habanalabs_temperature_onchip)
```

#### GPU Utilization (Any Vendor)
```promql
avg(DCGM_FI_DEV_GPU_UTIL) or avg(habanalabs_utilization)
```

#### GPU Memory Usage in GB (Any Vendor)
```promql
avg(DCGM_FI_DEV_FB_USED) / (1024*1024*1024) or avg(habanalabs_memory_used_bytes) / (1024*1024*1024)
```

## Integration with Observability Stack

### Automatic Discovery

The observability stack automatically discovers available Intel Gaudi metrics through the `discover_intel_gaudi_metrics()` function in `src/core/metrics.py`. This function:

1. Queries Prometheus for all metrics with the `habanalabs_` prefix
2. Maps them to friendly names for dashboard display
3. Configures appropriate aggregations and units
4. Returns a dictionary of available metrics

### Dashboard Integration

Intel Gaudi metrics are integrated into existing dashboards:

- **Fleet Overview**: Shows GPU utilization and temperature for available accelerators
- **GPU & Accelerators**: Comprehensive accelerator monitoring section with vendor-agnostic queries
- **vLLM Monitoring**: Automatically uses Intel Gaudi metrics when NVIDIA metrics are unavailable

### Metric Categorization

Intel Gaudi metrics are categorized in `src/core/promql_service.py`:

- **Type**: `gpu`
- **Categories**: `["hardware_metric", "gpu_metric", "intel_gaudi_metric"]`
- **Vendor Detection**: Automatic based on `habanalabs_` prefix
- **Unit Conversion**: Power converted from mW to W for consistency

## Troubleshooting

### No Metrics Available

If Intel Gaudi metrics are not appearing:

1. **Verify user workload monitoring is enabled**:
```bash
oc -n openshift-user-workload-monitoring get pods
```
You should see `prometheus-user-workload` pods running.

2. **Check Intel Gaudi exporter deployment**:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the exporter get automatically deployed when the user workload monitoring is enabled? Does it need to be added to make install?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not required to be added to the makefile , this is perquisites configuration when Intel Gaudi Base Operator or related component installed. the steps added for troubleshooting purpose incase the application is not showing the metrics.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

```bash
# The Habana AI metric exporter is deployed as a DaemonSet in the habana-ai-operator namespace
# Pod names follow the pattern: habana-ai-metric-exporter-ds-xxxxx
oc get pods -n habana-ai-operator -l app.kubernetes.io/name=habana-ai
```

3. **Verify ServiceMonitor exists**:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: is the service monitor automatically deployed in the cluster?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this should be deployed already

oc get servicemonitor -n habana-ai-operator
NAME                             AGE
bmc-monitoring-service-monitor   28d
metric-exporter                  28d

I have updated the document accordingly as well

```bash
# Check if the metric-exporter ServiceMonitor is deployed
oc get servicemonitor -n habana-ai-operator
```
You should see `metric-exporter` in the list. This ServiceMonitor configures Prometheus to scrape Intel Gaudi metrics.

4. **Check if metrics are exposed**:
```bash
# Get the exporter pod name
POD=$(oc get pods -n habana-ai-operator -l app.kubernetes.io/name=habana-ai -o name | head -n 1)

# Port-forward to the exporter (uses port 41611)
oc port-forward -n habana-ai-operator $POD 41611:41611 &

# Query metrics directly
curl http://localhost:41611/metrics | grep habanalabs

# Stop port-forward
pkill -f "port-forward.*41611"
```

5. **Review exporter logs**:
```bash
oc logs -n habana-ai-operator -l app.kubernetes.io/name=habana-ai
```

### Metrics Not Discovered by Observability Stack

If metrics are available but not appearing in the observability stack:

1. **Check Thanos can query the metrics**:
```bash
# Port-forward Thanos Querier
oc port-forward -n openshift-monitoring svc/thanos-querier 9090:9091 &

# Query for Intel Gaudi metrics
curl -s "http://localhost:9090/api/v1/label/__name__/values" | grep habanalabs

# Stop port-forward
pkill -f "port-forward.*thanos-querier"
```

2. **Verify metrics are being scraped**:
- Open browser to Thanos UI: `oc port-forward -n openshift-monitoring svc/thanos-querier 9090:9091`
- Navigate to http://localhost:9090
- Go to **Status****Targets**
- Search for your namespace and verify the Intel Gaudi exporter target is UP

3. **Review MCP server logs**:
```bash
oc logs -n <your-namespace> deployment/mcp-server | grep -i "gaudi\|habanalabs"
```

### Metric Value Issues

If metrics show unexpected values:

- **Temperature**: Should be in Celsius (typical range: 30-90°C)
- **Power**: In milliwatts - divide by 1000 for Watts (typical: 200,000-500,000 mW = 200-500W)
- **Utilization**: Percentage 0-100
- **Memory**: In bytes - divide by (1024³) for GB

## Performance Considerations

### Scrape Interval

Recommended scrape interval: **15-30 seconds**

- Shorter intervals provide more granular data but increase load
- Longer intervals reduce overhead but may miss transient events

### Metric Retention

Consider data retention policies based on:
- **Short-term** (1-7 days): All metrics for detailed analysis
- **Medium-term** (7-30 days): Aggregated metrics at 1-5 minute resolution
- **Long-term** (30+ days): Hourly/daily aggregates for trend analysis

### Query Optimization

- Use `avg()` for fleet-wide metrics instead of individual device queries
- Apply time range limits to reduce query load
- Use recording rules for frequently accessed aggregations

## Future Vendor Support

This observability stack is designed to support multiple GPU and accelerator vendors. The current implementation supports:
- ✅ NVIDIA GPUs (DCGM exporter)
- ✅ Intel Gaudi (Habana Labs exporter)

**Future vendors** (e.g., AMD and other accelerator vendors) can be added following the same pattern:
1. Search for `TODO.*AMD` in the codebase to find all integration points
2. Follow the Intel Gaudi implementation pattern in `src/core/metrics.py` and `src/core/promql_service.py`
3. Add vendor-specific metrics following the same structure
4. Update vendor-agnostic queries in dashboards
5. Test with the new vendor's hardware

The code includes TODO comments marking exactly where new vendor support should be added, making future extensions straightforward.

## References

- [Intel Gaudi Prometheus Metric Exporter Documentation](https://docs.habana.ai/en/latest/Orchestration/Prometheus_Metric_Exporter.html)
- [Intel Gaudi Architecture Overview](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html)
- [Prometheus Query Language (PromQL)](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [OpenShift AI Observability Overview](./OBSERVABILITY_OVERVIEW.md)

## Support

For issues specific to:
- **Intel Gaudi exporter**: Refer to Intel Gaudi documentation
- **Observability stack integration**: Create an issue in this repository
- **Prometheus/Thanos**: Consult respective project documentation

Loading
Loading