
Commit 54e41ac

Feature: Added logging, LokiStack for observability

* Added LokiStack and MinIO files
* Updated deployment instructions
* Updated Makefile, scripts, and docs
* Removed email, modified log query access, added UIPlugin, updated docs
* Changed LokiStack namespace to openshift-logging for Korrel8r integration
* Fixed installation errors for Loki/logging in the Makefile for seamless deployment

Signed-off-by: Manna Kong <[email protected]>
1 parent 3d25565 commit 54e41ac

File tree

18 files changed: +2293 −84


Makefile

Lines changed: 327 additions & 26 deletions
Large diffs are not rendered by default.
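The Makefile diff is collapsed on this page; per the commit message, `make install` now also installs the Loki/logging operators before deploying the chart. As a sketch only, the kind of OLM Subscription such automation typically applies looks like the following (the channel, catalog, and namespace are assumptions, not taken from this commit):

```yaml
# Hypothetical Subscription the Makefile might apply to install the Loki Operator.
# Channel, source, and namespace are assumptions, not visible in this commit.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: stable
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
```

An OperatorGroup in the target namespace is usually also required; whether the Makefile creates one is not visible from this page.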

README.md

Lines changed: 50 additions & 35 deletions
@@ -19,13 +19,14 @@ Key capabilities include automated performance analysis, predictive cost optimiz
 Perfect for AI operations teams, platform engineers, and business stakeholders who need to understand and optimize their AI infrastructure without becoming metrics experts.

 ### **Main Features**
-- **Chat with Prometheus/Alertmanager/Tempo** - Ask questions about metrics, alerts, and traces in natural language
+- **Chat with Prometheus/Alertmanager/Tempo/Loki** - Ask questions about metrics, alerts, traces, and logs in natural language
 - **AI-Powered Insights** - Natural language queries with intelligent responses
 - **GPU & Model Monitoring** - Real-time vLLM and DCGM metrics tracking
 - **Multi-Dashboard Interface** - vLLM, OpenShift, and Chat interfaces
 - **Report Generation** - Export analysis in HTML, PDF, or Markdown
 - **Smart Alerting** - AI-powered Slack notifications and custom thresholds
-- **Distributed Tracing** - Complete observability stack with OpenTelemetry
+- **Complete Observability Stack** - Distributed tracing (Tempo), log aggregation (Loki), and metrics (Prometheus) with OpenTelemetry
 - **MCP Integration** - AI assistant support for Claude Desktop and Cursor IDE

 ### Architecture diagrams
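Aside: the "Chat with … Loki" feature added in this commit presumably compiles natural-language questions into LogQL behind the scenes. A hypothetical query of the kind it might run (the label names and namespace are assumptions, not from this commit):

```logql
# Hypothetical LogQL for "show me recent errors from the vLLM pods"
{kubernetes_namespace_name="your-namespace", kubernetes_pod_name=~"vllm.*"} |= "error"
```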
@@ -34,40 +35,43 @@ Perfect for AI operations teams, platform engineers, and business stakeholders w

 ## Requirements

-| **Category** | **Component** | **Minimum** | **Recommended** |
-|--------------|---------------|-------------|-----------------|
-| **Hardware** | CPU Cores | 4 cores | 8 cores |
-| | Memory | 8 Gi RAM | 16 Gi RAM |
-| | Storage | 20Gi | 50Gi |
-| | GPU | Optional | GPU nodes (for DCGM metrics) |
-| **Software** | OpenShift | 4.16.24 or later | 4.16.24 or later |
-| | OpenShift AI | 2.16.2 or later | 2.16.2 or later |
-| | Service Mesh | Red Hat OpenShift Service Mesh | Red Hat OpenShift Service Mesh |
-| | Serverless | Red Hat OpenShift Serverless | Red Hat OpenShift Serverless |
-| **Tools** | CLI | `oc` CLI | `oc` CLI + `helm` v3.x + `yq` |
-| | Monitoring | Prometheus/Thanos | Prometheus/Thanos |
-| **Permissions** | User Access | Standard user with project admin | Standard user with project admin |
-| **Optional** | GPU Monitoring | - | DCGM exporter |
-| | Alerting | - | Slack Webhook URL |
-| | Tracing | - | OpenTelemetry + Tempo Operators |
+| **Category**    | **Component**  | **Minimum**                      | **Recommended**                                  |
+| --------------- | -------------- | -------------------------------- | ------------------------------------------------ |
+| **Hardware**    | CPU Cores      | 4 cores                          | 8 cores                                          |
+|                 | Memory         | 8 Gi RAM                         | 16 Gi RAM                                        |
+|                 | Storage        | 20Gi                             | 50Gi                                             |
+|                 | GPU            | Optional                         | GPU nodes (for DCGM metrics)                     |
+| **Software**    | OpenShift      | 4.16.24 or later                 | 4.16.24 or later                                 |
+|                 | OpenShift AI   | 2.16.2 or later                  | 2.16.2 or later                                  |
+|                 | Service Mesh   | Red Hat OpenShift Service Mesh   | Red Hat OpenShift Service Mesh                   |
+|                 | Serverless     | Red Hat OpenShift Serverless     | Red Hat OpenShift Serverless                     |
+| **Tools**       | CLI            | `oc` CLI                         | `oc` CLI + `helm` v3.x + `yq`                    |
+|                 | Monitoring     | Prometheus/Thanos                | Prometheus/Thanos                                |
+| **Permissions** | User Access    | Standard user with project admin | Standard user with project admin                 |
+| **Optional**    | GPU Monitoring | -                                | DCGM exporter                                    |
+|                 | Alerting       | -                                | Slack Webhook URL                                |
+|                 | Observability  | -                                | OpenTelemetry + Tempo + Logging + Loki Operators |

 ## Deploy

 ### Installing the OpenShift AI Observability Summarizer

 Use the included `Makefile` to install everything:
+
 ```bash
 make install NAMESPACE=your-namespace
 ```
-This will install the project with the default LLM deployment, `llama-3-1-8b-instruct`.
+
+This will install the project with the default LLM deployment, `llama-3-2-3b-instruct`.

 ### Choosing different models during installation

 To see all available models:
+
 ```bash
 make list-models
 ```
+
 ```
 (Output)
 model: llama-3-1-8b-instruct (meta-llama/Llama-3.1-8B-Instruct)
@@ -78,32 +82,39 @@ model: llama-3-3-70b-instruct (meta-llama/Llama-3.3-70B-Instruct)
 model: llama-guard-3-1b (meta-llama/Llama-Guard-3-1B)
 model: llama-guard-3-8b (meta-llama/Llama-Guard-3-8B)
 ```
+
 You can use the `LLM` flag during installation to set a model from this list for deployment:
+
 ```
-make install NAMESPACE=your-namespace LLM=llama-3-1-8b-instruct
+make install NAMESPACE=your-namespace LLM=llama-3-2-3b-instruct
 ```

 ### With GPU tolerations
+
 ```bash
 make install NAMESPACE=your-namespace LLM=llama-3-1-8b-instruct LLM_TOLERATION="nvidia.com/gpu"
 ```

 ### With alerting if you want to send on SLACK
+
 ```bash
 make install NAMESPACE=your-namespace ALERTS=TRUE
 ```
+
 Enabling alerting will deploy alert rules, a cron job to monitor vLLM metrics, and AI-powered Slack notifications.

 ### Accessing the Application

 The default configuration deploys:
+
 - **llm-service** - LLM inference
 - **llama-stack** - Backend API
 - **metric-ui** - Multi-dashboard Streamlit interface
 - **mcp-server** - Model Context Protocol server for metrics analysis, report generation, and AI assistant integration
 - **OpenTelemetry Collector** - Distributed tracing collection
 - **Tempo** - Trace storage and analysis
-- **MinIO** - Object storage for traces
+- **Loki** - Centralized log aggregation and querying
+- **MinIO** - Object storage for traces and logs

 Navigate to your **OpenShift Cluster → Networking → Routes** to find the application URL(s). You can also navigate to **Observe → Traces** in the OpenShift console to view traces.
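The commit message also mentions adding a UIPlugin for log query access in the console. With the Cluster Observability Operator, the "Observe → Logs" view is typically enabled by a resource like the following sketch (the LokiStack name is an assumption):

```yaml
# Hypothetical UIPlugin enabling the console logging view.
# The referenced LokiStack name is an assumption, not from this commit.
apiVersion: observability.openshift.io/v1alpha1
kind: UIPlugin
metadata:
  name: logging
spec:
  type: Logging
  logging:
    lokiStack:
      name: logging-loki
```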

@@ -116,18 +127,21 @@ NAME HOST/PORT
 metric-ui-route metric-ui-route-llama-1.apps.tsisodia-spark.2vn8.p1.openshiftapps.com metric-ui-svc 8501 edge/Redirect None
 ```

-### OpenShift Summarizer Dashboard
+### OpenShift Summarizer Dashboard
+
 ![UI](docs/images/os.png)

-### vLLM Summarizer Dashboard
+### vLLM Summarizer Dashboard
+
 ![UI](docs/images/vllm.png)

-### Chat with Prometheus
+### Chat with Prometheus
+
 ![UI](docs/images/chat.png)

-### Report Generated
-![UI](docs/images/report.png)
+### Report Generated

+![UI](docs/images/report.png)

 To uninstall:
@@ -137,13 +151,14 @@ make uninstall NAMESPACE=your-namespace

 ### References

-* Built on [Prometheus](https://prometheus.io/) and [Thanos](https://thanos.io/) for metrics collection
-* Uses [vLLM](https://github.com/vllm-project/vllm) for model serving
-* Powered by [Streamlit](https://streamlit.io/) for the web interface
-* Integrates with [OpenTelemetry](https://opentelemetry.io/) for distributed tracing
+- Built on [Prometheus](https://prometheus.io/) and [Thanos](https://thanos.io/) for metrics collection
+- Uses [Loki](https://grafana.com/oss/loki/) for centralized log aggregation
+- Uses [vLLM](https://github.com/vllm-project/vllm) for model serving
+- Powered by [Streamlit](https://streamlit.io/) for the web interface
+- Integrates with [OpenTelemetry](https://opentelemetry.io/) for distributed tracing and observability

 ## Tags

-* **Industry:** Cross-industry
-* **Product:** OpenShift AI
-* **Use case:** AI Operations, Observability, Monitoring
+- **Industry:** Cross-industry
+- **Product:** OpenShift AI
+- **Use case:** AI Operations, Observability, Monitoring
Chart.yaml (new file, loki-stack chart)

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+apiVersion: v2
+name: loki-stack
+description: A Helm chart for LokiStack with MinIO storage integration for OpenShift AI Observability
+type: application
+version: 0.1.0
+appVersion: "2.9.0"
+
+keywords:
+  - loki
+  - logging
+  - observability
+  - openshift
+  - lokistack
+  - minio
+  - storage
+  - opentelemetry
+
+maintainers:
+  - name: OpenShift AI Observability Team
+
+dependencies: []
+
+annotations:
+  category: Observability
+  description: "LokiStack deployment with shared MinIO storage backend"
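The chart above packages a LokiStack backed by shared MinIO storage. As a minimal sketch of the kind of LokiStack resource such a chart would template, assuming the `openshift-logging` namespace named in the commit message (the size, secret name, schema date, and storage class are hypothetical):

```yaml
# Hypothetical LokiStack of the kind this chart would template.
# size, secret name, effectiveDate, and storageClassName are assumptions.
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.demo
  storage:
    schemas:
      - version: v13
        effectiveDate: "2024-01-01"
    secret:
      name: minio-loki-secret   # credentials for the shared MinIO backend
      type: s3
  storageClassName: gp3-csi
  tenants:
    mode: openshift-logging
```

Placing the LokiStack in `openshift-logging` (rather than the application namespace) is what lets Korrel8r correlate logs with the rest of the cluster observability stack, per the commit message.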

0 commit comments