Closed
20 changes: 20 additions & 0 deletions README.md
@@ -59,6 +59,26 @@ ToolHive is available as a GUI desktop app, CLI, and Kubernetes Operator.
</tr>
</table>

## Kubernetes Operator

ToolHive includes a Kubernetes Operator for enterprise and production deployments:

### Features

- **MCPServer CRD**: Deploy and manage MCP servers as Kubernetes resources
- **MCPRegistry CRD** *(Experimental)*: Centralized registry management with automated sync
- **Secure isolation**: Container-based server execution with permission profiles
- **Protocol proxying**: Stdio servers exposed via HTTP/SSE networking protocols
- **Service discovery**: Automatic service creation and DNS integration

### Documentation

- [Operator Guide](cmd/thv-operator/README.md) - Complete operator documentation
- [MCPRegistry Reference](cmd/thv-operator/REGISTRY.md) - Registry management (experimental)
- [CRD API Reference](docs/operator/crd-api.md) - Auto-generated API documentation
- [Deployment Guide](docs/kind/deploying-toolhive-operator.md) - Step-by-step installation
- [Examples](examples/operator/) - Sample configurations

## Quick links

- 📚 [Documentation](https://docs.stacklok.com/toolhive/)
56 changes: 56 additions & 0 deletions cmd/thv-operator/CLAUDE.md
@@ -13,6 +13,62 @@ After modifying the CRDs, the following needs to be run:

When committing a change that changes CRDs, it is important to bump the chart version as described in the [CLAUDE.md](../../deploy/charts/operator-crds/CLAUDE.md#bumping-crd-chart) doc for the CRD Helm Chart.

## MCPRegistry CRD (Experimental)

The MCPRegistry CRD enables centralized management of MCP server registries. Requires `operator.features.experimental=true`.

### Key Components

- **CRD**: `api/v1alpha1/mcpregistry_types.go`
- **Controller**: `controllers/mcpregistry_controller.go`
- **Status**: `pkg/mcpregistrystatus/`
- **Sync**: `pkg/sync/`
- **Sources**: `pkg/sources/`
- **API**: `pkg/registryapi/`

### Development Patterns

#### Status Collector Pattern

Always use StatusCollector for batched updates:

```go
// ✅ Good: Collect all changes, apply once
statusCollector := mcpregistrystatus.NewCollector(mcpRegistry)
statusCollector.SetPhase(mcpv1alpha1.MCPRegistryPhaseReady)
statusCollector.Apply(ctx, r.Client)

// ❌ Bad: Multiple individual updates cause conflicts
r.Status().Update(ctx, mcpRegistry)
```
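
The batching idea can be sketched without any Kubernetes dependencies. This is a minimal illustration, not the operator's `mcpregistrystatus` package: `RegistryStatus`, `Collector`, and the `update` callback are hypothetical stand-ins for the real status type and the API client call.

```go
package main

import "fmt"

// RegistryStatus is a hypothetical stand-in for the MCPRegistry status.
type RegistryStatus struct {
	Phase   string
	Message string
}

// Collector records pending status changes; nothing is written until Apply.
type Collector struct {
	phase   *string
	message *string
}

// SetPhase records a phase change; a later call overwrites an earlier one.
func (c *Collector) SetPhase(p string) { c.phase = &p }

// SetMessage records a message change.
func (c *Collector) SetMessage(m string) { c.message = &m }

// Apply merges the recorded changes into status and calls update at most
// once. It reports whether an update was issued.
func (c *Collector) Apply(status *RegistryStatus, update func(RegistryStatus) error) (bool, error) {
	if c.phase == nil && c.message == nil {
		return false, nil // nothing changed, so skip the write entirely
	}
	next := *status
	if c.phase != nil {
		next.Phase = *c.phase
	}
	if c.message != nil {
		next.Message = *c.message
	}
	if err := update(next); err != nil {
		return false, err // leave status and pending changes untouched
	}
	*status = next
	c.phase, c.message = nil, nil
	return true, nil
}

func main() {
	status := RegistryStatus{Phase: "Pending"}
	writes := 0
	update := func(RegistryStatus) error { writes++; return nil }

	c := &Collector{}
	c.SetPhase("Syncing")
	c.SetMessage("fetching source")
	c.SetPhase("Ready")
	if _, err := c.Apply(&status, update); err != nil {
		fmt.Println("update failed:", err)
		return
	}
	fmt.Println(status.Phase, writes) // prints "Ready 1"
}
```

Three setter calls result in a single write, which is the property the pattern is meant to provide.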

#### Error Handling

Always set status before returning errors:

```go
if err := validateSource(); err != nil {
    statusCollector.SetSyncStatus(mcpv1alpha1.SyncPhaseFailed, err.Error(), ...)
    return ctrl.Result{RequeueAfter: time.Minute * 5}, err
}
```

#### Source Handler Interface

```go
type SourceHandler interface {
    FetchRegistryData(ctx context.Context, source MCPRegistrySource) (*RegistryData, error)
    ValidateSource(ctx context.Context, source MCPRegistrySource) error
    CalculateHash(ctx context.Context, source MCPRegistrySource) (string, error)
}
```

### Testing Patterns

- **Unit Tests**: Use mocks for external dependencies
- **Integration Tests**: Use envtest framework
- **E2E Tests**: Not yet written for MCPRegistry; use Chainsaw when adding them

## OpenTelemetry (OTEL) Stack for Testing

When you have been asked to stand up an OTEL stack to test ToolHive's integration inside of Kubernetes, you will need to perform the following tasks inside of the cluster that you have been instructed to use.
121 changes: 94 additions & 27 deletions cmd/thv-operator/DESIGN.md
@@ -1,44 +1,111 @@
# Design & Decisions

This document aims to help fill in gaps of any decision that are made around the design of the ToolHive Operator.
This document captures architectural decisions and design patterns for the ToolHive Operator.

## CRD Attribute vs `PodTemplateSpec`
## Operator Design Principles

### CRD Attribute vs `PodTemplateSpec`

When building operators, the decision of when to use a `podTemplateSpec` and when to use a CRD attribute is always disputed. For the ToolHive Operator we have a defined rule of thumb.

### Use Dedicated CRD Attributes For:
#### Use Dedicated CRD Attributes For:
- **Business logic** that affects your operator's behavior
- **Validation requirements** (ranges, formats, constraints)
- **Validation requirements** (ranges, formats, constraints)
- **Cross-resource coordination** (affects Services, ConfigMaps, etc.)
- **Operator decision making** (triggers different reconciliation paths)

```yaml
spec:
  version: "13.4"              # Affects operator logic
  replicas: 3                  # Affects scaling behavior
  backupSchedule: "0 2 * * *"  # Needs validation
```

### Use PodTemplateSpec For:
#### Use PodTemplateSpec For:
- **Infrastructure concerns** (node selection, resources, affinity)
- **Sidecar containers**
- **Sidecar containers**
- **Standard Kubernetes pod configuration**
- **Things a cluster admin would typically configure**

```yaml
spec:
  podTemplate:
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: sidecar
        image: monitoring:latest
```

## Quick Decision Test:
#### Quick Decision Test:
1. **"Does this affect my operator's reconciliation logic?"** -> Dedicated attribute
2. **"Is this standard Kubernetes pod configuration?"** -> PodTemplateSpec
2. **"Is this standard Kubernetes pod configuration?"** -> PodTemplateSpec
3. **"Do I need to validate this beyond basic Kubernetes validation?"** -> Dedicated attribute

This gives you a clean API for core functionality while maintaining flexibility for infrastructure concerns.
## MCPRegistry Architecture Decisions

### Status Management Design

**Decision**: Use batched status updates via StatusCollector pattern instead of individual field updates.

**Rationale**:
- Prevents race conditions between multiple status updates
- Reduces API server load with fewer update calls
- Ensures consistent status across reconciliation cycles
- Handles resource version conflicts gracefully

**Implementation**: StatusCollector interface collects all changes and applies them atomically.

### Sync Operation Design

**Decision**: Separate sync decision logic from sync execution with clear interfaces.

**Rationale**:
- Testability: Mock sync decisions independently from execution
- Flexibility: Different sync strategies without changing core logic
- Maintainability: Clear separation of concerns

**Key Patterns**:
- Idempotent operations for safe retry
- Manual vs automatic sync distinction
- Data preservation on failures

### Storage Architecture

**Decision**: Abstract storage via StorageManager interface with ConfigMap as default implementation.

**Rationale**:
- Future flexibility: Easy addition of new storage backends (OCI, databases)
- Testability: Mock storage for unit tests
- Consistency: Single interface for all storage operations

**Current Implementation**: ConfigMap-based with owner references for automatic cleanup.

### Registry API Service Pattern

**Decision**: Deploy individual API service per MCPRegistry rather than shared service.

**Rationale**:
- **Isolation**: Each registry has independent lifecycle and scaling
- **Security**: Per-registry access control possible
- **Reliability**: Failure of one registry doesn't affect others
- **Lifecycle Management**: Automatic cleanup via owner references

**Trade-offs**: Consumes more cluster resources than a shared service, in exchange for better isolation and security.

### Error Handling Strategy

**Decision**: Structured error types with progressive retry backoff.

**Rationale**:
- Different error types need different handling strategies
- Progressive backoff prevents thundering herd problems
- Structured errors enable better observability

**Implementation**: 5m initial retry, exponential backoff with cap, manual sync bypass.

### Performance Design Decisions

#### Resource Optimization
- **Status Updates**: Batched to reduce API calls (implemented)
- **Source Fetching**: Planned caching to avoid repeated downloads
- **API Deployment**: Lazy creation only when needed (implemented)

#### Memory Management
- **Git Operations**: Shallow clones to minimize disk usage (implemented)
- **Large Registries**: Stream processing planned for future
- **Status Objects**: Efficient field-level updates (implemented)

### Security Architecture

#### Permission Model
The operator requests only the permissions it needs, following the principle of least privilege:
- ConfigMaps: For storage management
- Services/Deployments: For API service management
- MCPRegistry: For status updates

#### Network Security
Optional network policies for registry API access control in security-sensitive environments.
66 changes: 63 additions & 3 deletions cmd/thv-operator/README.md
@@ -1,18 +1,34 @@
# ToolHive Kubernetes Operator

The ToolHive Kubernetes Operator manages MCP (Model Context Protocol) servers in Kubernetes clusters. It allows you to define MCP servers as Kubernetes resources and automates their deployment and management.
The ToolHive Kubernetes Operator manages MCP (Model Context Protocol) servers and registries in Kubernetes clusters. It allows you to define MCP servers and registries as Kubernetes resources and automates their deployment and management.

This operator is built using [Kubebuilder](https://book.kubebuilder.io/), a framework for building Kubernetes APIs using Custom Resource Definitions (CRDs).

## Overview

The operator introduces a new Custom Resource Definition (CRD) called `MCPServer` that represents an MCP server in Kubernetes. When you create an `MCPServer` resource, the operator automatically:
The operator introduces two main Custom Resource Definitions (CRDs):

### MCPServer
Represents an MCP server in Kubernetes. When you create an `MCPServer` resource, the operator automatically:

1. Creates a Deployment to run the MCP server
2. Sets up a Service to expose the MCP server
3. Configures the appropriate permissions and settings
4. Manages the lifecycle of the MCP server

### MCPRegistry (Experimental)

> ⚠️ **Experimental Feature**: MCPRegistry requires `ENABLE_EXPERIMENTAL_FEATURES=true`

Represents an MCP server registry in Kubernetes. When you create an `MCPRegistry` resource, the operator automatically:

1. Synchronizes registry data from various sources (ConfigMap, Git)
2. Deploys a Registry API service for server discovery
3. Provides content filtering and image validation
4. Manages automatic and manual synchronization policies

For detailed MCPRegistry documentation, see [REGISTRY.md](REGISTRY.md).

```mermaid
---
config:
@@ -107,7 +123,11 @@ helm upgrade -i toolhive-operator-crds oci://ghcr.io/stacklok/toolhive/toolhive-
2. Install the operator:

```bash
# Standard installation
helm upgrade -i <release_name> oci://ghcr.io/stacklok/toolhive/toolhive-operator --version=<version> -n toolhive-system --create-namespace

# OR with experimental features (for MCPRegistry support)
helm upgrade -i <release_name> oci://ghcr.io/stacklok/toolhive/toolhive-operator --version=<version> -n toolhive-system --create-namespace --set operator.features.experimental=true
```

## Usage
@@ -236,9 +256,49 @@ permissionProfile:

The ConfigMap should contain a JSON permission profile.

### Creating an MCP Registry (Experimental)

> ⚠️ **Requires**: `operator.features.experimental=true`

First, create a ConfigMap containing ToolHive registry data. The ConfigMap must be user-defined and is not managed by the operator:

```bash
# Create ConfigMap from existing registry data
kubectl create configmap my-registry-data --from-file registry.json=pkg/registry/data/registry.json -n toolhive-system

# Or create from your own registry file
kubectl create configmap my-registry-data --from-file registry.json=/path/to/your/registry.json -n toolhive-system
```

Then create the MCPRegistry resource that references the ConfigMap:

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPRegistry
metadata:
  name: my-registry
  namespace: toolhive-system
spec:
  displayName: "My MCP Registry"
  source:
    type: configmap
    configmap:
      name: my-registry-data  # References the user-created ConfigMap
      key: registry.json      # Key in ConfigMap (default: "registry.json")
  syncPolicy:
    interval: "1h"
  filter:
    tags:
      include: ["production"]
      exclude: ["experimental"]
```
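
The `filter.tags` block could plausibly be evaluated as in the sketch below. The semantics are an assumption, not confirmed by the operator's documentation: a server is kept when it carries none of the `exclude` tags and, if `include` is non-empty, at least one of the `include` tags. The function name `matchesTags` is hypothetical.

```go
package main

import "fmt"

// matchesTags reports whether a server with the given tags passes the filter.
// Assumed semantics: exclude wins over include; an empty include list
// admits everything that is not excluded.
func matchesTags(tags, include, exclude []string) bool {
	have := make(map[string]bool, len(tags))
	for _, t := range tags {
		have[t] = true
	}
	for _, t := range exclude {
		if have[t] {
			return false
		}
	}
	if len(include) == 0 {
		return true
	}
	for _, t := range include {
		if have[t] {
			return true
		}
	}
	return false
}

func main() {
	include := []string{"production"}
	exclude := []string{"experimental"}
	fmt.Println(matchesTags([]string{"production"}, include, exclude))                 // prints "true"
	fmt.Println(matchesTags([]string{"production", "experimental"}, include, exclude)) // prints "false"
	fmt.Println(matchesTags([]string{"staging"}, include, exclude))                    // prints "false"
}
```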

For complete MCPRegistry examples and documentation, see [REGISTRY.md](REGISTRY.md).

## Examples

See the `examples/operator/mcp-servers/` directory for example MCPServer resources.
- **MCPServer examples**: `examples/operator/mcp-servers/` directory
- **MCPRegistry examples**: `examples/operator/mcp-registries/` directory

## Development
