Skip to content

Commit d8ef0b4

Browse files
authored
Proposal to refactor the MCP Registry controller (#2301)
* draft * draft * Plan for moving registry source handling logic to thv-registry-api Signed-off-by: Daniele Martinoli <[email protected]> * renamed to match validation pattern Signed-off-by: Daniele Martinoli <[email protected]> * reviewed API paths Signed-off-by: Daniele Martinoli <[email protected]> * added option 4 for sync updates Signed-off-by: Daniele Martinoli <[email protected]> * changed to PUT /registry/sync Signed-off-by: Daniele Martinoli <[email protected]> --------- Signed-off-by: Daniele Martinoli <[email protected]>
1 parent 88435a3 commit d8ef0b4

File tree

1 file changed

+322
-0
lines changed

1 file changed

+322
-0
lines changed
Lines changed: 322 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,322 @@
1+
# Refactor MCPRegistry controller
2+
## Requirement Overview
3+
4+
The MCPRegistry operator uses the `MCPRegistry` CRD to deploy Registry API servers pre-populated with curated MCP server definitions. The controller fetches registry data from multiple source types (`ConfigMap`, Git repositories, or external Registry APIs), applies filtering rules, and stores the result in a `ConfigMap` mounted by the Registry API server.
5+
6+
Currently, these data source management capabilities exist only in the Kubernetes operator. However, ToolHive users typically start with the CLI before deploying to Kubernetes. To provide a consistent experience, we need to move data source management into `thv-registry-api` itself, making these features available in both CLI and Kubernetes environments.
7+
8+
**Affected features:**
9+
- Data source configuration (ConfigMap, Git, Registry API)
10+
- Periodic and on-demand synchronization
11+
- Tag and name filtering
12+
13+
This proposal describes moving functionality from the `thv-operator` module ([toolhive](https://github.com/stacklok/toolhive)) to the `thv-registry-api` module ([toolhive-registry-server](https://github.com/stacklok/toolhive-registry-server)).
14+
15+
### Design Constraints
16+
17+
This refactoring maintains feature parity without introducing new capabilities:
18+
19+
- **API stability**: The `MCPRegistry` CRD spec remains unchanged
20+
- **Status evolution**: Status field changes will be documented if required
21+
- **CLI extension**: New runtime flags may be added to `thv-registry-api`
22+
- **Storage**: File-based storage will be implemented; database storage is out of scope
23+
24+
## Current status
25+
### Data source configuration
26+
`MCPRegistry` resources define data sources and filtering rules in the `spec.source` and `spec.filter` fields:
27+
```yaml
28+
spec:
29+
source:
30+
type: configmap
31+
configmap:
32+
name: minimal-registry-data
33+
syncPolicy:
34+
interval: "30m"
35+
filter:
36+
tags:
37+
include: ["database", "production"]
38+
exclude: ["experimental", "deprecated", "beta"]
39+
```
40+
### Current Implementation
41+
The `MCPRegistry` controller implements data source management in these operator packages:
42+
| Package| Purpose |
43+
|--------|---------|
44+
|`cmd/thv-operator/pkg/sources`|`SourceHandler` interface and implementations for `ConfigMap`, Git, and Registry API sources; StorageManager implementation using `ConfigMap` |
45+
|`cmd/thv-operator/pkg/httpclient`|HTTP client for Registry API sources|
46+
|`cmd/thv-operator/pkg/git`|Git repository operations|
47+
|`cmd/thv-operator/pkg/sync`|`Manager` interface for periodic and on-demand synchronization|
48+
|`cmd/thv-operator/pkg/filtering`|`FilterService` interface and implementation for tag and name filtering|
49+
50+
These packages run exclusively within the operator controller with no external dependencies.
51+
52+
### Controller Responsibilities
53+
The controller currently handles two distinct concerns:
54+
1. Deployment management: Creates and maintains Registry API server Deployments and Services
55+
1. Data synchronization: Fetches, filters, and stores registry data in `ConfigMap`s
56+
57+
After deploying the Registry API server, the controller does not communicate with it. Instead, the controller directly fetches registry data from the configured source (Git, ConfigMap, or remote API) and stores it in a `ConfigMap`. The deployed Registry API server reads from this same `ConfigMap` to serve requests from external clients.
58+
59+
### Status Reporting
60+
The MCPRegistry status tracks deployment, synchronization and storage state information.
61+
62+
#### API deployment status
63+
```yaml
64+
apiStatus:
65+
endpoint: http://thv-git-api.toolhive-system:8080
66+
phase: Ready
67+
readySince: "2025-10-22T08:30:38Z"
68+
```
69+
#### Synchronization status
70+
```yaml
71+
syncStatus:
72+
phase: Complete
73+
lastSyncTime: "2025-10-22T08:30:27Z"
74+
serverCount: 88
75+
lastSyncHash: c013b6b36286ab...
76+
```
77+
#### Storage reference
78+
```yaml
79+
storageRef:
80+
type: configmap
81+
configMapRef:
82+
name: thv-git-registry-storage
83+
```
84+
#### Standard conditions
85+
```yaml
86+
conditions:
87+
- type: APIReady
88+
status: "True"
89+
lastTransitionTime: "2025-10-22T08:30:38Z"
90+
```
91+
92+
## Refactoring Proposal
93+
94+
### Component Responsibilities
95+
96+
**Controller:**
97+
- Deploy and manage Registry API (`Deployment`, `Service`)
98+
- Create and update configuration `ConfigMap` from `MCPRegistry` spec
99+
- Monitor deployment health and update `apiStatus`
100+
- Trigger manual syncs via Registry API endpoint
101+
- Update `syncStatus`, see [Sync Status Updates](#sync-status-updates) options
102+
103+
**Registry API:**
104+
- Watch configuration file for changes
105+
- Execute periodic and manual synchronization
106+
- Fetch, filter, and store registry data
107+
- Expose sync status via HTTP endpoint
108+
109+
### Configuration Management
110+
111+
**Flow:**
112+
1. Controller creates `ConfigMap` `<registry-name>-config` containing `MCPRegistry` spec
113+
2. `ConfigMap` mounted at `/config/spec.yaml` in Registry API pod
114+
3. Registry API launched with `--config /config/spec.yaml` flag
115+
4. Registry API watches file and triggers sync on changes
116+
117+
### Storage
118+
119+
**File-based (Phase 1):**
120+
- Registry API stores data at path defined by `--storage-path` flag
121+
- Controller sets `--storage-path /data` using mounted `emptyDir` volume
122+
- Data format: ToolHive registry JSON schema (unchanged from current)
123+
- Persistence: Data survives pod restarts, lost on deployment deletion
124+
125+
**Database-backed (Future):** Deferred to separate proposal for multi-replica support.
126+
127+
### Sync Status Updates
128+
129+
**Options:**
130+
131+
**Option 1 - HTTP Polling:**
132+
- Registry API exposes `GET /registry/status` returning sync history, stored in the configured storage path
133+
- Controller polls every 60 seconds and updates `syncStatus`
134+
- Pros: Simple, no new CRDs, works for CLI and Kubernetes
135+
- Cons: 60-second status latency
136+
137+
**Option 2 - Event CRD:**
138+
- New `SyncCompletionEvent` CRD created by Registry API on sync completion
139+
- Controller watches events and updates `MCPRegistry.status.syncStatus`
140+
- Pros: Real-time updates
141+
- Cons: Requires RBAC, adds CRD complexity, Kubernetes-only
142+
143+
**Option 3 - Remove syncStatus:**
144+
- Remove `syncStatus` from `MCPRegistry` entirely
145+
- Users query Registry API `GET /registry/status` directly for sync information
146+
- Pros: Simplest implementation
147+
- Cons: Breaking change, loses kubectl visibility
148+
149+
**Option 4 - Hybrid polling (Recommended):**
150+
- Registry API exposes `GET /registry/status` returning sync history, stored in the configured storage path
151+
- Controller polls every 60 seconds and creates an instance of the new `SyncCompletionEvent` CRD on sync completion
152+
- Controller watches events and updates `MCPRegistry.status.syncStatus`
153+
- Pros: Anticipate changes for the final solution (option 2)
154+
- Cons: Adds CRD complexity, 60-second status latency
155+
156+
**Recommendation:** Option 4 provides best balance of simplicity and functionality.
157+
158+
**Status endpoint response:**
159+
```json
160+
{
161+
"lastSync": {
162+
"type": "automatic",
163+
"startTime": "2025-10-22T08:30:38Z",
164+
"completionTime": "2025-10-22T08:32:38Z",
165+
"hash": "c013b6b36286ab...",
166+
"serverCount": 88,
167+
"result": "completed",
168+
"attemptCount": 0,
169+
"errorMessage": null
170+
}
171+
}
172+
```
173+
174+
### Automatic Sync
175+
176+
- Registry API uses `time.Ticker` for periodic sync based on `syncPolicy.interval`
177+
- On each interval: check source hash, sync only if changed
178+
- Record result in sync history
179+
180+
### Manual Sync
181+
182+
- Registry API exposes `PUT /registry/sync` endpoint
183+
- Controller calls endpoint when `toolhive.stacklok.dev/sync-trigger` annotation changes
184+
- Endpoint returns 201 (created) or 409 (sync in progress)
185+
- Sync result recorded in history
186+
187+
### Sync Failure Handling
188+
189+
**Current behavior:**
190+
- Fixed 5-minute retry interval
191+
- Unlimited retry attempts
192+
- `syncStatus.attemptCount` increments on failure, resets on success
193+
194+
**Proposed (Phase 1):**
195+
Match current behavior - Registry API retries every 5 minutes indefinitely.
196+
197+
**Future enhancement:**
198+
Add configurable retry policy to `syncPolicy.retryPolicy` (tracked separately).
199+
200+
### Data Filtering
201+
202+
- Registry API applies name and tag filters during sync processing
203+
- Filter changes trigger immediate sync (hash-based detection)
204+
- Move `cmd/thv-operator/pkg/filtering` package to `thv-registry-api`
205+
206+
### API Status Updates
207+
208+
Controller continues managing `apiStatus` with no changes:
209+
- Monitors Deployment and Service health
210+
- Updates phase: NotStarted, Deploying, Ready, Unhealthy, Error
211+
- Existing `apiStatus` structure preserved
212+
213+
## Implementation Plan
214+
215+
### Phase 1: Move Data Source Management to Registry API
216+
217+
**Goal:** Full-functioning solution with sync logic in `thv-registry-api`, maintaining backward compatibility with existing MCPRegistry resources.
218+
219+
**Task Order Rationale:** Registry API must be enhanced first before operator changes, to avoid breaking existing deployments.
220+
221+
#### Registry API Development
222+
223+
1. **Add configuration file support**
224+
- Implement `--config` flag in `thv-registry-api serve` command
225+
- Add YAML configuration parsing for `source`, `syncPolicy`, and `filter` fields
226+
- Implement file watcher for configuration changes (using `fsnotify` or similar)
227+
- Graceful reload on configuration changes
228+
229+
2. **Add file-based storage**
230+
- Implement `--storage-path` flag for registry data location
231+
- Add file-based storage backend matching current ConfigMap JSON format
232+
- Atomic file writes (temp file + rename pattern)
233+
- Load existing data on startup if present
234+
235+
3. **Port source handler packages**
236+
- Copy `cmd/thv-operator/pkg/sources` to `thv-registry-api`
237+
- Copy `cmd/thv-operator/pkg/httpclient` to `thv-registry-api`
238+
- Copy `cmd/thv-operator/pkg/git` to `thv-registry-api`
239+
- Adapt packages to work without Kubernetes client dependencies
240+
- Update import paths
241+
242+
4. **Port sync manager**
243+
- Copy `cmd/thv-operator/pkg/sync` to `thv-registry-api`
244+
- Adapt sync logic to use file-based storage instead of ConfigMap
245+
- Implement retry logic (5-minute interval, unlimited attempts)
246+
- Remove Kubernetes-specific dependencies
247+
248+
5. **Port filtering logic**
249+
- Copy `cmd/thv-operator/pkg/filtering` to `thv-registry-api`
250+
- Keep filtering interface and implementation unchanged
251+
- Update import paths
252+
253+
6. **Implement automatic sync**
254+
- Add `time.Ticker` based periodic sync using `syncPolicy.interval`
255+
- Implement hash-based change detection before syncing
256+
- Store sync history in file at storage path
257+
258+
7. **Add sync status endpoint**
259+
- Implement `GET /registry/status` returning sync history
260+
- Return format matching proposed JSON schema
261+
- Read from sync history file
262+
263+
8. **Add manual sync endpoint**
264+
- Implement `POST /registry/sync` for manual sync triggering
265+
- Return 202 (accepted) or 409 (sync in progress)
266+
- Prevent concurrent syncs
267+
268+
9. **Add integration tests**
269+
- Test configuration file watching and reload
270+
- Test all source types (ConfigMap, Git, API)
271+
- Test filtering logic
272+
- Test automatic and manual sync
273+
- Test retry behavior on failures
274+
275+
#### Operator Updates
276+
277+
10. **Update operator to use new Registry API**
278+
- Modify MCPRegistry controller to create configuration ConfigMap
279+
- Update Deployment spec to mount config ConfigMap at `/config/spec.yaml`
280+
- Update Deployment spec to mount emptyDir volume at `/data`
281+
- Add `--config /config/spec.yaml --storage-path /data` flags to registry-api container
282+
- Add polling logic to call `GET /registry/status` every 60 seconds (depends on [selected update option](#sync-status-updates))
283+
- Update `syncStatus` from polling results
284+
285+
11. **Add manual sync trigger via API**
286+
- Detect `toolhive.stacklok.dev/sync-trigger` annotation changes
287+
- Call `POST /registry/sync` on Registry API endpoint
288+
- Update `lastManualSyncTrigger` after successful API call
289+
290+
12. **Remove sync logic from operator**
291+
- Remove sync execution code from MCPRegistry controller (keep status update only)
292+
- Mark `cmd/thv-operator/pkg/sources` as deprecated (keep for reference)
293+
- Mark `cmd/thv-operator/pkg/sync` as deprecated (keep for reference)
294+
- Mark `cmd/thv-operator/pkg/filtering` as deprecated (keep for reference)
295+
296+
13. **Update operator integration tests**
297+
- Update tests to verify controller creates config ConfigMap
298+
- Test controller polls Registry API status endpoint
299+
- Test manual sync via annotation
300+
- Test backward compatibility with existing MCPRegistry resources
301+
302+
14. **Update documentation**
303+
- Document new Registry API flags and endpoints
304+
- Update MCPRegistry CRD examples
305+
- Add migration guide for users (if any operator upgrade steps needed)
306+
307+
### Phase 2: Future Enhancements
308+
309+
**Goal:** Improvements deferred from Phase 1 to keep initial scope manageable.
310+
311+
1. **Configurable retry policy**
312+
- Add `spec.syncPolicy.retryPolicy` field to MCPRegistry CRD
313+
- Support `maxAttempts`, `retryInterval`, `backoff` configuration
314+
- Implement in Registry API sync logic
315+
- Update operator to pass retry config via ConfigMap
316+
317+
2. **Real-time status updates (if needed)**
318+
- Evaluate if 60-second polling latency is acceptable
319+
- If not: Implement Option 2 (Event CRD) from proposal
320+
- Add `SyncCompletionEvent` CRD
321+
- Update Registry API to create events
322+
- Update controller to watch events

0 commit comments

Comments
 (0)