| 1 | +# Key-Value Store Architecture |
| 2 | + |
| 3 | +Flight Control uses Redis as a key-value store for two primary purposes: **caching external configuration data** and **managing an event-driven task queue**. This document describes both use cases and the resilience mechanisms that ensure system reliability. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The key-value store serves as: |
| 8 | +1. **Cache Layer**: Stores external configuration data (Git repositories, HTTP endpoints, Kubernetes secrets) |
| 9 | +2. **Event Queue**: Manages asynchronous task processing through Redis Streams |
| 10 | +3. **Resilience Backend**: Provides automatic recovery from failures |
| 11 | + |
| 12 | +## Caching External Configuration Data |
| 13 | + |
| 14 | +Flight Control caches external configuration sources to improve performance and reduce load on external systems. The cache is organized by organization, fleet, and template version to ensure proper isolation. |
| 15 | + |
| 16 | +### Cached Data Types |
| 17 | + |
| 18 | +| Data Type | Key Pattern | Description | |
| 19 | +|-----------|-------------|-------------| |
| 20 | +| **Git Repository URLs** | `v1/{orgId}/{fleet}/{templateVersion}/repo-url/{repository}` | Repository URL mappings | |
| 21 | +| **Git Revisions** | `v1/{orgId}/{fleet}/{templateVersion}/git-hash/{repository}/{targetRevision}` | Git commit hashes for specific revisions | |
| 22 | +| **Git File Contents** | `v1/{orgId}/{fleet}/{templateVersion}/git-data/{repository}/{targetRevision}/{path}` | Actual file contents from Git repositories | |
| 23 | +| **Kubernetes Secrets** | `v1/{orgId}/{fleet}/{templateVersion}/k8ssecret-data/{namespace}/{name}` | Secret data from Kubernetes clusters | |
| 24 | +| **HTTP Response Data** | `v1/{orgId}/{fleet}/{templateVersion}/http-data/{md5(url)}` | Content fetched from HTTP endpoints | |
| 25 | + |
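As a rough illustration of the key patterns in the table above, the sketch below composes two of them. The function names are hypothetical, and it assumes that `md5(url)` means the hex MD5 digest of the UTF-8 encoded URL, which the table does not spell out.

```python
import hashlib

KEY_VERSION = "v1"  # version prefix taken from the key patterns in the table


def scope_prefix(org_id: str, fleet: str, template_version: str) -> str:
    """Prefix that scopes a key to one organization, fleet, and template version."""
    return f"{KEY_VERSION}/{org_id}/{fleet}/{template_version}"


def git_data_key(org_id: str, fleet: str, template_version: str,
                 repository: str, target_revision: str, path: str) -> str:
    """Key for cached Git file contents (hypothetical helper)."""
    prefix = scope_prefix(org_id, fleet, template_version)
    return f"{prefix}/git-data/{repository}/{target_revision}/{path}"


def http_data_key(org_id: str, fleet: str, template_version: str, url: str) -> str:
    """Key for cached HTTP response data (hypothetical helper)."""
    # Assumption: the URL is hashed as UTF-8 and stored as a hex digest.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"{scope_prefix(org_id, fleet, template_version)}/http-data/{digest}"
```

Because every key starts with the same scope prefix, all entries for one template version can be found and deleted together when that version changes.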
| 26 | +### Cache Behavior |
| 27 | + |
| 28 | +- **Cache Keys**: Automatically scoped by organization, fleet, and template version |
| 29 | +- **Cache Invalidation**: Keys are deleted when template versions change |
| 30 | +- **Cache Miss Handling**: External sources are fetched on-demand when cache misses occur |
| 31 | +- **Atomic Operations**: Uses a custom Lua script to implement get-or-set-if-not-exists behavior, preventing race conditions during concurrent access |
| 32 | + |
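The get-or-set-if-not-exists behavior can be sketched as follows. The Lua text is a hypothetical example of such a script, not the one Flight Control ships, and the Python function is only an in-memory stand-in that mimics its semantics without a Redis server.

```python
# Hypothetical Lua script with get-or-set-if-not-exists semantics.
# It is NOT copied from Flight Control; it only illustrates the idea.
GET_OR_SET_NX_LUA = """
local current = redis.call('GET', KEYS[1])
if current then
  return current
end
redis.call('SET', KEYS[1], ARGV[1])
return ARGV[1]
"""


def get_or_set_nx(store: dict, key: str, candidate: bytes) -> bytes:
    """In-memory stand-in: return the stored value, or store and return candidate."""
    # setdefault only writes when the key is absent, so the first writer wins.
    return store.setdefault(key, candidate)
```

If two workers fetch the same external source concurrently and obtain different content, both end up using whichever value was stored first, which keeps them consistent with each other.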
| 33 | +### Important Cache Considerations |
| 34 | + |
| 35 | +⚠️ **Cache Consistency Warning**: When external configuration sources change without updating their references (branch/tag/URL), devices may experience configuration drift: |
| 36 | + |
| 37 | +- **Before cache deletion**: Some devices get `value1` from cache |
| 38 | +- **After cache deletion**: Other devices get `value2` from fresh fetch |
| 39 | +- **Result**: Inconsistent device configurations across the fleet |
| 40 | + |
| 41 | +**Best Practice**: Always update branch names, tags, or URLs when changing external configuration content to ensure cache consistency. |
| 42 | + |
| 43 | +## Event-Driven Task Queue |
| 44 | + |
| 45 | +Flight Control uses Redis Streams with consumer groups to process events asynchronously. Events are published to the `task-queue` stream and processed by worker components. |
| 46 | + |
| 47 | +### Event Processing Flow |
| 48 | + |
| 49 | +```mermaid |
| 50 | +flowchart TD |
| 51 | + A[API Event Created] --> B[Event Published to task-queue] |
| 52 | + B --> C[Worker Consumes Event] |
| 53 | + C --> D{Event Type Analysis} |
| 54 | + |
| 55 | + D --> E[Fleet Rollout Task] |
| 56 | + D --> F[Fleet Selector Matching Task] |
| 57 | + D --> G[Fleet Validation Task] |
| 58 | + D --> H[Device Render Task] |
| 59 | + D --> I[Repository Update Task] |
| 60 | + |
| 61 | + E --> J[Task Completion] |
| 62 | + F --> J |
| 63 | + G --> J |
| 64 | + H --> J |
| 65 | + I --> J |
| 66 | + |
| 67 | + J --> K[Event Acknowledged] |
| 68 | + K --> L[Checkpoint Advanced] |
| 69 | +``` |
| 70 | + |
| 71 | +### Event-to-Task Mapping |
| 72 | + |
| 73 | +| Task | Triggering Events | Description | |
| 74 | +|------|------------------|-------------| |
| 75 | +| **Fleet Rollout** | • Device owner/labels updated<br/>• Device created<br/>• Fleet rollout batch dispatched<br/>• Fleet rollout started (immediate strategy) | Manages device configuration updates according to fleet templates | |
| 76 | +| **Fleet Selector Matching** | • Fleet label selector updated<br/>• Fleet created/deleted<br/>• Device created<br/>• Device labels updated | Matches devices to fleets based on label selectors | |
| 77 | +| **Fleet Validation** | • Fleet template updated<br/>• Fleet created<br/>• Referenced repository updated | Validates fleet templates and creates template versions | |
| 78 | +| **Device Render** | • Device spec updated<br/>• Device created<br/>• Fleet rollout device selected<br/>• Referenced repository updated | Renders device configurations from templates | |
| 79 | +| **Repository Update** | • Repository spec updated<br/>• Repository created | Updates repository references and invalidates related caches | |
| 80 | + |
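The mapping above can be read as a lookup from event to the tasks it triggers. The sketch below encodes the table as data; the identifiers and the lowercase event strings are hypothetical, and "Fleet created/deleted" is assumed to mean two separate events.

```python
from typing import Dict, List

# Task -> triggering events, transcribed from the table above.
TRIGGERS: Dict[str, List[str]] = {
    "fleet-rollout": [
        "device owner/labels updated",
        "device created",
        "fleet rollout batch dispatched",
        "fleet rollout started (immediate strategy)",
    ],
    "fleet-selector-matching": [
        "fleet label selector updated",
        "fleet created",
        "fleet deleted",
        "device created",
        "device labels updated",
    ],
    "fleet-validation": [
        "fleet template updated",
        "fleet created",
        "referenced repository updated",
    ],
    "device-render": [
        "device spec updated",
        "device created",
        "fleet rollout device selected",
        "referenced repository updated",
    ],
    "repository-update": ["repository spec updated", "repository created"],
}


def tasks_for(event: str) -> List[str]:
    """Return every task triggered by the given event, in table order."""
    return [task for task, events in TRIGGERS.items() if event in events]
```

A single event can fan out to several tasks: under these assumptions, `tasks_for("device created")` yields the rollout, selector matching, and render tasks.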
| 81 | +### Queue Management Features |
| 82 | + |
| 83 | +- **Consumer Groups**: Automatic message tracking and load balancing |
| 84 | +- **Message Acknowledgment**: Messages are acknowledged after successful processing |
| 85 | +- **Timeout Handling**: Messages that exceed processing timeout are automatically retried |
| 86 | +- **Failed Message Handling**: Failed messages are retried with exponential backoff up to a maximum number of retries; after that, an event is emitted to report the permanent failure |
| 87 | +- **Checkpoint Tracking**: Global checkpoint ensures no message loss during failures |
| 88 | + |
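A minimal sketch of the retry policy described above. The base delay, the cap, and the retry limit are illustrative assumptions; the document does not state the values Flight Control uses.

```python
from typing import Optional, Tuple


def retry_delay(attempt: int, base_seconds: float = 1.0,
                cap_seconds: float = 60.0) -> float:
    """Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(cap_seconds, base_seconds * 2 ** (attempt - 1))


def next_action(failures: int, max_retries: int = 5) -> Tuple[str, Optional[float]]:
    """Decide what to do after the given number of failures of one message."""
    if failures > max_retries:
        # Retries exhausted: the caller would emit a permanent-failure event.
        return ("permanent-failure", None)
    return ("retry", retry_delay(failures))
```

With these defaults the delays grow as 1, 2, 4, 8, and 16 seconds, and the sixth failure is reported as permanent.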
| 89 | +## Resilience and Recovery |
| 90 | + |
| 91 | +Flight Control implements a **dual-persistence architecture** so that no events are lost during Redis failures. Delivery is at-least-once, so duplicate processing is possible. |
| 92 | + |
| 93 | + |
| 94 | +### Architecture Components |
| 95 | + |
| 96 | +1. **Redis Streams**: Primary queue for fast event processing |
| 97 | +2. **PostgreSQL Database**: Persistent storage for events and checkpoints |
| 98 | +3. **Recovery Mechanism**: Automatic event republishing from database |
| 99 | + |
| 100 | +### Recovery Process |
| 101 | + |
| 102 | +When Redis fails or is restarted: |
| 103 | + |
| 104 | +1. **Checkpoint Detection**: System detects missing Redis checkpoint |
| 105 | +2. **Database Checkpoint Retrieval**: Last known checkpoint is retrieved from PostgreSQL |
| 106 | +3. **Event Republishing**: All events since the last checkpoint are republished to Redis |
| 107 | +4. **Queue Restoration**: Fresh Redis instance receives all missed events |
| 108 | +5. **Normal Operation**: Processing resumes; events since the checkpoint may be reprocessed. Handlers must be idempotent. |
| 109 | + |
| 110 | +Note: The replay window is the time elapsed since the last persisted checkpoint (“now - last persisted checkpoint”). Persist checkpoints more frequently to shorten the replay window and reduce duplicate processing. |
| 111 | + |
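The recovery steps and the idempotency requirement can be sketched as follows. The event shape (a timestamp plus an ID), the strict "later than the checkpoint" comparison, and the class name are assumptions made for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Event = Tuple[int, str]  # (timestamp, event id); hypothetical shape


def events_to_replay(db_events: Iterable[Event], checkpoint: int) -> List[Event]:
    """Select events recorded after the last persisted checkpoint, oldest first."""
    return sorted(e for e in db_events if e[0] > checkpoint)


class IdempotentHandler:
    """Applies each event at most once, even if it is delivered again."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.applied: List[str] = []

    def handle(self, event: Event) -> bool:
        _, event_id = event
        if event_id in self.seen:
            return False  # duplicate delivery after a replay: nothing to do
        self.seen.add(event_id)
        self.applied.append(event_id)
        return True
```

An event that was processed before the failure but after the last checkpoint is delivered a second time during replay; the handler recognizes it and skips it, so the end state matches a run without any failure.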
| 112 | +### Recovery Limitations |
| 113 | + |
| 114 | +⚠️ **Important**: The resilience mechanism only replays **events in the queue**. It does not restore cached external configuration data: |
| 115 | + |
| 116 | +- **Events**: Automatically republished from PostgreSQL database |
| 117 | +- **Cache Data**: Must be re-fetched from external sources (Git, HTTP, Kubernetes) |
| 118 | +- **Cache Invalidation**: Occurs automatically when template versions change |
| 119 | + |
| 120 | +## Redis Memory Configuration and Tuning |
| 121 | + |
| 122 | +Flight Control uses Redis as an in-memory store, which requires careful memory management to prevent unbounded growth and ensure system stability. |
| 123 | + |
| 124 | +### Memory Configuration Parameters |
| 125 | + |
| 126 | +Redis memory usage is controlled by two key parameters: |
| 127 | + |
| 128 | +| Parameter | Description | Default | Tuning Guidance | |
| 129 | +|-----------|-------------|---------|-----------------| |
| 130 | +| **maxmemory** | Total memory limit for Redis | `1gb` | Set to 70-80% of available container memory | |
| 131 | +| **maxmemory-policy** | Eviction policy when limit reached | `allkeys-lru` | See policy recommendations below | |
| 132 | + |
| 133 | +### Memory Eviction Policies |
| 134 | + |
| 135 | +Choose the appropriate eviction policy based on your use case: |
| 136 | + |
| 137 | +| Policy | Description | Use Case | Recommendation | |
| 138 | +|--------|-------------|----------|----------------| |
| 139 | +| **allkeys-lru** | Evict least recently used keys | General caching (default) | ✅ **Recommended** for most deployments | |
| 140 | +| **allkeys-lfu** | Evict least frequently used keys | Long-running caches | Good for stable workloads | |
| 141 | +| **volatile-lru** | Evict LRU keys with expiration | Mixed cache/queue data | Use if some keys have TTL | |
| 142 | +| **noeviction** | Return errors when limit reached | Critical data preservation | ❌ **Not recommended** - causes failures | |
| 143 | + |
| 144 | +### Memory Usage Patterns |
| 145 | + |
| 146 | +Understanding Redis memory usage helps with proper sizing: |
| 147 | + |
| 148 | +#### Cache Data (Primary Memory Consumer) |
| 149 | +- **Git repository contents**: Large files, multiple versions |
| 150 | +- **HTTP response data**: External API responses |
| 151 | +- **Kubernetes secrets**: Configuration data |
| 152 | +- **Template rendering results**: Processed configurations |
| 153 | + |
| 154 | +#### Queue Data (Secondary Memory Consumer) |
| 155 | +- **Task queue messages**: Event processing data |
| 156 | +- **Failed message retry queue**: Exponential backoff storage |
| 157 | +- **In-flight task tracking**: Processing state management |
| 158 | + |
| 159 | +#### Podman Environment Variables |
| 160 | +```bash |
| 161 | +# Set environment variables before starting containers |
| 162 | +export REDIS_MAXMEMORY="2gb" |
| 163 | +export REDIS_MAXMEMORY_POLICY="allkeys-lru" |
| 164 | +export REDIS_LOGLEVEL="warning" |
| 165 | +``` |
| 166 | + |
| 167 | +### Memory Monitoring and Tuning |
| 168 | + |
| 169 | +#### Key Metrics to Monitor |
| 170 | +1. **Redis memory usage**: `redis-cli INFO memory` |
| 171 | +2. **Evicted keys count**: `redis-cli INFO stats | grep evicted` |
| 172 | +3. **Cache hit ratio**: Monitor cache effectiveness |
| 173 | +4. **Queue depth**: Monitor task processing backlog |
| 174 | + |
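One possible way to derive a hit ratio is from the `keyspace_hits` and `keyspace_misses` counters that `redis-cli INFO stats` reports, alongside `evicted_keys`. The sketch below parses that text output; the helper names are hypothetical. Note that these counters cover all key lookups on the server, not only Flight Control's cache keys.

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse `redis-cli INFO` output ("name:value" lines) into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        fields[name] = value
    return fields


def hit_ratio(fields: Dict[str, str]) -> Optional[float]:
    """Return hits / (hits + misses), or None if there were no lookups yet."""
    hits = int(fields.get("keyspace_hits", "0"))
    misses = int(fields.get("keyspace_misses", "0"))
    total = hits + misses
    return hits / total if total else None
```

The output of `redis-cli INFO stats` can be passed to `parse_info` as a string; a ratio that stays low while `evicted_keys` keeps climbing suggests the cache does not fit in the configured memory.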
| 175 | +#### Tuning Guidelines |
| 176 | + |
| 177 | +**Increase memory if:** |
| 178 | +- High eviction rates (keys being removed frequently) |
| 179 | +- Cache hit ratio below 80% |
| 180 | +- Queue processing delays due to memory pressure |
| 181 | + |
| 182 | +**Decrease memory if:** |
| 183 | +- Memory usage consistently below 50% |
| 184 | +- System has memory constraints |
| 185 | +- Other services need more memory |
| 186 | + |
| 187 | +#### Memory Calculation Formula |
| 188 | +``` |
| 189 | +Recommended Redis Memory = |
| 190 | + (Available Container Memory × 0.75) - 200MB |
| 191 | +``` |
| 192 | + |
| 193 | +Where: |
| 194 | +- `0.75` = 75% of container memory for Redis |
| 195 | +- `200MB` = Buffer for Redis overhead and OS |
| 196 | + |
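As a worked example of the formula above, the sketch below applies it to a container size in megabytes. The function name and the error handling for very small containers are illustrative additions.

```python
def recommended_redis_memory_mb(container_memory_mb: float) -> float:
    """Apply: (available container memory x 0.75) - 200 MB."""
    result = container_memory_mb * 0.75 - 200
    if result <= 0:
        # The formula gives no usable value for very small containers.
        raise ValueError("container too small for this formula")
    return result
```

For a container with 4096 MB of memory this gives 4096 × 0.75 - 200 = 2872 MB.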
| 197 | +### Configuration Examples |
| 198 | + |
| 199 | +#### Helm Chart Configuration |
| 200 | +```yaml |
| 201 | +# values.yaml |
| 202 | +kv: |
| 203 | + enabled: true |
| 204 | + maxmemory: "2gb" |
| 205 | + maxmemoryPolicy: "allkeys-lru" |
| 206 | + loglevel: "warning" |
| 207 | + resources: |
| 208 | + requests: |
| 209 | + memory: "2.5Gi" # Container memory should be > maxmemory |
| 210 | + cpu: "1000m" |
| 211 | +``` |
| 212 | + |
| 213 | +#### Podman Container Configuration |
| 214 | +```ini |
| 215 | +# flightctl-kv.container |
| 216 | +[Container] |
| 217 | +Environment=REDIS_MAXMEMORY=2gb |
| 218 | +Environment=REDIS_MAXMEMORY_POLICY=allkeys-lru |
| 219 | +Environment=REDIS_LOGLEVEL=warning |
| 220 | +``` |
| 221 | + |
| 222 | +### Troubleshooting Memory Issues |
| 223 | + |
| 224 | +#### Common Problems and Solutions |
| 225 | + |
| 226 | +**Problem**: Redis running out of memory |
| 227 | +``` |
| 228 | +Error: OOM command not allowed when used memory > 'maxmemory' |
| 229 | +``` |
| 230 | +**Solution**: Increase `maxmemory`, or switch to an eviction policy that can free keys (for example, `allkeys-lru` instead of `noeviction`) |
| 231 | + |
| 232 | +**Problem**: High eviction rates |
| 233 | +``` |
| 234 | +# Check eviction stats |
| 235 | +redis-cli INFO stats | grep evicted |
| 236 | +``` |
| 237 | +**Solution**: Increase memory allocation or optimize cache usage |
| 238 | + |
| 239 | +**Problem**: Slow queue processing |
| 240 | +**Solution**: Monitor queue depth and increase memory if needed |