Commit 3a6f6ee

Update docs
1 parent a3adb20 commit 3a6f6ee

19 files changed

+2194
-223
lines changed
Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
# Key-Value Store Architecture

Flight Control uses Redis as a key-value store for two primary purposes: **caching external configuration data** and **managing an event-driven task queue**. This document describes both use cases and the resilience mechanisms that ensure system reliability.

## Overview

The key-value store serves as:

1. **Cache Layer**: Stores external configuration data (Git repositories, HTTP endpoints, Kubernetes secrets)
2. **Event Queue**: Manages asynchronous task processing through Redis Streams
3. **Resilience Backend**: Provides automatic recovery from failures

## Caching External Configuration Data

Flight Control caches external configuration sources to improve performance and reduce load on external systems. The cache is organized by organization, fleet, and template version to ensure proper isolation.

### Cached Data Types

| Data Type | Key Pattern | Description |
|-----------|-------------|-------------|
| **Git Repository URLs** | `v1/{orgId}/{fleet}/{templateVersion}/repo-url/{repository}` | Repository URL mappings |
| **Git Revisions** | `v1/{orgId}/{fleet}/{templateVersion}/git-hash/{repository}/{targetRevision}` | Git commit hashes for specific revisions |
| **Git File Contents** | `v1/{orgId}/{fleet}/{templateVersion}/git-data/{repository}/{targetRevision}/{path}` | Actual file contents from Git repositories |
| **Kubernetes Secrets** | `v1/{orgId}/{fleet}/{templateVersion}/k8ssecret-data/{namespace}/{name}` | Secret data from Kubernetes clusters |
| **HTTP Response Data** | `v1/{orgId}/{fleet}/{templateVersion}/http-data/{md5(url)}` | Content fetched from HTTP endpoints |
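
The key patterns above can be illustrated with a small sketch. This is not Flight Control's implementation: the function names are hypothetical, and the hex encoding of the MD5 digest is an assumption, since the table only says `md5(url)`.

```python
import hashlib

KEY_VERSION = "v1"  # prefix shown in the key patterns above


def scope_prefix(org_id, fleet, template_version):
    """Common prefix shared by all cache keys of one template version."""
    return f"{KEY_VERSION}/{org_id}/{fleet}/{template_version}"


def git_data_key(org_id, fleet, template_version, repository, target_revision, path):
    """Key for file contents fetched from a Git repository."""
    prefix = scope_prefix(org_id, fleet, template_version)
    return f"{prefix}/git-data/{repository}/{target_revision}/{path}"


def http_data_key(org_id, fleet, template_version, url):
    """Key for HTTP response data; the URL is replaced by its MD5 digest.

    Assumption: the digest is hex-encoded. The table only says md5(url).
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"{scope_prefix(org_id, fleet, template_version)}/http-data/{digest}"


print(git_data_key("org1", "fleet-a", "tv-7", "config-repo", "main", "etc/app.conf"))
```

Because every key starts with the same organization, fleet, and template version prefix, all entries for one template version can be located together, which fits the per-version isolation described above.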

### Cache Behavior

- **Cache Keys**: Automatically scoped by organization, fleet, and template version
- **Cache Invalidation**: Keys are deleted when template versions change
- **Cache Miss Handling**: External sources are fetched on-demand when cache misses occur
- **Atomic Operations**: Uses a custom Lua script to implement get-or-set-if-not-exists behavior, preventing race conditions during concurrent access
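
The get-or-set-if-not-exists behavior can be modeled in memory. The sketch below is not the Lua script itself; it only mirrors the semantics, with a lock standing in for the atomicity that Redis gives to script execution. The class and method names are hypothetical.

```python
import threading


class InMemoryCache:
    """Stand-in for the key-value store; models semantics only, not Redis."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # plays the role of script atomicity

    def get_or_set_if_absent(self, key, value):
        """Return the stored value, storing `value` first if the key is absent."""
        with self._lock:
            return self._data.setdefault(key, value)


cache = InMemoryCache()
key = "v1/org/fleet/tv/repo-url/r"
first = cache.get_or_set_if_absent(key, "value1")   # key absent: stores value1
second = cache.get_or_set_if_absent(key, "value2")  # key present: value2 is ignored
```

Whichever caller writes first wins, and every later caller receives that same value, so concurrent renders of one template version see consistent data.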

### Important Cache Considerations

⚠️ **Cache Consistency Warning**: When external configuration sources change without updating their references (branch/tag/URL), devices may experience configuration drift:

- **Before cache deletion**: Some devices get `value1` from cache
- **After cache deletion**: Other devices get `value2` from fresh fetch
- **Result**: Inconsistent device configurations across the fleet

**Best Practice**: Always update branch names, tags, or URLs when changing external configuration content to ensure cache consistency.

## Event-Driven Task Queue

Flight Control uses Redis Streams with consumer groups to process events asynchronously. Events are published to the `task-queue` stream and processed by worker components.

### Event Processing Flow

```mermaid
flowchart TD
    A[API Event Created] --> B[Event Published to task-queue]
    B --> C[Worker Consumes Event]
    C --> D{Event Type Analysis}

    D --> E[Fleet Rollout Task]
    D --> F[Fleet Selector Matching Task]
    D --> G[Fleet Validation Task]
    D --> H[Device Render Task]
    D --> I[Repository Update Task]

    E --> J[Task Completion]
    F --> J
    G --> J
    H --> J
    I --> J

    J --> K[Event Acknowledged]
    K --> L[Checkpoint Advanced]
```

### Event-to-Task Mapping

| Task | Triggering Events | Description |
|------|------------------|-------------|
| **Fleet Rollout** | • Device owner/labels updated<br/>• Device created<br/>• Fleet rollout batch dispatched<br/>• Fleet rollout started (immediate strategy) | Manages device configuration updates according to fleet templates |
| **Fleet Selector Matching** | • Fleet label selector updated<br/>• Fleet created/deleted<br/>• Device created<br/>• Device labels updated | Matches devices to fleets based on label selectors |
| **Fleet Validation** | • Fleet template updated<br/>• Fleet created<br/>• Referenced repository updated | Validates fleet templates and creates template versions |
| **Device Render** | • Device spec updated<br/>• Device created<br/>• Fleet rollout device selected<br/>• Referenced repository updated | Renders device configurations from templates |
| **Repository Update** | • Repository spec updated<br/>• Repository created | Updates repository references and invalidates related caches |
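
Since one event can trigger several tasks, the table reads naturally as a lookup from event to task list. The sketch below encodes it that way. The event labels are descriptive strings copied from the table, not real identifiers, and the combined entries ("owner/labels", "created/deleted") are split into separate labels as an assumption.

```python
# Event labels are descriptive strings taken from the table, not real identifiers.
TASK_TRIGGERS = {
    "Fleet Rollout": {
        "Device owner updated", "Device labels updated", "Device created",
        "Fleet rollout batch dispatched", "Fleet rollout started (immediate strategy)",
    },
    "Fleet Selector Matching": {
        "Fleet label selector updated", "Fleet created", "Fleet deleted",
        "Device created", "Device labels updated",
    },
    "Fleet Validation": {
        "Fleet template updated", "Fleet created", "Referenced repository updated",
    },
    "Device Render": {
        "Device spec updated", "Device created", "Fleet rollout device selected",
        "Referenced repository updated",
    },
    "Repository Update": {"Repository spec updated", "Repository created"},
}


def tasks_for(event):
    """Return the names of all tasks triggered by an event, sorted alphabetically."""
    return sorted(task for task, events in TASK_TRIGGERS.items() if event in events)


print(tasks_for("Device created"))
```

For example, creating a device fans out to three tasks (render, rollout, and selector matching), while an unknown event maps to an empty list.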

### Queue Management Features

- **Consumer Groups**: Automatic message tracking and load balancing
- **Message Acknowledgment**: Messages are acknowledged after successful processing
- **Timeout Handling**: Messages that exceed processing timeout are automatically retried
- **Failed Message Handling**: Failures are retried with exponential backoff until a maximum number of retries, after which an event is emitted notifying about a permanent failure
- **Checkpoint Tracking**: Global checkpoint ensures no message loss during failures
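
The failed-message handling described above can be sketched as a retry loop with exponential backoff. The base delay, cap, and retry count below are hypothetical values chosen for illustration; the document does not state the real settings, and the function names are not Flight Control's.

```python
import time


def retry_delays(base_seconds, max_retries, cap_seconds):
    """Delay before each retry: base, 2*base, 4*base, ... capped at cap_seconds."""
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(max_retries)]


def process_with_retries(handler, message, max_retries=3,
                         base_seconds=1.0, cap_seconds=60.0, sleep=time.sleep):
    """Return "acknowledged" on success, "permanent-failure" once retries run out."""
    delays = retry_delays(base_seconds, max_retries, cap_seconds)
    for attempt in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            handler(message)
            return "acknowledged"
        except Exception:
            if attempt < max_retries:
                sleep(delays[attempt])
    return "permanent-failure"  # the point where a failure event would be emitted


def always_fails(message):
    raise RuntimeError("simulated task failure")


waited = []  # record delays instead of sleeping, so the demo runs instantly
status = process_with_retries(always_fails, {"kind": "demo"}, sleep=waited.append)
```

Doubling the delay spaces retries out so that a struggling dependency is not hit at a constant rate, and the cap keeps the longest wait bounded.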

## Resilience and Recovery

Flight Control implements a **dual-persistence architecture** so that events are not lost when Redis fails. Delivery is at-least-once, so an event may be processed more than once.

### Architecture Components

1. **Redis Streams**: Primary queue for fast event processing
2. **PostgreSQL Database**: Persistent storage for events and checkpoints
3. **Recovery Mechanism**: Automatic event republishing from database

### Recovery Process

When Redis fails or is restarted:

1. **Checkpoint Detection**: System detects missing Redis checkpoint
2. **Database Checkpoint Retrieval**: Last known checkpoint is retrieved from PostgreSQL
3. **Event Republishing**: All events since the last checkpoint are republished to Redis
4. **Queue Restoration**: Fresh Redis instance receives all missed events
5. **Normal Operation**: Processing resumes; events since the checkpoint may be reprocessed. Handlers must be idempotent.

Note: The replay window is the time between the last persisted checkpoint and now. Persisting checkpoints more frequently shortens the window and reduces duplicate processing.
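
The recovery steps and the idempotency requirement can be sketched together. In this model the checkpoint is a timestamp and events carry a creation time; both are assumptions, as are all names and sample values. Persisted events are held in a plain list rather than PostgreSQL.

```python
def events_to_replay(db_events, checkpoint_ts):
    """Select persisted events created at or after the checkpoint.

    db_events: list of (event_id, created_ts, payload) tuples.
    Using >= replays the boundary event too, which at-least-once delivery permits.
    """
    return [e for e in db_events if e[1] >= checkpoint_ts]


class IdempotentHandler:
    """Applies each event at most once by remembering processed event IDs."""

    def __init__(self):
        self.applied = []
        self._seen = set()

    def handle(self, event):
        event_id = event[0]
        if event_id in self._seen:
            return False  # duplicate from replay; safe to skip
        self._seen.add(event_id)
        self.applied.append(event_id)
        return True


db_events = [("e1", 100, "a"), ("e2", 110, "b"), ("e3", 120, "c")]
handler = IdempotentHandler()
for event in db_events[:2]:  # e1 and e2 were processed before Redis was lost
    handler.handle(event)

replayed = events_to_replay(db_events, checkpoint_ts=110)
results = [handler.handle(event) for event in replayed]
```

Here `e2` is replayed although it was already processed, and the handler skips it. Without that duplicate check the replay would apply `e2` twice, which is why handlers need to be idempotent.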

### Recovery Limitations

⚠️ **Important**: The resilience mechanism only replays **events in the queue**. It does not restore cached external configuration data:

- **Events**: Automatically republished from PostgreSQL database
- **Cache Data**: Must be re-fetched from external sources (Git, HTTP, Kubernetes)
- **Cache Invalidation**: Occurs automatically when template versions change

## Redis Memory Configuration and Tuning

Flight Control uses Redis as an in-memory store, which requires careful memory management to prevent unbounded growth and ensure system stability.

### Memory Configuration Parameters

Redis memory usage is controlled by two key parameters:

| Parameter | Description | Default | Tuning Guidance |
|-----------|-------------|---------|-----------------|
| **maxmemory** | Total memory limit for Redis | `1gb` | Set to 70-80% of available container memory |
| **maxmemory-policy** | Eviction policy when limit reached | `allkeys-lru` | See policy recommendations below |

### Memory Eviction Policies

Choose the appropriate eviction policy based on your use case:

| Policy | Description | Use Case | Recommendation |
|--------|-------------|----------|----------------|
| **allkeys-lru** | Evict least recently used keys | General caching (default) | **Recommended** for most deployments |
| **allkeys-lfu** | Evict least frequently used keys | Long-running caches | Good for stable workloads |
| **volatile-lru** | Evict LRU keys with expiration | Mixed cache/queue data | Use if some keys have TTL |
| **noeviction** | Return errors when limit reached | Critical data preservation | **Not recommended** - causes failures |

### Memory Usage Patterns

Understanding Redis memory usage helps with proper sizing:

#### Cache Data (Primary Memory Consumer)

- **Git repository contents**: Large files, multiple versions
- **HTTP response data**: External API responses
- **Kubernetes secrets**: Configuration data
- **Template rendering results**: Processed configurations

#### Queue Data (Secondary Memory Consumer)

- **Task queue messages**: Event processing data
- **Failed message retry queue**: Exponential backoff storage
- **In-flight task tracking**: Processing state management

#### Podman Environment Variables

```bash
# Set environment variables before starting containers
export REDIS_MAXMEMORY="2gb"
export REDIS_MAXMEMORY_POLICY="allkeys-lru"
export REDIS_LOGLEVEL="warning"
```

### Memory Monitoring and Tuning

#### Key Metrics to Monitor

1. **Redis memory usage**: `redis-cli INFO memory`
2. **Evicted keys count**: `redis-cli INFO stats | grep evicted`
3. **Cache hit ratio**: `keyspace_hits / (keyspace_hits + keyspace_misses)`, using the counters from `redis-cli INFO stats`
4. **Queue depth**: Monitor task processing backlog
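
A hit ratio can be derived from the `keyspace_hits` and `keyspace_misses` counters that Redis reports in `INFO stats`. The sketch below parses the text output of `redis-cli INFO stats`. The sample numbers are made up for illustration, and the counters are server-wide, so the result only approximates the effectiveness of the configuration cache.

```python
def parse_info(text):
    """Parse INFO output: 'key:value' lines, with '#' lines as section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def hit_ratio(fields):
    """Return hits / (hits + misses), or None if no lookups were recorded."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


# Illustrative sample only; real values come from `redis-cli INFO stats`.
SAMPLE_INFO = """\
# Stats
keyspace_hits:900
keyspace_misses:100
evicted_keys:42
"""

fields = parse_info(SAMPLE_INFO)
ratio = hit_ratio(fields)
```

With these sample counters the ratio is 0.9, which can be compared against the 80% threshold in the tuning guidelines.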

#### Tuning Guidelines

**Increase memory if:**

- High eviction rates (keys being removed frequently)
- Cache hit ratio below 80%
- Queue processing delays due to memory pressure

**Decrease memory if:**

- Memory usage consistently below 50%
- System has memory constraints
- Other services need more memory

#### Memory Calculation Formula

```
Recommended Redis Memory =
  (Available Container Memory × 0.75) - 200MB
```

Where:

- `0.75` = 75% of container memory for Redis
- `200MB` = Buffer for Redis overhead and OS
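
The formula can be applied as follows. This is a minimal sketch: the function names are hypothetical, and it treats the 200MB buffer as 200 MiB and works in whole MiB, which is an assumption about units. Redis reads a `maxmemory` suffix of `mb` as 1024 × 1024 bytes.

```python
def recommended_maxmemory_mib(container_memory_mib, fraction=0.75, buffer_mib=200):
    """Apply (container memory x fraction) - buffer, in MiB."""
    result = int(container_memory_mib * fraction) - buffer_mib
    if result <= 0:
        raise ValueError("container memory is too small for the formula")
    return result


def to_redis_size(mib):
    """Format a MiB count as a Redis maxmemory value such as '1720mb'."""
    return f"{mib}mb"


# A 2.5 GiB container is 2560 MiB: 2560 * 0.75 = 1920, minus 200 gives 1720.
print(to_redis_size(recommended_maxmemory_mib(2560)))
```

For a 2.5 GiB container this yields about 1720 MiB, and for a 4 GiB container about 2872 MiB.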

### Configuration Examples

#### Helm Chart Configuration

```yaml
# values.yaml
kv:
  enabled: true
  maxmemory: "2gb"
  maxmemoryPolicy: "allkeys-lru"
  loglevel: "warning"
  resources:
    requests:
      memory: "2.5Gi" # Container memory should be > maxmemory
      cpu: "1000m"
```

#### Podman Container Configuration

```ini
# flightctl-kv.container
[Container]
Environment=REDIS_MAXMEMORY=2gb
Environment=REDIS_MAXMEMORY_POLICY=allkeys-lru
Environment=REDIS_LOGLEVEL=warning
```

### Troubleshooting Memory Issues

#### Common Problems and Solutions

**Problem**: Redis running out of memory

```
Error: OOM command not allowed when used memory > 'maxmemory'
```

**Solution**: Increase `maxmemory`, or set `maxmemory-policy` to an evicting policy such as `allkeys-lru`. Redis returns this error when the limit is reached and no keys can be evicted, for example under `noeviction`.

**Problem**: High eviction rates

```
# Check eviction stats
redis-cli INFO stats | grep evicted
```

**Solution**: Increase memory allocation or optimize cache usage

**Problem**: Slow queue processing

**Solution**: Monitor queue depth and increase memory if needed

docs/developer/architecture/tasks.md

Lines changed: 0 additions & 44 deletions
This file was deleted.

docs/developer/index.md

Lines changed: 8 additions & 3 deletions
@@ -2,7 +2,7 @@
22
title: "Developer Documentation"
33
---
44

5-
# Flight Control Developer Documentation
5+
# Developer Documentation
66
Flight Control is a service for declarative, GitOps-driven management of edge device fleets running [ostree-based](https://github.com/ostreedev/ostree) Linux system images.
77

88
> [!NOTE]
@@ -11,7 +11,7 @@ Flight Control is a service for declarative, GitOps-driven management of edge de
1111
## Building
1212

1313
Prerequisites:
14-
* `git`, `make`, and `go` (>= 1.23), `openssl`, `openssl-devel`, `buildah`, `podman`, `podman-compose`, and `go-rpm-macros` (in case one needs to build RPM's)
14+
* `git`, `make`, and `go` (>= 1.23), `openssl`, `openssl-devel`, `buildah`, `podman`, `podman-compose`, `container-selinux` (>= 2.241) and `go-rpm-macros` (in case one needs to build RPM's)
1515

1616
Flightctl agent reports the status of running rootless containers. Ensure the podman socket is enabled:
1717

@@ -64,6 +64,7 @@ Use the `flightctl` CLI to login and then apply, get, or delete resources:
6464
bin/flightctl login $(cat ~/.flightctl/client.yaml | grep server | awk '{print $2}') --web --certificate-authority ~/.flightctl/certs/ca.crt
6565
bin/flightctl apply -f examples/fleet.yaml
6666
bin/flightctl get fleets
67+
bin/flightctl get fleet fleet1 fleet2 # Get multiple specific resources
6768
```
6869

6970
Note: If deployed without auth enabled, then there is no need to login.
@@ -94,9 +95,11 @@ make agent-vm agent-vm-console # user/password is user/user
9495
```
9596

9697
The agent-vm target accepts multiple parameters:
98+
9799
- VMNAME: the name of the VM to create (default: flightctl-device-default)
98100
- VMCPUS: the number of CPUs to allocate to the VM (default: 1)
99-
- VMMEM: the amount of memory to allocate to the VM (default: 512)
101+
- VMRAM: the amount of memory to allocate to the VM (default: 512)
102+
- VMDISKSIZE: the disk size for the VM (default: 10G)
100103
- VMWAIT: the amount of minutes to wait on the console during first boot (default: 0)
101104

102105
It is possible to create multiple VMs with different names:
@@ -154,3 +157,5 @@ make deploy-e2e-extras
154157
```
155158
156159
The Prometheus web UI is then accessible on `http://localhost:9090`
160+
161+
For detailed information about the metrics system architecture, see [Metrics Architecture](architecture/metrics.md).
