agent-system/docs/runtime-supervisor.md
2026-05-12 17:30:00 +02:00

# Runtime Supervisor
The Runtime Supervisor is a reconciliation-oriented component that checks whether the homelab infrastructure matches its desired state. It operates in recommendation-only mode: it detects drift and proposes actions, but never mutates the runtime itself.
## Reconciliation Loop
The supervisor performs a periodic (or on-demand) reconciliation loop:
1. **Load Desired State**: Reads configuration from the inventory model (`hosts/*/services.yaml`).
2. **Load Actual State**: Reads the current world state from the filesystem-first observer output (`/opt/homelab/world/`).
3. **Detect Drift**: Compares the two states to identify discrepancies.
4. **Recommend Actions**: Uses a recommendation engine to propose remediations.
5. **Emit Events**: Publishes reconciliation events to the platform's event system.
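
The five steps above can be sketched as a single function that wires the stages together. This is a minimal illustration, not the supervisor's actual code: `reconcile_once` and its callable parameters are hypothetical names, and the loaders are injected so the sketch does not assume any particular YAML or world-state parser.

```python
from typing import Callable, List


def reconcile_once(
    load_desired: Callable[[], dict],
    load_actual: Callable[[], dict],
    detect_drift: Callable[[dict, dict], List[dict]],
    recommend: Callable[[dict], dict],
    emit: Callable[[dict], None],
) -> List[dict]:
    """Run one reconciliation pass and return the emitted events.

    All five callables are hypothetical stand-ins for the real stages.
    """
    desired = load_desired()                   # 1. Load Desired State
    actual = load_actual()                     # 2. Load Actual State
    drifts = detect_drift(desired, actual)     # 3. Detect Drift
    events = [recommend(d) for d in drifts]    # 4. Recommend Actions
    for event in events:                       # 5. Emit Events
        emit(event)
    return events
```

Because nothing in the pass mutates the runtime, the same function can serve both the periodic and the on-demand trigger.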
## Desired vs Actual State
- **Desired State**: Defined by the user in the inventory. It specifies which services should run on which nodes.
- **Actual State (World State)**: Produced by the observer runtime. It represents the ground truth of what is currently happening in the physical and virtual infrastructure.

The gap between these two states is known as **Drift**.
## Drift Conditions
The supervisor detects the following drift conditions:
- **Missing Service**: A service is defined in the inventory but not found in the world state.
- **Unhealthy Service**: A service is present but reporting a non-ok status.
- **Failed Deployment**: Recent deployment attempts for a service have failed.
- **Offline Node**: A node defined in the inventory is unreachable or inactive.
- **Unresolved Incidents**: Active incidents in the world state that require attention.
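
As a rough sketch, the five conditions above could be checked as follows. The data shapes are assumptions made for illustration: `desired` maps node names to lists of service names, and `world` is a dictionary with `nodes`, `services`, `deployments`, and `incidents` keys. The real inventory and world-state schemas may differ, and the drift kind identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Drift:
    kind: str    # hypothetical identifier, e.g. "missing_service"
    target: str  # service name, node name, or incident id


def detect_drift(desired: dict, world: dict) -> List[Drift]:
    """Compare desired and actual state; both schemas are assumed."""
    drifts: List[Drift] = []
    nodes = world.get("nodes", {})
    services = world.get("services", {})
    deployments = world.get("deployments", {})

    for node, wanted in sorted(desired.items()):
        # Offline Node: node is in the inventory but not reported online.
        if not nodes.get(node, {}).get("online", False):
            drifts.append(Drift("offline_node", node))
        for name in wanted:
            svc = services.get(name)
            if svc is None:
                # Missing Service: defined but absent from the world state.
                drifts.append(Drift("missing_service", name))
            elif svc.get("status") != "ok":
                # Unhealthy Service: present but reporting a non-ok status.
                drifts.append(Drift("unhealthy_service", name))
            history = deployments.get(name, [])
            if history and history[-1] == "failed":
                # Failed Deployment: the most recent attempt failed.
                drifts.append(Drift("failed_deployment", name))

    # Unresolved Incidents: anything not marked resolved.
    for incident in world.get("incidents", []):
        if not incident.get("resolved", False):
            drifts.append(Drift("unresolved_incident", incident["id"]))
    return drifts
```

Iterating over the inventory in sorted order keeps the output stable for identical input.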
## Recommendation Engine
Based on the detected drift, the supervisor emits one of three reconciliation event types:
- `reconcile_required`: Immediate action recommended to restore service (e.g., redeploy unhealthy service).
- `reconcile_recommended`: Action suggested to improve stability or resolve minor issues (e.g., review unresolved incidents).
- `reconcile_blocked`: Action required but blocked by external factors or repeated failures (e.g., repeated deployment failures requiring manual diagnostics).
### Examples:
- **Unhealthy service** -> Recommend `redeploy`.
- **Repeated deployment failures** -> Recommend `diagnostics`.
- **Node offline** -> Recommend `failover review`.
- **Dependency unavailable** -> Recommend `delayed deployment`.
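
One way to express this mapping is a lookup table from drift kind to event type and action. The sketch below is illustrative only: the drift kind identifiers, the `recommend` function, and the fallback for unknown kinds are hypothetical. The event types for an offline node and an unavailable dependency are not stated in this document and are assumptions.

```python
# Drift kind -> (event type, recommended action).
RECOMMENDATIONS = {
    "unhealthy_service": ("reconcile_required", "redeploy"),
    "repeated_deployment_failures": ("reconcile_blocked", "diagnostics"),
    "unresolved_incident": ("reconcile_recommended", "review"),
    # Assumed event types (the document names only the actions):
    "offline_node": ("reconcile_required", "failover review"),
    "dependency_unavailable": ("reconcile_blocked", "delayed deployment"),
}


def recommend(kind: str, target: str) -> dict:
    """Build a reconciliation event for one detected drift."""
    # Assumption: unknown drift kinds fall back to a low-urgency review.
    event_type, action = RECOMMENDATIONS.get(
        kind, ("reconcile_recommended", "review")
    )
    return {"type": event_type, "action": action, "drift": kind, "target": target}
```

Keeping the mapping in data rather than in branching code makes it straightforward to audit which drift leads to which recommendation.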
## Summary States
The supervisor extends the platform's runtime summary states:
- `nominal`: Desired and actual states are in sync.
- `degraded`: Non-critical drift detected; reconciliation required.
- `unstable`: Critical drift or blocked reconciliation detected.
- `reconciling`: (Future) Remediations are actively being applied.
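
A summary state could be derived from the events of a single run roughly as follows. This is a sketch under assumptions: the document does not define what makes drift critical, so the example relies on a hypothetical `critical` flag on each event, and it leaves out `reconciling` because that state is reserved for future use.

```python
from typing import List


def summarize(events: List[dict]) -> str:
    """Reduce one run's reconciliation events to a summary state."""
    if not events:
        return "nominal"  # desired and actual states are in sync
    for event in events:
        # Blocked reconciliation or drift flagged critical (assumed flag).
        if event["type"] == "reconcile_blocked" or event.get("critical", False):
            return "unstable"
    return "degraded"  # drift exists, but none of it is critical or blocked
```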
## Future Autonomous Remediation Architecture
While the current stage is recommendation-only, the architecture is designed for future autonomy:
1. **Policy Engine**: A future component will ingest reconciliation recommendations and apply policies to decide whether to auto-remediate.
2. **Executor**: A mutation-capable runtime that will execute the proposed actions (restarts, redeploys, failovers).
3. **Closed-Loop Feedback**: The observer will immediately reflect the results of remediations, allowing the supervisor to verify success or escalate if the drift persists.
4. **Guardrails**: Implementation of rate-limiting, maintenance windows, and manual overrides to ensure autonomous actions remain safe.
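
To make the guardrail idea concrete, the sketch below shows one possible gate a future policy engine might apply before handing an action to the executor. Everything here is hypothetical: the function name, its parameters, the one-hour rate window, and in particular the assumption that auto-remediation is suppressed during a maintenance window rather than restricted to it.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def may_auto_remediate(
    now: datetime,
    recent_actions: List[datetime],
    max_per_hour: int,
    maintenance_windows: List[Tuple[datetime, datetime]],
    manual_hold: bool,
) -> bool:
    """Return True only if every guardrail allows an automatic action."""
    # Manual override: an operator has paused automation.
    if manual_hold:
        return False
    # Maintenance window (assumed semantics): no automatic actions inside it.
    if any(start <= now < end for start, end in maintenance_windows):
        return False
    # Rate limit: count actions taken within the last hour.
    cutoff = now - timedelta(hours=1)
    recent = [t for t in recent_actions if t > cutoff]
    return len(recent) < max_per_hour
```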
## Filesystem-First Design
The supervisor adheres to the platform's filesystem-first philosophy:
- **Input**: Files in `/opt/homelab/world/` and `hosts/`.
- **Output**: An append-only event log at `/tmp/agent-events.log`.
- **Persistence**: Checkpoints stored in `/tmp/supervisor-checkpoint.json`.
- **Idempotency**: Every run is independent and produces the same recommendations for the same input state.
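
The output and persistence points can be illustrated with two small helpers. This is a sketch, not the supervisor's implementation: the function names are hypothetical, the one-JSON-object-per-line log format is an assumption, and the default paths are simply the ones listed above.

```python
import json
import os
import tempfile
from pathlib import Path


def append_event(event: dict, log_path: str = "/tmp/agent-events.log") -> None:
    """Append one event as a single JSON line (format is an assumption)."""
    line = json.dumps(event, sort_keys=True)
    # Mode "a" only ever adds to the end of the file.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def write_checkpoint(
    state: dict, path: str = "/tmp/supervisor-checkpoint.json"
) -> None:
    """Replace the checkpoint atomically so readers never see a partial file."""
    target = Path(path)
    # Write to a temporary file in the same directory, then rename over
    # the target; os.replace is atomic on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name + ".", suffix=".tmp"
    )
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, sort_keys=True)
    os.replace(tmp_name, str(target))
```

Serializing with `sort_keys=True` means identical events always produce identical bytes, which makes the output of repeated runs on the same input easy to compare.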