Runtime Supervisor
The Runtime Supervisor is a reconciliation-oriented component responsible for keeping the homelab infrastructure aligned with its desired state. It operates in a recommendation-only mode: it detects drift and proposes actions, but never mutates the runtime directly.
Reconciliation Loop
The supervisor performs a periodic (or on-demand) reconciliation loop:
- Load Desired State: Reads configuration from the inventory model (hosts/*/services.yaml).
- Load Actual State: Reads the current world state from the filesystem-first observer output (/opt/homelab/world/).
- Detect Drift: Compares the two states to identify discrepancies.
- Recommend Actions: Uses a recommendation engine to propose remediations.
- Emit Events: Publishes reconciliation events to the platform's event system.
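The five steps above can be sketched as a single pass with each stage injected as a callable. This is a minimal illustration, not the supervisor's actual code: `reconcile_once`, `ReconcileResult`, and every parameter name are hypothetical, and the shapes of the desired and actual state are left to the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ReconcileResult:
    """Outcome of one pass (hypothetical container, not from the source)."""
    drifts: list
    recommendations: list


def reconcile_once(
    load_desired: Callable[[], Any],
    load_actual: Callable[[], Any],
    detect_drift: Callable[[Any, Any], list],
    recommend: Callable[[Any], Any],
    emit: Callable[[Any], None],
) -> ReconcileResult:
    """Run one recommendation-only reconciliation pass."""
    desired = load_desired()                  # 1. Load Desired State
    actual = load_actual()                    # 2. Load Actual State
    drifts = detect_drift(desired, actual)    # 3. Detect Drift
    recommendations = [recommend(d) for d in drifts]  # 4. Recommend Actions
    for rec in recommendations:               # 5. Emit Events
        emit(rec)
    # Nothing here mutates the runtime: the pass only reads state and emits events.
    return ReconcileResult(drifts, recommendations)
```

Injecting the stages keeps the pass easy to run on demand or from a timer, and makes each stage testable in isolation.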
Desired vs Actual State
- Desired State: Defined by the user in the inventory. It specifies which services should run on which nodes.
- Actual State (World State): Produced by the observer runtime. It represents the ground truth of what is currently happening in the physical and virtual infrastructure.
The gap between these two states is known as Drift.
Drift Conditions
The supervisor detects the following drift conditions:
- Missing Service: A service is defined in the inventory but not found in the world state.
- Unhealthy Service: A service is present but reporting a non-ok status.
- Failed Deployment: Recent deployment attempts for a service have failed.
- Offline Node: A node defined in the inventory is unreachable or inactive.
- Unresolved Incidents: Active incidents in the world state that require attention.
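As a rough sketch, the five conditions could be checked as follows. The state schema is an assumption made for illustration only: the desired state is taken to be a mapping of node name to service names, and the world state a dict with hypothetical `nodes`, `services`, and `incidents` keys. The real inventory and observer formats may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    kind: str     # e.g. "missing_service" (hypothetical identifiers)
    subject: str  # service, node, or incident the drift refers to


def detect_drift(desired: dict, actual: dict) -> list:
    """Compare desired and actual state; the dict schema is assumed, not documented."""
    drifts = []
    nodes = actual.get("nodes", {})
    services = actual.get("services", {})
    for node, wanted in sorted(desired.items()):
        # Offline Node: node is in the inventory but absent or not online.
        if not nodes.get(node, {}).get("online", False):
            drifts.append(Drift("offline_node", node))
        for name in wanted:
            svc = services.get(name)
            if svc is None:
                # Missing Service: defined in inventory, absent from world state.
                drifts.append(Drift("missing_service", name))
                continue
            if svc.get("status") != "ok":
                # Unhealthy Service: present but reporting a non-ok status.
                drifts.append(Drift("unhealthy_service", name))
            if svc.get("failed_deployments", 0) > 0:
                # Failed Deployment: recent attempts have failed.
                drifts.append(Drift("failed_deployment", name))
    for incident in actual.get("incidents", []):
        # Unresolved Incidents: still active in the world state.
        if not incident.get("resolved", False):
            drifts.append(Drift("unresolved_incident", incident["id"]))
    return drifts
```

Iterating nodes in sorted order keeps the output deterministic, so the same input state always yields the same list of drifts.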
Recommendation Engine
Based on the detected drift, the supervisor emits one of three reconciliation event types:
- reconcile_required: Immediate action recommended to restore service (e.g., redeploy unhealthy service).
- reconcile_recommended: Action suggested to improve stability or resolve minor issues (e.g., review unresolved incidents).
- reconcile_blocked: Action required but blocked by external factors or repeated failures (e.g., repeated deployment failures requiring manual diagnostics).
Examples:
- Unhealthy service -> Recommend redeploy.
- Repeated deployment failures -> Recommend diagnostics.
- Node offline -> Recommend failover review.
- Dependency unavailable -> Recommend delayed deployment.
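One way to picture the engine is as a lookup table from drift kind to event type and recommended action. The sketch below is illustrative: the drift identifiers, the `RULES` table, and the `recommend` function are hypothetical names. The first three pairings follow the examples in this section; the event types chosen for an offline node and an unavailable dependency are assumptions, since the text names only the recommended action for those cases.

```python
# Drift kind -> (event type, recommended action). Names are hypothetical.
RULES = {
    # Pairings taken from the examples in the text:
    "unhealthy_service": ("reconcile_required", "redeploy"),
    "failed_deployment": ("reconcile_blocked", "diagnostics"),
    "unresolved_incident": ("reconcile_recommended", "review incidents"),
    # Assumed event types (the text gives only the action for these):
    "offline_node": ("reconcile_required", "failover review"),
    "dependency_unavailable": ("reconcile_blocked", "delayed deployment"),
}

# Fallback for drift kinds without a rule (an assumption, not from the source).
DEFAULT_RULE = ("reconcile_recommended", "manual review")


def recommend(kind: str, subject: str) -> dict:
    """Build a reconciliation event for one detected drift."""
    event_type, action = RULES.get(kind, DEFAULT_RULE)
    return {"type": event_type, "action": action, "drift": kind, "subject": subject}
```

Because the mapping is a pure function of the drift, repeated runs over the same input produce identical recommendations.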
Summary States
The supervisor extends the platform's runtime summary states:
- nominal: Desired and actual states are in sync.
- degraded: Non-critical drift detected; reconciliation required.
- unstable: Critical drift or blocked reconciliation detected.
- reconciling: (Future) Remediations are actively being applied.
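A summary state could be derived from the events of a single run roughly as shown here. The sketch assumes each event is a dict with a `type` field and an optional `critical` flag. That flag is hypothetical, because the text does not define which drift counts as critical. The future `reconciling` state is left out, since nothing applies remediations yet.

```python
def summarize(events: list) -> str:
    """Collapse one run's reconciliation events into a summary state."""
    if not events:
        # No drift detected: desired and actual states are in sync.
        return "nominal"
    for event in events:
        # Blocked reconciliation or critical drift makes the runtime unstable.
        # The "critical" flag is an assumed field, not part of the source.
        if event["type"] == "reconcile_blocked" or event.get("critical", False):
            return "unstable"
    # Any remaining drift is treated as non-critical.
    return "degraded"
```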
Future Autonomous Remediation Architecture
While the current stage is recommendation-only, the architecture is designed for future autonomy:
- Policy Engine: A future component will ingest reconciliation recommendations and apply policies to decide whether to auto-remediate.
- Executor: A mutation-capable runtime that will execute the proposed actions (restarts, redeploys, failovers).
- Closed-Loop Feedback: The observer will immediately reflect the results of remediations, allowing the supervisor to verify success or escalate if the drift persists.
- Guardrails: Implementation of rate-limiting, maintenance windows, and manual overrides to ensure autonomous actions remain safe.
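Since none of this exists yet, the following is only a speculative sketch of how a policy gate might combine the guardrails listed above. Every name and threshold is hypothetical. It also assumes that automatic actions are permitted only inside a maintenance window; the opposite reading, suppressing automation during maintenance, is equally plausible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    max_actions_per_hour: int = 3   # rate limit (hypothetical value)
    window_start_hour: int = 2      # maintenance window start, inclusive
    window_end_hour: int = 5        # maintenance window end, exclusive
    manual_override: bool = False   # operator switch that disables automation


def may_auto_remediate(
    event_type: str,
    hour: int,
    actions_last_hour: int,
    guardrails: Guardrails = Guardrails(),
) -> bool:
    """Decide whether a recommendation may be executed without a human."""
    if guardrails.manual_override:
        return False  # a manual override always wins
    if event_type != "reconcile_required":
        return False  # blocked or merely suggested actions stay manual
    if not (guardrails.window_start_hour <= hour < guardrails.window_end_hour):
        return False  # outside the assumed maintenance window
    return actions_last_hour < guardrails.max_actions_per_hour
```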
Filesystem-First Design
The supervisor adheres to the platform's filesystem-first philosophy:
- Input: Files in /opt/homelab/world/ and hosts/.
- Output: Append-only logs in /tmp/agent-events.log.
- Persistence: Checkpoints stored in /tmp/supervisor-checkpoint.json.
- Idempotency: Every run is independent and produces the same recommendations for the same input state.
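The output and persistence conventions above might look like this in practice. The two paths come from this document; the function names and the choice of one JSON object per log line are assumptions, as the actual log and checkpoint formats are not specified here.

```python
import json
import os
import tempfile
from pathlib import Path

EVENT_LOG = Path("/tmp/agent-events.log")
CHECKPOINT = Path("/tmp/supervisor-checkpoint.json")


def append_event(event: dict, log_path: Path = EVENT_LOG) -> None:
    """Append one event as a JSON line; existing lines are never rewritten."""
    line = json.dumps(event, sort_keys=True)
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def write_checkpoint(state: dict, path: Path = CHECKPOINT) -> None:
    """Replace the checkpoint atomically so readers never see a partial file."""
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, sort_keys=True)
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)  # clean up the temporary file on any failure
        raise
```

Opening the log in append mode matches the append-only convention, and writing the checkpoint through a temporary file plus `os.replace` avoids leaving a half-written file if a run is interrupted.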