# Runtime Supervisor

The Runtime Supervisor is a reconciliation-oriented component responsible for ensuring the homelab infrastructure aligns with its desired state. It operates in a recommendation-only mode, detecting drifts and proposing actions without directly mutating the runtime.

## Reconciliation Loop

The supervisor performs a periodic (or on-demand) reconciliation loop:

1. **Load Desired State**: Reads configuration from the inventory model (`hosts/*/services.yaml`).
2. **Load Actual State**: Reads the current world state from the filesystem-first observer output (`/opt/homelab/world/`).
3. **Detect Drift**: Compares the two states to identify discrepancies.
4. **Recommend Actions**: Uses a recommendation engine to propose remediations.
5. **Emit Events**: Publishes reconciliation events to the platform's event system.

## Desired vs Actual State

- **Desired State**: Defined by the user in the inventory. It specifies which services should run on which nodes.
- **Actual State (World State)**: Produced by the observer runtime. It represents the ground truth of what is currently happening in the physical and virtual infrastructure.

The gap between these two states is known as **Drift**.

## Drift Conditions

The supervisor detects the following drift conditions:

- **Missing Service**: A service is defined in the inventory but not found in the world state.
- **Unhealthy Service**: A service is present but reporting a non-ok status.
- **Failed Deployment**: Recent deployment attempts for a service have failed.
- **Offline Node**: A node defined in the inventory is unreachable or inactive.
- **Unresolved Incidents**: Active incidents in the world state that require attention.

## Recommendation Engine

Based on the detected drift, the supervisor emits one of three reconciliation event types:

- `reconcile_required`: Immediate action recommended to restore service (e.g., redeploy unhealthy service).
- `reconcile_recommended`: Action suggested to improve stability or resolve minor issues (e.g., review unresolved incidents).
- `reconcile_blocked`: Action required but blocked by external factors or repeated failures (e.g., repeated deployment failures requiring manual diagnostics).

### Examples:

- **Unhealthy service** -> Recommend `redeploy`.
- **Repeated deployment failures** -> Recommend `diagnostics`.
- **Node offline** -> Recommend `failover review`.
- **Dependency unavailable** -> Recommend `delayed deployment`.

## Summary States

The supervisor extends the platform's runtime summary states:

- `nominal`: Desired and actual states are in sync.
- `degraded`: Non-critical drift detected; reconciliation required.
- `unstable`: Critical drift or blocked reconciliation detected.
- `reconciling`: (Future) Remediations are actively being applied.

## Future Autonomous Remediation Architecture

While the current stage is recommendation-only, the architecture is designed for future autonomy:

1. **Policy Engine**: A future component will ingest reconciliation recommendations and apply policies to decide whether to auto-remediate.
2. **Executor**: A mutation-capable runtime that will execute the proposed actions (restarts, redeploys, failovers).
3. **Closed-Loop Feedback**: The observer will immediately reflect the results of remediations, allowing the supervisor to verify success or escalate if the drift persists.
4. **Guardrails**: Implementation of rate-limiting, maintenance windows, and manual overrides to ensure autonomous actions remain safe.

## Filesystem-First Design

The supervisor adheres to the platform's filesystem-first philosophy:

- **Input**: Files in `/opt/homelab/world/` and `hosts/`.
- **Output**: Append-only logs in `/tmp/agent-events.log`.
- **Persistence**: Checkpoints stored in `/tmp/supervisor-checkpoint.json`.
- **Idempotency**: Every run is independent and produces the same recommendations for the same input state.
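
To make the drift conditions, recommendation mapping, and summary states described in this document more concrete, here is a minimal Python sketch. It is not the supervisor's actual implementation. The in-memory dictionary shapes, the `Drift` class, the function names, the threshold of two failed deployments, and the event types assigned to missing services and offline nodes are all assumptions made for illustration; only the event type names, action names, and summary state names come from the text above. The sketch works on plain dictionaries rather than reading `hosts/*/services.yaml` or `/opt/homelab/world/`, since the on-disk schema is not specified here.

```python
from dataclasses import dataclass

# Hypothetical mapping from drift kind to (event type, recommended action).
# The pairings for "missing_service" and "offline_node" are assumptions.
RECOMMENDATIONS = {
    "unhealthy_service": ("reconcile_required", "redeploy"),
    "failed_deployment": ("reconcile_blocked", "diagnostics"),
    "offline_node": ("reconcile_required", "failover review"),
    "unresolved_incident": ("reconcile_recommended", "review incidents"),
    "missing_service": ("reconcile_required", "redeploy"),
}

# Assumed threshold for what counts as "repeated" deployment failures.
REPEATED_FAILURES = 2


@dataclass(frozen=True)
class Drift:
    kind: str    # one of the keys of RECOMMENDATIONS
    target: str  # node name, "node/service", or incident id


def detect_drift(desired, actual):
    """Compare desired services per node with an observed world state."""
    drifts = []
    nodes = actual.get("nodes", {})
    services = actual.get("services", {})
    # Sorting keeps the output deterministic for the same input state.
    for node, wanted in sorted(desired.items()):
        if nodes.get(node) != "online":
            drifts.append(Drift("offline_node", node))
            continue  # skip per-service checks on an unreachable node
        for name in wanted:
            key = f"{node}/{name}"
            observed = services.get(key)
            if observed is None:
                drifts.append(Drift("missing_service", key))
            elif observed.get("failed_deploys", 0) >= REPEATED_FAILURES:
                drifts.append(Drift("failed_deployment", key))
            elif observed.get("status") != "ok":
                drifts.append(Drift("unhealthy_service", key))
    for incident in actual.get("incidents", []):
        drifts.append(Drift("unresolved_incident", incident))
    return drifts


def recommend(drifts):
    """Turn each drift into a recommendation-only event (no mutation)."""
    events = []
    for drift in drifts:
        event_type, action = RECOMMENDATIONS[drift.kind]
        events.append(
            {"event": event_type, "action": action, "target": drift.target}
        )
    return events


def summary_state(events):
    """Assumed rule: any blocked event is unstable, other drift is degraded."""
    types = {e["event"] for e in events}
    if not types:
        return "nominal"
    if "reconcile_blocked" in types:
        return "unstable"
    return "degraded"


# Example input using made-up node and service names.
desired = {"node-a": ["dns", "proxy"], "node-b": ["backup"]}
actual = {
    "nodes": {"node-a": "online", "node-b": "offline"},
    "services": {
        "node-a/dns": {"status": "ok"},
        "node-a/proxy": {"status": "error"},
    },
    "incidents": [],
}
events = recommend(detect_drift(desired, actual))
print(events, summary_state(events))
```

Because the functions are pure and iterate in sorted order, running them twice on the same input yields the same recommendations, which matches the idempotency property listed above. A real supervisor would additionally append each event to the event log rather than printing it.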