# Runtime Supervisor

The Runtime Supervisor is a reconciliation-oriented component that checks whether the homelab infrastructure matches its desired state. It operates in recommendation-only mode: it detects drift and proposes actions without directly mutating the runtime.

## Reconciliation Loop

The supervisor runs a reconciliation loop periodically or on demand:

1. **Load Desired State**: Reads configuration from the inventory model (`hosts/*/services.yaml`).
2. **Load Actual State**: Reads the current world state from the filesystem-first observer output (`/opt/homelab/world/`).
3. **Detect Drift**: Compares the two states to identify discrepancies.
4. **Recommend Actions**: Uses a recommendation engine to propose remediations.
5. **Emit Events**: Publishes reconciliation events to the platform's event system.
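
The five steps above can be sketched as a single function that wires the stages together. This is a minimal illustration, not the supervisor's actual code: `reconcile_once`, the `State` alias, and the event dictionary shape are all hypothetical, and each stage is passed in as a callable so the sketch makes no claim about how the real loaders or the event system work.

```python
from typing import Any, Callable

State = dict[str, Any]  # hypothetical shape; the real schema is not specified here


def reconcile_once(
    load_desired: Callable[[], State],
    load_actual: Callable[[], State],
    detect_drift: Callable[[State, State], list[str]],
    recommend: Callable[[str], str],
    emit: Callable[[dict], None],
) -> list[dict]:
    """Run one recommendation-only pass and return the events it emitted."""
    desired = load_desired()               # 1. Load Desired State
    actual = load_actual()                 # 2. Load Actual State
    drift = detect_drift(desired, actual)  # 3. Detect Drift
    events = []
    for item in drift:
        # 4. Recommend Actions: propose a remediation, never apply it.
        event = {"drift": item, "action": recommend(item)}
        emit(event)                        # 5. Emit Events
        events.append(event)
    return events
```

Because nothing in the function mutates the runtime, it can run on a timer or on demand without side effects beyond the emitted events.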

## Desired vs Actual State

- **Desired State**: Defined by the user in the inventory. It specifies which services should run on which nodes.
- **Actual State (World State)**: Produced by the observer runtime. It represents the ground truth of what is currently happening in the physical and virtual infrastructure.

The gap between these two states is known as **Drift**.

## Drift Conditions

The supervisor detects the following drift conditions:

- **Missing Service**: A service is defined in the inventory but not found in the world state.
- **Unhealthy Service**: A service is present but reporting a non-ok status.
- **Failed Deployment**: Recent deployment attempts for a service have failed.
- **Offline Node**: A node defined in the inventory is unreachable or inactive.
- **Unresolved Incidents**: Active incidents in the world state that require attention.
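
A rough sketch of how these conditions might be checked follows. The data shapes are assumptions made for illustration only: the desired state is taken to be a mapping of node name to service names, and the world state a dictionary with hypothetical `nodes`, `services`, `status`, `failed_deploys`, and `incidents` keys. The real observer output may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    kind: str     # one of the drift conditions listed above
    subject: str  # node name, "node/service", or incident id


def detect_drift(desired: dict, world: dict) -> list[Drift]:
    """Compare inventory (node -> service names) against a world-state snapshot."""
    found = []
    nodes = world.get("nodes", {})
    for node in sorted(desired):
        node_state = nodes.get(node)
        if node_state is None or node_state.get("status") != "online":
            found.append(Drift("offline_node", node))
            continue  # assumption: do not also report each service on an offline node
        services = node_state.get("services", {})
        for name in sorted(desired[node]):
            svc = services.get(name)
            if svc is None:
                found.append(Drift("missing_service", f"{node}/{name}"))
                continue
            if svc.get("status") != "ok":
                found.append(Drift("unhealthy_service", f"{node}/{name}"))
            if svc.get("failed_deploys", 0) > 0:
                found.append(Drift("failed_deployment", f"{node}/{name}"))
    for incident in world.get("incidents", []):
        if not incident.get("resolved", False):
            found.append(Drift("unresolved_incident", incident["id"]))
    return found
```

Iterating in sorted order keeps the output stable, so the same input snapshot always yields the same list of drift items.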

## Recommendation Engine

Based on the detected drift, the supervisor emits one of three reconciliation event types:

- `reconcile_required`: Immediate action recommended to restore service (e.g., redeploy an unhealthy service).
- `reconcile_recommended`: Action suggested to improve stability or resolve minor issues (e.g., review unresolved incidents).
- `reconcile_blocked`: Action required but blocked by external factors or repeated failures (e.g., repeated deployment failures requiring manual diagnostics).

### Examples:

- **Unhealthy service** -> Recommend `redeploy`.
- **Repeated deployment failures** -> Recommend `diagnostics`.
- **Node offline** -> Recommend `failover review`.
- **Dependency unavailable** -> Recommend `delayed deployment`.
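
One simple way to express the mapping above is a lookup table from drift kind to an event type and action. The sketch below is illustrative: the key names, the `recommend` function, and the fallback are hypothetical. The event types for the first three rows follow the examples given in the text; the event types for an offline node and an unavailable dependency are not stated in this document and are assumptions.

```python
# Drift kind -> (event type, recommended action).
RULES = {
    "unhealthy_service": ("reconcile_required", "redeploy"),
    "repeated_deployment_failures": ("reconcile_blocked", "diagnostics"),
    "unresolved_incidents": ("reconcile_recommended", "review"),
    "node_offline": ("reconcile_required", "failover review"),              # event type assumed
    "dependency_unavailable": ("reconcile_blocked", "delayed deployment"),  # event type assumed
}

# Hypothetical fallback for drift kinds without a rule.
DEFAULT_RULE = ("reconcile_recommended", "manual review")


def recommend(drift_kind: str, subject: str) -> dict:
    """Build a reconciliation event for one drift item without acting on it."""
    event_type, action = RULES.get(drift_kind, DEFAULT_RULE)
    return {"type": event_type, "action": action, "subject": subject}
```

Keeping the rules in plain data makes the engine easy to audit, and a later policy layer could read the same table.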

## Summary States

The supervisor extends the platform's runtime summary states:

- `nominal`: Desired and actual states are in sync.
- `degraded`: Non-critical drift detected; reconciliation required.
- `unstable`: Critical drift or blocked reconciliation detected.
- `reconciling`: (Future) Remediations are actively being applied.
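
The summary state could be derived from the events of a single run along these lines. This is a sketch under stated assumptions: the document does not define which drift counts as critical, so the code relies on a hypothetical `critical` flag on each event, and `summarize` is an illustrative name. The future `reconciling` state is left out because no remediations are applied yet.

```python
def summarize(events: list[dict]) -> str:
    """Collapse one run's reconciliation events into a summary state."""
    if not events:
        return "nominal"  # no drift: desired and actual states are in sync
    for event in events:
        blocked = event.get("type") == "reconcile_blocked"
        if blocked or event.get("critical", False):  # "critical" is an assumed field
            return "unstable"
    return "degraded"  # drift exists, but none of it is critical or blocked
```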

## Future Autonomous Remediation Architecture

While the current stage is recommendation-only, the architecture is designed for future autonomy:

1. **Policy Engine**: A future component will ingest reconciliation recommendations and apply policies to decide whether to auto-remediate.
2. **Executor**: A mutation-capable runtime that will execute the proposed actions (restarts, redeploys, failovers).
3. **Closed-Loop Feedback**: The observer will immediately reflect the results of remediations, allowing the supervisor to verify success or escalate if the drift persists.
4. **Guardrails**: Rate limiting, maintenance windows, and manual overrides will keep autonomous actions safe.
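
Since none of this exists yet, the following is purely a thought experiment about what a guardrail check might look like. Every name, default, and rule here is hypothetical. In particular, the sketch assumes that a maintenance window is a period during which automated actions are suppressed, which the document does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta


@dataclass
class Guardrails:
    max_actions_per_hour: int = 3
    window_start: time = time(1, 0)  # assumed maintenance window: 01:00 to 05:00
    window_end: time = time(5, 0)
    manual_override: bool = False    # True means an operator has paused automation
    _recent: list = field(default_factory=list)  # timestamps of allowed actions

    def allow(self, now: datetime) -> bool:
        """Return True if an automated action may run at `now`, and record it."""
        if self.manual_override:
            return False
        if self.window_start <= now.time() < self.window_end:
            return False  # inside the maintenance window
        cutoff = now - timedelta(hours=1)
        self._recent = [t for t in self._recent if t > cutoff]
        if len(self._recent) >= self.max_actions_per_hour:
            return False  # rate limit reached for the trailing hour
        self._recent.append(now)
        return True
```

A policy engine could call `allow` before handing a recommendation to the executor and fall back to a `reconcile_blocked` event when the check fails.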

## Filesystem-First Design

The supervisor adheres to the platform's filesystem-first philosophy:

- **Input**: Files in `/opt/homelab/world/` and `hosts/`.
- **Output**: An append-only log at `/tmp/agent-events.log`.
- **Persistence**: Checkpoints stored in `/tmp/supervisor-checkpoint.json`.
- **Idempotency**: Every run is independent and produces the same recommendations for the same input state.
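
The output and persistence side of this design might look like the sketch below. It assumes one JSON object per line for the event log and a single JSON document for the checkpoint; neither format is specified in this document, and the function names are hypothetical. Paths are passed as parameters so the sketch can be exercised against a temporary directory instead of `/tmp/agent-events.log` and `/tmp/supervisor-checkpoint.json`.

```python
import json
import os
import tempfile
from pathlib import Path


def append_event(event: dict, log_path: Path) -> None:
    """Append one event as a JSON line; existing lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")


def save_checkpoint(state: dict, path: Path) -> None:
    """Write the checkpoint to a temp file, then atomically swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, sort_keys=True)
    os.replace(tmp_name, path)  # readers see either the old or the new file


def load_checkpoint(path: Path) -> dict:
    """Return the saved checkpoint, or an empty dict on the first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

Writing keys in sorted order means the same input state serializes to the same bytes, which makes repeated runs easy to compare.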