diff --git a/docs/operator/incident-review.md b/docs/operator/incident-review.md
new file mode 100644
index 0000000..3ca7916
--- /dev/null
+++ b/docs/operator/incident-review.md
@@ -0,0 +1,9 @@
+# Incident Review Flow
+
+When an incident occurs:
+1. Identify the incident in the Dashboard or Events view.
+2. Note the severity and the affected node/service.
+3. Switch to the Events view to see the correlation chain.
+4. Look for related events preceding the incident (e.g., latency spikes before a disconnect).
+5. Check the Recommendations view for suggested fixes.
+6. Once resolved, the incident will clear from the Active Incidents list.
diff --git a/docs/operator/reconcile-review.md b/docs/operator/reconcile-review.md
new file mode 100644
index 0000000..8b99c6f
--- /dev/null
+++ b/docs/operator/reconcile-review.md
@@ -0,0 +1,12 @@
+# Reconcile Review Flow
+
+The system continuously monitors for drift between desired and actual state.
+
+1. If a service is in RECONCILING state, check the Services view.
+2. Review the Recommendations view for automated or guarded actions.
+3. For 'safe' actions with high confidence, the system may act autonomously if enabled.
+4. For 'guarded' or 'dangerous' actions, an operator must manually approve the action.
+5. Risk Levels:
+   - **Safe**: Minimal impact, high success rate.
+   - **Guarded**: Potential brief service interruption.
+   - **Dangerous**: Significant impact, potential data loss, or hardware interaction required.
diff --git a/docs/operator/workflow-examples.md b/docs/operator/workflow-examples.md
new file mode 100644
index 0000000..34825ff
--- /dev/null
+++ b/docs/operator/workflow-examples.md
@@ -0,0 +1,13 @@
+# Operator Workflow Examples
+
+## Daily Check
+1. Open the Operator Control Plane.
+2. Check the System Status in the sidebar (should be NOMINAL).
+3. Review the Dashboard for any active incidents.
+4. Switch to Nodes view to ensure all nodes are connected and healthy.
+
+## New Service Deployment
+1. Trigger deployment via CLI/API.
+2. Monitor progress in the Deployments view.
+3. If a stage fails, check the Diagnostics field in the deployment card.
+4. Once stable, verify the service health in the Services view.
diff --git a/docs/runtime-supervisor.md b/docs/runtime-supervisor.md
new file mode 100644
index 0000000..abe2eda
--- /dev/null
+++ b/docs/runtime-supervisor.md
@@ -0,0 +1,70 @@
+# Runtime Supervisor
+
+The Runtime Supervisor is a reconciliation-oriented component responsible for ensuring the homelab infrastructure aligns with its desired state. It operates in a recommendation-only mode, detecting drifts and proposing actions without directly mutating the runtime.
+
+## Reconciliation Loop
+
+The supervisor performs a periodic (or on-demand) reconciliation loop:
+
+1. **Load Desired State**: Reads configuration from the inventory model (`hosts/*/services.yaml`).
+2. **Load Actual State**: Reads the current world state from the filesystem-first observer output (`/opt/homelab/world/`).
+3. **Detect Drift**: Compares the two states to identify discrepancies.
+4. **Recommend Actions**: Uses a recommendation engine to propose remediations.
+5. **Emit Events**: Publishes reconciliation events to the platform's event system.
+
+## Desired vs Actual State
+
+- **Desired State**: Defined by the user in the inventory. It specifies which services should run on which nodes.
+- **Actual State (World State)**: Produced by the observer runtime. It represents the ground truth of what is currently happening in the physical and virtual infrastructure.
+
+The gap between these two states is known as **Drift**.
+
+## Drift Conditions
+
+The supervisor detects the following drift conditions:
+
+- **Missing Service**: A service is defined in the inventory but not found in the world state.
+- **Unhealthy Service**: A service is present but reporting a non-ok status.
+- **Failed Deployment**: Recent deployment attempts for a service have failed.
+- **Offline Node**: A node defined in the inventory is unreachable or inactive.
+- **Unresolved Incidents**: Active incidents in the world state that require attention.
+
+## Recommendation Engine
+
+Based on the detected drift, the supervisor emits one of three reconciliation event types:
+
+- `reconcile_required`: Immediate action recommended to restore service (e.g., redeploy unhealthy service).
+- `reconcile_recommended`: Action suggested to improve stability or resolve minor issues (e.g., review unresolved incidents).
+- `reconcile_blocked`: Action required but blocked by external factors or repeated failures (e.g., repeated deployment failures requiring manual diagnostics).
+
+### Examples:
+- **Unhealthy service** -> Recommend `redeploy`.
+- **Repeated deployment failures** -> Recommend `diagnostics`.
+- **Node offline** -> Recommend `failover review`.
+- **Dependency unavailable** -> Recommend `delayed deployment`.
+
+## Summary States
+
+The supervisor extends the platform's runtime summary states:
+
+- `nominal`: Desired and actual states are in sync.
+- `degraded`: Non-critical drift detected; reconciliation required.
+- `unstable`: Critical drift or blocked reconciliation detected.
+- `reconciling`: (Future) Remediations are actively being applied.
+
+## Future Autonomous Remediation Architecture
+
+While the current stage is recommendation-only, the architecture is designed for future autonomy:
+
+1. **Policy Engine**: A future component will ingest reconciliation recommendations and apply policies to decide whether to auto-remediate.
+2. **Executor**: A mutation-capable runtime that will execute the proposed actions (restarts, redeploys, failovers).
+3. **Closed-Loop Feedback**: The observer will immediately reflect the results of remediations, allowing the supervisor to verify success or escalate if the drift persists.
+4. **Guardrails**: Implementation of rate-limiting, maintenance windows, and manual overrides to ensure autonomous actions remain safe.
+
+## Filesystem-First Design
+
+The supervisor adheres to the platform's filesystem-first philosophy:
+- **Input**: Files in `/opt/homelab/world/` and `hosts/`.
+- **Output**: Append-only logs in `/tmp/agent-events.log`.
+- **Persistence**: Checkpoints stored in `/tmp/supervisor-checkpoint.json`.
+- **Idempotency**: Every run is independent and produces the same recommendations for the same input state.
diff --git a/webui/index.html b/webui/index.html
index 150d978..d720307 100644
--- a/webui/index.html
+++ b/webui/index.html
@@ -3,558 +3,495 @@
-