Oskar Kapala 3cff4db8f3 Refactor UI to Operator Control Plane and add operator documentation

Co-authored-by: Junie <junie@jetbrains.com>

2026-05-12 17:30:00 +02:00

3.9 KiB

Runtime Supervisor

The Runtime Supervisor is a reconciliation-oriented component responsible for keeping the homelab infrastructure aligned with its desired state. It operates in recommendation-only mode: it detects drift and proposes actions without directly mutating the runtime.

Reconciliation Loop

The supervisor runs a reconciliation loop, either periodically or on demand:

Load Desired State: Reads configuration from the inventory model (hosts/*/services.yaml).
Load Actual State: Reads the current world state from the filesystem-first observer output (/opt/homelab/world/).
Detect Drift: Compares the two states to identify discrepancies.
Recommend Actions: Uses a recommendation engine to propose remediations.
Emit Events: Publishes reconciliation events to the platform's event system.
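
The five steps above can be sketched as a single function. This is a minimal illustration, not the supervisor's actual code: `reconcile_once` and all of its parameter names are hypothetical, and the loaders, drift detector, recommender, and emitter are injected as plain callables so the control flow stays visible.

```python
from typing import Any, Callable, Iterable


def reconcile_once(
    load_desired: Callable[[], Any],
    load_actual: Callable[[], Any],
    detect_drift: Callable[[Any, Any], Iterable[Any]],
    recommend: Callable[[Any], dict],
    emit: Callable[[dict], None],
) -> list[dict]:
    """Run one pass of the loop and return the events that were emitted."""
    desired = load_desired()                     # 1. Load Desired State
    actual = load_actual()                       # 2. Load Actual State
    drifts = detect_drift(desired, actual)       # 3. Detect Drift
    events = [recommend(d) for d in drifts]      # 4. Recommend Actions
    for event in events:                         # 5. Emit Events
        emit(event)
    return events


# Toy usage with in-memory stand-ins (all values are made up).
emitted: list[dict] = []
events = reconcile_once(
    load_desired=lambda: {"dns", "git"},
    load_actual=lambda: {"dns"},
    detect_drift=lambda desired, actual: sorted(desired - actual),
    recommend=lambda name: {"type": "reconcile_required", "subject": name},
    emit=emitted.append,
)
```

Nothing in this pass mutates the runtime; the only side effect is whatever `emit` does, which matches the recommendation-only mode described earlier.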

Desired vs Actual State

Desired State: Defined by the user in the inventory. It specifies which services should run on which nodes.
Actual State (World State): Produced by the observer runtime. It represents the ground truth of what is currently happening in the physical and virtual infrastructure.

The gap between these two states is known as Drift.

Drift Conditions

The supervisor detects the following drift conditions:

Missing Service: A service is defined in the inventory but not found in the world state.
Unhealthy Service: A service is present but reporting a non-ok status.
Failed Deployment: Recent deployment attempts for a service have failed.
Offline Node: A node defined in the inventory is unreachable or inactive.
Unresolved Incidents: Active incidents in the world state that require attention.
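
As a rough sketch of how these five conditions might be checked, the function below compares two plain dictionaries. The data shapes are assumptions made for illustration only: the real layout of `services.yaml` and of the files under `/opt/homelab/world/` is not specified here, and the names `Drift`, `detect_drift`, and the `"node/service"` key format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    kind: str     # e.g. "missing_service"
    subject: str  # node name, "node/service" key, or incident id


def detect_drift(desired: dict, world: dict) -> list[Drift]:
    """Compare desired state against world state (both shapes are assumed).

    desired: {node: [service, ...]}
    world:   {"nodes": {node: {"online": bool}},
              "services": {"node/service": {"status": str}},
              "deployments": {"node/service": ["ok" | "failed", ...]},
              "incidents": [{"id": str, "resolved": bool}]}
    """
    drifts: list[Drift] = []
    nodes = world.get("nodes", {})
    services = world.get("services", {})
    deployments = world.get("deployments", {})

    for node, wanted in sorted(desired.items()):
        # Offline Node: absent from the world state or not reported online.
        if not nodes.get(node, {}).get("online", False):
            drifts.append(Drift("offline_node", node))
        for name in wanted:
            key = f"{node}/{name}"
            service = services.get(key)
            if service is None:
                drifts.append(Drift("missing_service", key))
            elif service.get("status") != "ok":
                drifts.append(Drift("unhealthy_service", key))
            # Failed Deployment: here, the most recent attempt failed.
            history = deployments.get(key, [])
            if history and history[-1] == "failed":
                drifts.append(Drift("failed_deployment", key))

    for incident in world.get("incidents", []):
        if not incident.get("resolved", False):
            drifts.append(Drift("unresolved_incident", incident["id"]))
    return drifts
```

Sorting the nodes keeps the output order stable for a given input, which helps when two runs over the same state are compared.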

Recommendation Engine

Based on the detected drift, the supervisor emits one of three reconciliation event types:

reconcile_required: Immediate action recommended to restore service (e.g., redeploy an unhealthy service).
reconcile_recommended: Action suggested to improve stability or resolve minor issues (e.g., review unresolved incidents).
reconcile_blocked: Action required but blocked by external factors or repeated failures (e.g., repeated deployment failures requiring manual diagnostics).

Examples:

Unhealthy service -> Recommend redeploy.
Repeated deployment failures -> Recommend diagnostics.
Node offline -> Recommend failover review.
Dependency unavailable -> Recommend delayed deployment.
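
One way to picture the engine is as a lookup table from drift kind to event type and suggested action. The sketch below is illustrative only. The three event type names come from this document, as do the pairings for unhealthy services, repeated deployment failures, and unresolved incidents. The event types chosen for an offline node and an unavailable dependency are assumptions, as are the drift-kind strings, the `RULES` table, and the `recommend` function.

```python
# (event type, suggested action) per drift kind.
RULES: dict[str, tuple[str, str]] = {
    "unhealthy_service": ("reconcile_required", "redeploy"),
    "repeated_deployment_failures": ("reconcile_blocked", "run diagnostics"),
    "unresolved_incident": ("reconcile_recommended", "review incidents"),
    # Assumption: the document does not say which event type these two use.
    "offline_node": ("reconcile_required", "review failover"),
    "dependency_unavailable": ("reconcile_blocked", "delay deployment"),
}

# Assumption: unknown drift kinds fall back to a low-urgency manual review.
DEFAULT_RULE = ("reconcile_recommended", "manual review")


def recommend(kind: str, subject: str) -> dict:
    """Build a reconciliation event for one detected drift."""
    event_type, action = RULES.get(kind, DEFAULT_RULE)
    return {
        "type": event_type,
        "subject": subject,
        "action": action,
        "reason": kind,
    }


event = recommend("unhealthy_service", "alpha/git")
```

Keeping the rules in data rather than in branching code makes it easy to review which drift leads to which recommendation.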

Summary States

The supervisor extends the platform's runtime summary states:

nominal: Desired and actual states are in sync.
degraded: Non-critical drift detected; reconciliation required.
unstable: Critical drift or blocked reconciliation detected.
reconciling: (Future) Remediations are actively being applied.
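
A summary state could be derived from the events of a single run roughly as follows. This is a sketch under assumptions: the document does not define which drift counts as critical, so the hypothetical `critical` flag on each event is left for the caller to set, and `summary_state` is an illustrative name. The future `reconciling` state is not modelled because nothing applies remediations yet.

```python
def summary_state(events: list[dict]) -> str:
    """Collapse one run's reconciliation events into a summary state."""
    if not events:
        # No drift detected: desired and actual states are in sync.
        return "nominal"
    for event in events:
        blocked = event.get("type") == "reconcile_blocked"
        if blocked or event.get("critical", False):
            # Critical drift or blocked reconciliation.
            return "unstable"
    # Drift exists, but none of it is critical or blocked.
    return "degraded"
```

For example, a run that emits only `reconcile_recommended` events yields `degraded`, while a single `reconcile_blocked` event is enough to yield `unstable`.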

Future Autonomous Remediation Architecture

While the current stage is recommendation-only, the architecture is designed for future autonomy:

Policy Engine: A future component will ingest reconciliation recommendations and apply policies to decide whether to auto-remediate.
Executor: A mutation-capable runtime that will execute the proposed actions (restarts, redeploys, failovers).
Closed-Loop Feedback: The observer will immediately reflect the results of remediations, allowing the supervisor to verify success or escalate if the drift persists.
Guardrails: Rate limiting, maintenance windows, and manual overrides to keep autonomous actions safe.

Filesystem-First Design

The supervisor adheres to the platform's filesystem-first philosophy:

Input: Files in /opt/homelab/world/ and hosts/.
Output: An append-only event log at /tmp/agent-events.log.
Persistence: Checkpoints stored in /tmp/supervisor-checkpoint.json.
Idempotency: Every run is independent and produces the same recommendations for the same input state.
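
The output side of this design can be sketched with the standard library alone. The two paths come from this document; everything else is an assumption made for illustration, including the one-JSON-object-per-line log format, the checkpoint contents, and the function names `append_events` and `write_checkpoint`.

```python
import json
import os
from pathlib import Path

EVENT_LOG = Path("/tmp/agent-events.log")
CHECKPOINT = Path("/tmp/supervisor-checkpoint.json")


def append_events(events: list[dict], log_path: Path = EVENT_LOG) -> None:
    """Append events to the log, one JSON object per line (assumed format)."""
    # Mode "a" only ever adds to the end of the file; earlier lines stay intact.
    with open(log_path, "a", encoding="utf-8") as handle:
        for event in events:
            handle.write(json.dumps(event, sort_keys=True) + "\n")


def write_checkpoint(state: dict, path: Path = CHECKPOINT) -> None:
    """Replace the checkpoint file in one step via a temporary file."""
    tmp_path = Path(str(path) + ".tmp")
    tmp_path.write_text(json.dumps(state, sort_keys=True), encoding="utf-8")
    # os.replace swaps the file in atomically when both paths are on the
    # same filesystem, so readers never see a half-written checkpoint.
    os.replace(tmp_path, path)
```

Serializing with `sort_keys=True` means identical events produce byte-identical log lines, which makes it simple to check that two runs over the same input state produced the same recommendations.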
