# VPS Control Plane

The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.

## Architecture

The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:

| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |

All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).

## Supervisor Behavior

### Desired State

Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped; use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).

### Drift Types

- `missing_service`: service is in desired state but absent from `services.json`
- `unhealthy_service`: service exists in `services.json` but `status != healthy`

### Action Types

| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |

### Action ID Stability

Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, which keeps reconciliation idempotent across supervisor restarts.

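The trigger table and the ID scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the supervisor's actual code: the function names `action_type_for` and `action_id` are hypothetical, the node/service pair in the usage note is only an example, and `disk_cleanup` is left out because the text does not give its ID format.

```python
# Hypothetical helpers; only the trigger table and the two ID formats
# come from the documentation above.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def action_type_for(trigger: str) -> tuple[str, str]:
    """Return (action_type, risk) for a service-level drift trigger."""
    if trigger in RESTART_TRIGGERS:
        return ("container_restart", "low")
    # Any other or unknown trigger falls back to a guarded redeploy.
    return ("redeploy", "guarded")


def action_id(action_type: str, node: str, service: str) -> str:
    """Build a deterministic ID: the same drift always yields the same string."""
    prefix = action_type.replace("_", "-")  # container_restart -> container-restart
    return f"{prefix}-{node}-{service}"
```

For example, `action_id("redeploy", "piha", "agent-system")` always yields `redeploy-piha-agent-system`, so a restarted supervisor would target the same filename again rather than queue a duplicate.
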
### Auto-Cancel

Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:

- **`drift_resolved_auto`**: the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`**: the service was removed from `services.yaml` or marked `monitor: false`

Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.

### Node Name Resolution

The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:

```bash
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```

## Deployment

### From SATURN (primary control node)

```bash
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh

# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```

### Direct on VPS

```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```

`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.

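To decide whether the directory setup can be skipped, a small check along these lines may help. It is a sketch under assumptions: the directory list is collected from paths mentioned elsewhere in this document and may not match exactly what `deploy-local.sh` creates, and `check_layout` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Assumed layout, gathered from paths named in this document;
# the real deploy script may create more or fewer directories.
EXPECTED_DIRS = [
    "events",
    "world",
    "state",
    "actions/pending",
    "actions/approved",
    "actions/running",
    "actions/completed",
    "actions/failed",
    "actions/cancelled",
]


def check_layout(root: Path, uid: int = 1000) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        if not path.is_dir():
            problems.append(f"missing: {path}")
        elif path.stat().st_uid != uid:
            problems.append(f"wrong owner (uid {path.stat().st_uid}): {path}")
    return problems


if __name__ == "__main__":
    for problem in check_layout(Path("/opt/homelab")):
        print(problem)
```

If the script prints nothing, every listed directory exists and is owned by UID 1000.
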
### Verification

```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```

## Action Approval Workflow

```
Supervisor writes → /opt/homelab/actions/pending/.json
  → Operator UI (port 18180) or Telegram Bot notifies
  → Operator clicks Approve → /opt/homelab/actions/approved/.json
  → Executor executes → completed / failed
```

Possible action states: `pending → approved → running → completed / failed / rejected`

Auto-cancel path: `pending → cancelled/`

## Recovery

### World state is stale or corrupt

```bash
# On VPS: delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```

### Flood of pending actions after bootstrap

Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.

```bash
# Check node-agent on each node
ssh oskar@ "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```

### Rebuild from scratch

```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```

## Integration

### piha agent-system webui (port 18180 on piha)

The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.

Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.

### Nginx Proxy Manager

The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.

### Log Locations

- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
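
Because the action queue is just a set of directories, a quick per-state count can show whether actions are piling up. The sketch below assumes each action is a single `.json` file inside its state directory, as the approval workflow suggests; `queue_summary` is a hypothetical helper, not something the control plane ships.

```python
from pathlib import Path

# State directories as listed under "Action queue" above.
STATES = ["pending", "approved", "running", "completed", "failed", "cancelled"]


def queue_summary(actions_root: Path) -> dict[str, int]:
    """Count the *.json action files in each state directory (0 if the directory is absent)."""
    summary = {}
    for state in STATES:
        state_dir = actions_root / state
        if state_dir.is_dir():
            summary[state] = len(list(state_dir.glob("*.json")))
        else:
            summary[state] = 0
    return summary


if __name__ == "__main__":
    for state, count in queue_summary(Path("/opt/homelab/actions")).items():
        print(f"{state:10s} {count}")
```

A steadily growing `pending` count alongside an empty `approved` directory would point at the flood scenario described under Recovery rather than at an executor problem.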