VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: 100.95.58.48) and provides observability, automated reconciliation, and a web-based operator interface.
Architecture
The control plane consists of four core services running as a Docker Compose stack under services/control-plane/:
| Container | Role |
|---|---|
| control-plane-observer | Synthesizes world state from events in /opt/homelab/events/ |
| control-plane-supervisor | Detects drift between desired state (hosts/*/services.yaml) and actual state (world/services.json); writes pending actions |
| control-plane-executor | Executes approved actions from /opt/homelab/actions/approved/ |
| control-plane-ui | Web interface for system monitoring and action approval; serves port 18180 |
All services use filesystem-first semantics with /opt/homelab/ as the data exchange layer. All four run with network_mode: host and as UID 1000 (homelab user).
Supervisor Behavior
Desired State
Loaded from hosts/*/services.yaml each reconcile cycle. Services with monitor: false are silently skipped — use this for services without a node-agent (e.g. homeassistant on chelsty-ha).
Drift Types
- missing_service: service is in desired state but absent from services.json
- unhealthy_service: service exists in services.json but status != healthy
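The two drift types, together with the monitor: false rule, can be sketched as a small function. This is only an illustration: the data shapes (a mapping from service name to a config dict for desired state, and a mapping from service name to a dict with a status field for actual state) and the name detect_drift are assumptions, not the supervisor's real code or file layout.

```python
def detect_drift(desired: dict, actual: dict) -> list:
    """Return (service, drift_type) pairs for one node.

    Assumed shapes (hypothetical, for illustration only):
      desired = {"mosquitto": {}, "homeassistant": {"monitor": False}}
      actual  = {"mosquitto": {"status": "healthy"}}
    """
    drift = []
    for name, spec in sorted(desired.items()):
        # Services marked monitor: false are skipped entirely.
        if spec.get("monitor", True) is False:
            continue
        entry = actual.get(name)
        if entry is None:
            drift.append((name, "missing_service"))
        elif entry.get("status") != "healthy":
            drift.append((name, "unhealthy_service"))
    return drift


desired = {
    "mosquitto": {},
    "zigbee2mqtt": {},
    "homeassistant": {"monitor": False},
}
actual = {"mosquitto": {"status": "unhealthy"}}

for service, drift_type in detect_drift(desired, actual):
    print(service, drift_type)
```

In this example homeassistant produces no drift even though it is absent from the actual state, because it is excluded from monitoring.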
Action Types
| Trigger | Action type | Risk |
|---|---|---|
| containers_not_running, mqtt_unreachable | container_restart | low |
| Any other / unknown | redeploy | guarded |
| Node disk_pressure: high | disk_cleanup | guarded |
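The mapping in the table above can be expressed as a lookup with a fallback. The function names and the returned dict shape are hypothetical and chosen only for illustration; the trigger names, action types, and risk levels are taken from the table.

```python
from typing import Optional

# Triggers that only need a container restart (low risk).
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def plan_service_action(trigger: Optional[str]) -> dict:
    """Choose an action for a drifted service based on its trigger."""
    if trigger in RESTART_TRIGGERS:
        return {"type": "container_restart", "risk": "low"}
    # Any other or unknown trigger falls back to a guarded redeploy.
    return {"type": "redeploy", "risk": "guarded"}


def plan_node_action(disk_pressure: str) -> Optional[dict]:
    """Choose a node-level action; only high disk pressure triggers one."""
    if disk_pressure == "high":
        return {"type": "disk_cleanup", "risk": "guarded"}
    return None


print(plan_service_action("mqtt_unreachable"))
print(plan_service_action("something_unexpected"))
print(plan_node_action("high"))
```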
Action ID Stability
Action IDs are deterministic: redeploy-{node}-{service} or container-restart-{node}-{service}. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
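A minimal sketch of why deterministic IDs make reconciliation idempotent. The ID format follows the description above; the write-only-if-absent step and the helper names action_id and write_pending are assumptions about how a supervisor might use such IDs, not a description of the actual implementation.

```python
import json
import tempfile
from pathlib import Path


def action_id(action_type: str, node: str, service: str) -> str:
    """Build a stable ID, e.g. container_restart -> container-restart-{node}-{service}."""
    return f"{action_type.replace('_', '-')}-{node}-{service}"


def write_pending(pending_dir: Path, action: dict) -> bool:
    """Write the action file unless one with the same ID already exists."""
    path = pending_dir / f"{action['id']}.json"
    if path.exists():
        return False  # same drift, same filename: nothing new to queue
    path.write_text(json.dumps(action, indent=2))
    return True


with tempfile.TemporaryDirectory() as tmp:
    pending = Path(tmp)
    action = {
        "id": action_id("redeploy", "chelsty-infra", "zigbee2mqtt"),
        "type": "redeploy",
    }
    first = write_pending(pending, action)
    second = write_pending(pending, action)  # simulates a second reconcile cycle
    print(action["id"], first, second)
```

Running two reconcile cycles over the same drift leaves exactly one file in the pending directory.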
Auto-Cancel
Pending redeploy and container_restart actions are automatically moved to cancelled/ when:
- drift_resolved_auto: the service becomes healthy in actual state
- service_removed_from_desired_state: the service was removed from services.yaml or marked monitor: false
Only pending actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
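The auto-cancel rules can be sketched as a single decision function. The action fields (state, type, service) and the desired/actual data shapes are hypothetical; only the two reason strings and the pending-only rule come from the description above.

```python
from typing import Optional

CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def auto_cancel_reason(action: dict, desired: dict, actual: dict) -> Optional[str]:
    """Return a cancel reason for a pending action, or None to keep it."""
    # Approved and running actions are never cancelled automatically.
    if action.get("state") != "pending":
        return None
    if action.get("type") not in CANCELLABLE_TYPES:
        return None
    service = action["service"]
    spec = desired.get(service)
    if spec is None or spec.get("monitor", True) is False:
        return "service_removed_from_desired_state"
    if actual.get(service, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None


desired = {"mosquitto": {}, "homeassistant": {"monitor": False}}
actual = {"mosquitto": {"status": "healthy"}}

pending = {"state": "pending", "type": "redeploy", "service": "mosquitto"}
approved = {"state": "approved", "type": "redeploy", "service": "mosquitto"}
print(auto_cancel_reason(pending, desired, actual))
print(auto_cancel_reason(approved, desired, actual))
```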
Node Name Resolution
The supervisor supports a NODE_ALIAS_MAP environment variable (JSON string) to map event/world-state node names to canonical topology names:
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
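For illustration, this is one way such a variable could be parsed and applied. The helper names load_alias_map and canonical_node are hypothetical, and the behavior for unset or unknown names (pass the name through unchanged) is an assumption rather than documented supervisor behavior.

```python
import json
import os


def load_alias_map(env=None) -> dict:
    """Parse NODE_ALIAS_MAP (a JSON object) from the environment."""
    env = os.environ if env is None else env
    raw = env.get("NODE_ALIAS_MAP", "")
    if not raw:
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return parsed


def canonical_node(name: str, aliases: dict) -> str:
    """Map an event/world-state node name to its canonical topology name."""
    return aliases.get(name, name)


aliases = load_alias_map(
    {"NODE_ALIAS_MAP": '{"node-2": "chelsty-infra", "node-1": "piha"}'}
)
print(canonical_node("node-2", aliases))
print(canonical_node("vps", aliases))
```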
Deployment
From SATURN (primary control node)
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh
# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
Direct on VPS
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
The deploy-local.sh script also creates the required /opt/homelab/ directory structure and sets ownership to UID 1000 (requires sudo). If the directories already exist, the docker compose step above is sufficient.
Verification
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
Action Approval Workflow
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
Possible action states: pending → approved → running → completed / failed / rejected
Auto-cancel path: pending → cancelled/
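The states above can be written down as an explicit transition table. This sketch assumes that rejected is reached from pending (an operator declining an action); the source lists it only as a possible end state, so treat that edge as a guess. The names TRANSITIONS, can_transition, and is_terminal are illustrative.

```python
# Allowed transitions between action states (the rejected edge is an assumption).
TRANSITIONS = {
    "pending": {"approved", "rejected", "cancelled"},
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_transition(current: str, target: str) -> bool:
    """Return True if an action may move from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(state: str) -> bool:
    """Terminal states have no outgoing transitions."""
    return state not in TRANSITIONS


print(can_transition("pending", "approved"))
print(can_transition("approved", "cancelled"))
print(is_terminal("completed"))
```

The second call returns False, which matches the rule that only pending actions can end up in cancelled/.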
Recovery
World state is stale or corrupt
# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
Flood of pending actions after bootstrap
Check if node-agent is running and emitting service_healthy events on each node. Without service_healthy, the supervisor sees all services as missing and queues redeployments every cycle.
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
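To gauge the size of a flood before and after fixing the node-agent, the pending queue can be counted by action type. This is a hypothetical helper, not part of the repository; it relies only on the documented ID prefixes (redeploy- and container-restart-) and assumes one JSON file per pending action.

```python
from collections import Counter
from pathlib import Path


def pending_by_type(pending_dir: Path) -> Counter:
    """Count pending action files by the action type encoded in the filename."""
    counts = Counter()
    for path in pending_dir.glob("*.json"):
        stem = path.stem
        if stem.startswith("container-restart-"):
            counts["container_restart"] += 1
        elif stem.startswith("redeploy-"):
            counts["redeploy"] += 1
        else:
            counts["other"] += 1
    return counts


pending = Path("/opt/homelab/actions/pending")
if pending.is_dir():
    print(pending_by_type(pending))
```

A redeploy count that keeps growing across reconcile cycles is consistent with services being seen as missing rather than unhealthy.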
Rebuild from scratch
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
Integration
piha agent-system webui (port 18180 on piha)
The agent-system-runtime-materializer on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local /opt/homelab/world/. This ensures the "Copy for AI" button in the piha webui (agent-system-webui) reflects the same clean state as the VPS API.
Override: hosts/piha/runtime/agent-system/docker-compose.override.yml — sets CONTROL_PLANE_URL=http://100.95.58.48:18180.
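A mirroring loop of this kind typically writes each file atomically so that readers never see a half-written world state. The sketch below shows one polling step under stated assumptions: fetch is an injected callable standing in for the HTTP call to the control-plane API (whose endpoints are not specified here), and the one-file-per-key layout is hypothetical. It does not describe the materializer's actual code.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def mirror_once(fetch: Callable[[], dict], world_dir: Path) -> list:
    """Fetch world state once and write each top-level key to <key>.json."""
    world_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, payload in sorted(fetch().items()):
        target = world_dir / f"{name}.json"
        # Write to a temp file in the same directory, then rename atomically.
        fd, tmp = tempfile.mkstemp(dir=world_dir, suffix=".tmp")
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp, target)
        written.append(target.name)
    return written


def fake_fetch() -> dict:
    """Stand-in for the real API call; returns made-up sample data."""
    return {"services": {"piha": {"mosquitto": {"status": "healthy"}}}}


with tempfile.TemporaryDirectory() as tmp_dir:
    print(mirror_once(fake_fetch, Path(tmp_dir) / "world"))
```

A real poller would call a step like this in a loop with a 10-second sleep and replace fake_fetch with a request to CONTROL_PLANE_URL.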
Nginx Proxy Manager
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
Log Locations
- Container logs: docker compose logs -f (from services/control-plane/)
- Runtime events: /opt/homelab/events/YYYY-MM-DD/
- World state: /opt/homelab/world/
- Action queue: /opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/