homelab-codex-ws/docs/vps-control-plane.md
Oskar Kapala 603e10a364 docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:18:31 +02:00


VPS Control Plane

The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: 100.95.58.48) and provides observability, automated reconciliation, and a web-based operator interface.

Architecture

The control plane consists of four core services running as a Docker Compose stack under services/control-plane/:

| Container | Role |
| --- | --- |
| control-plane-observer | Synthesizes world state from events in /opt/homelab/events/ |
| control-plane-supervisor | Detects drift between desired state (hosts/*/services.yaml) and actual state (world/services.json); writes pending actions |
| control-plane-executor | Executes approved actions from /opt/homelab/actions/approved/ |
| control-plane-ui | Web interface for system monitoring and action approval; serves port 18180 |

All services use filesystem-first semantics with /opt/homelab/ as the data exchange layer. All four run with network_mode: host and as UID 1000 (homelab user).

Supervisor Behavior

Desired State

Desired state is loaded from hosts/*/services.yaml on each reconcile cycle. Services marked monitor: false are skipped silently; use this for services without a node-agent (e.g. homeassistant on chelsty-ha).

Drift Types

  • missing_service — service is in desired state but absent from services.json
  • unhealthy_service — service exists in services.json but status != healthy
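The two drift types can be illustrated with a minimal sketch. It is not the supervisor's actual code: the "node/service" key layout, the function name detect_drift, and the field names monitor and status are assumptions chosen to match the description above.

```python
from typing import Dict, List


def detect_drift(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[dict]:
    """Compare desired services against actual world state.

    Both arguments map a hypothetical "node/service" key to a dict.
    """
    drifts = []
    for key, spec in sorted(desired.items()):
        # Services marked monitor: false are skipped without reporting drift.
        if spec.get("monitor", True) is False:
            continue
        state = actual.get(key)
        if state is None:
            # Present in desired state, absent from services.json.
            drifts.append({"service": key, "type": "missing_service"})
        elif state.get("status") != "healthy":
            # Present in services.json, but not reported healthy.
            drifts.append({"service": key, "type": "unhealthy_service"})
    return drifts
```

A service that is present and healthy produces no entry, so an empty list means desired and actual state agree.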

Action Types

| Trigger | Action type | Risk |
| --- | --- | --- |
| containers_not_running, mqtt_unreachable | container_restart | low |
| Any other / unknown | redeploy | guarded |
| Node disk_pressure: high | disk_cleanup | guarded |
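The mapping in the table can be read as two small lookups, one for service-level triggers and one for node-level disk pressure. The sketch below is illustrative only; the function names service_action and node_action are hypothetical and the real supervisor may structure this differently.

```python
from typing import Optional, Tuple

# Triggers that only need the container restarted (low risk).
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def service_action(trigger: Optional[str]) -> Tuple[str, str]:
    """Return (action_type, risk) for a service-level trigger."""
    if trigger in RESTART_TRIGGERS:
        return ("container_restart", "low")
    # Any other or unknown trigger falls back to a guarded redeploy.
    return ("redeploy", "guarded")


def node_action(disk_pressure: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (action_type, risk) for a node, or None if no action applies."""
    if disk_pressure == "high":
        return ("disk_cleanup", "guarded")
    return None
```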

Action ID Stability

Action IDs are deterministic: redeploy-{node}-{service} or container-restart-{node}-{service}. Because the same drift always produces the same filename, reconciliation is idempotent across supervisor restarts.
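A rough sketch of how a deterministic ID keeps the pending queue free of duplicates is shown below. The helper names action_id and write_pending and the JSON fields written to the file are assumptions for illustration; only the ID patterns come from this document.

```python
import json
from pathlib import Path


def action_id(action_type: str, node: str, service: str) -> str:
    """Build the ID, e.g. container_restart -> container-restart-{node}-{service}."""
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"


def write_pending(pending_dir: Path, action_type: str, node: str, service: str) -> bool:
    """Write the action file unless one with the same ID already exists.

    Returns True if a new file was created, False if it was already there.
    """
    path = pending_dir / f"{action_id(action_type, node, service)}.json"
    if path.exists():
        return False
    # Field names here are hypothetical, not the real action schema.
    payload = {"type": action_type, "node": node, "service": service}
    path.write_text(json.dumps(payload))
    return True
```

Calling write_pending twice for the same drift leaves exactly one file in the directory, which is the property the deterministic filename is meant to provide.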

Auto-Cancel

Pending redeploy and container_restart actions are automatically moved to cancelled/ when:

  • drift_resolved_auto — the service becomes healthy in actual state
  • service_removed_from_desired_state — the service was removed from services.yaml or marked monitor: false

Only pending actions are auto-cancelled. Approved and running actions reflect an operator decision and are never cancelled automatically.
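The auto-cancel rules can be summarized as a single decision function. This is a sketch under assumptions: the action dict layout (state, type, node, service), the "node/service" key, and the name cancel_reason are hypothetical; the two reason strings are the ones listed above.

```python
from typing import Dict, Optional

# Only these action types are eligible for auto-cancel.
AUTO_CANCELLABLE = {"redeploy", "container_restart"}


def cancel_reason(action: dict, desired: Dict[str, dict], actual: Dict[str, dict]) -> Optional[str]:
    """Return the auto-cancel reason for an action, or None to leave it alone."""
    # Approved and running actions are never touched.
    if action.get("state") != "pending":
        return None
    if action.get("type") not in AUTO_CANCELLABLE:
        return None
    key = f'{action["node"]}/{action["service"]}'
    spec = desired.get(key)
    if spec is None or spec.get("monitor", True) is False:
        return "service_removed_from_desired_state"
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None
```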

Node Name Resolution

The supervisor supports a NODE_ALIAS_MAP environment variable (JSON string) to map event/world-state node names to canonical topology names:

NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
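As an illustration of how such a variable might be consumed, the sketch below parses the JSON string and falls back to the original name when no alias is defined. The function names load_alias_map and canonical_node are hypothetical, and the fallback behavior for unmapped names is an assumption.

```python
import json
import os
from typing import Dict, Mapping


def load_alias_map(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Parse NODE_ALIAS_MAP (a JSON object) into a dict; empty if unset."""
    raw = env.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(k): str(v) for k, v in parsed.items()}


def canonical_node(name: str, aliases: Mapping[str, str]) -> str:
    """Map an event/world-state node name to its canonical topology name."""
    # Assumption: names without an alias are already canonical.
    return aliases.get(name, name)
```

With the example value above, canonical_node("node-2", aliases) yields "chelsty-infra", while a name such as "vps" passes through unchanged.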

Deployment

From SATURN (primary control node)

# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh

# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"

Direct on VPS

cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate

The deploy-local.sh script additionally creates the required /opt/homelab/ directory structure and sets ownership to UID 1000 (requires sudo). If the directories already exist, run the docker compose command above directly.

Verification

# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool

Action Approval Workflow

Supervisor writes → /opt/homelab/actions/pending/<id>.json
                 → Operator UI (port 18180) or Telegram Bot notifies
                 → Operator clicks Approve
                 → /opt/homelab/actions/approved/<id>.json
                 → Executor executes → completed / failed

Possible action states: pending → approved → running → completed / failed. An action the operator declines ends in rejected instead of approved.
Auto-cancel path: pending → cancelled/
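The state flow can be expressed as a small transition table. This is a sketch, not the executor's implementation: the name can_transition is hypothetical, and placing rejected as an exit from pending is an assumption, since the document lists the state without saying where it branches off.

```python
from typing import Dict, Set

# Allowed transitions, keyed by current state.
TRANSITIONS: Dict[str, Set[str]] = {
    # "rejected" from "pending" is an assumption; "cancelled" is the auto-cancel path.
    "pending": {"approved", "rejected", "cancelled"},
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from current to target is allowed."""
    # Terminal states have no entry, so nothing is allowed from them.
    return target in TRANSITIONS.get(current, set())
```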

Recovery

World state is stale or corrupt

# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer

Flood of pending actions after bootstrap

Check whether node-agent is running and emitting service_healthy events on each node. Without service_healthy events, the supervisor sees all services as missing and queues redeployments every cycle.

# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"

Rebuild from scratch

ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"

Integration

piha agent-system webui (port 18180 on piha)

The agent-system-runtime-materializer on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local /opt/homelab/world/. This ensures the "Copy for AI" button in the piha webui (agent-system-webui) reflects the same clean state as the VPS API.

Override: hosts/piha/runtime/agent-system/docker-compose.override.yml — sets CONTROL_PLANE_URL=http://100.95.58.48:18180.

Nginx Proxy Manager

The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.

Log Locations

  • Container logs: docker compose logs -f (from services/control-plane/)
  • Runtime events: /opt/homelab/events/YYYY-MM-DD/
  • World state: /opt/homelab/world/
  • Action queue: /opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/
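For a quick overview of the action queue, a short script can count the JSON files in each state directory. This is a sketch that assumes one .json file per action, as in the approval workflow above; the function name queue_counts is hypothetical.

```python
from pathlib import Path
from typing import Dict

# State directories listed under the action queue root.
STATES = ("pending", "approved", "running", "completed", "failed", "cancelled")


def queue_counts(root: Path) -> Dict[str, int]:
    """Count *.json action files in each state directory under root."""
    counts = {}
    for state in STATES:
        directory = root / state
        # A missing directory is reported as zero rather than raising.
        if directory.is_dir():
            counts[state] = sum(1 for _ in directory.glob("*.json"))
        else:
            counts[state] = 0
    return counts


if __name__ == "__main__":
    for state, count in queue_counts(Path("/opt/homelab/actions")).items():
        print(f"{state}: {count}")
```

An unusually high pending count shortly after bootstrap is consistent with the flood symptom described in the Recovery section.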