node-agent had no USER instruction in its Dockerfile and no user: key in
compose, so it ran as root and wrote root-owned files to the /opt/homelab
bind mount.
- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) to retain docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
  falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)
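The name resolution described above could look roughly like the sketch
below. It is an illustration, not the actual node-agent code: the
standalone function, its signature, and the assumption that the hash prefix
is exactly 12 lowercase hex characters followed by an underscore are all
hypothetical.

```python
import re

# Assumed prefix shape: 12 lowercase hex characters plus an underscore,
# mirroring the <12hex>_<name> pattern the observer prunes.
_HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")

COMPOSE_SERVICE_LABEL = "com.docker.compose.service"


def canonical_container_name(name: str, labels: dict) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix from name."""
    service = labels.get(COMPOSE_SERVICE_LABEL)
    if service:
        return service
    # No compose label: fall back to the container name without the prefix.
    return _HASH_PREFIX.sub("", name)
```

With the Docker SDK this would presumably be called as
canonical_container_name(c.name, c.labels).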
observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as a safety net for entries already in services.json and for
  future slippage
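A minimal sketch of the ghost-key pruning, assuming services.json is loaded
as a dict keyed by "<node>/<name>" strings. The function name and the exact
regex are assumptions for illustration, not the code in
_prune_stale_world().

```python
import re

# Assumed key layout for ghost entries: "<node>/<12 hex chars>_<name>".
_GHOST_KEY = re.compile(r"^[^/]+/[0-9a-f]{12}_.+$")


def prune_ghost_keys(services: dict) -> list:
    """Remove hash-prefixed service keys in place and return the removed keys."""
    # Collect first, then delete, so the dict is not mutated while iterating.
    ghosts = [key for key in services if _GHOST_KEY.match(key)]
    for key in ghosts:
        del services[key]
    return ghosts
```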
control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multiple service_healthy (or containers_not_running) events emitted in the
same second for different containers resolved to the same filename,
evt-{node}-{ts}-{type}.json, so each write silently overwrote the previous
one and the observer only ever saw the last container checked per event type
per cycle.
Fix: include a sanitized service name slug in the ID so every event gets a
unique file, e.g. evt-vps-1234-service_healthy-node-agent.json.
Also adds import re (required for re.sub in the slug generation).
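The ID scheme can be sketched as follows. The function name, its
parameters, and the exact sanitizing regex are assumptions; only the
filename layout and the use of re.sub come from the description above.

```python
import re


def event_filename(node: str, ts: int, event_type: str, service: str) -> str:
    """Build an event filename that is unique per service within one second."""
    # Collapse every run of characters outside [A-Za-z0-9_-] into a single '-'
    # so the slug is safe to use in a filename.
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", service).strip("-")
    return f"evt-{node}-{ts}-{event_type}-{slug}.json"
```

Two containers checked in the same second now map to different files
because the slug differs, even though node, timestamp, and type are equal.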
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- node_agent: emit service_healthy for all running managed containers so
  observer populates services.json (previously empty → supervisor flooded
  action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
  endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
  service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
  format from observer checkpoint (was reading old last_processed_file key,
  always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
  without resolving incidents (unlike service_recovered which also resolves)
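The difference between the two event types in the observer could be
sketched like this. The world layout, the field names (type, service,
status), and the incident structure are all hypothetical and chosen only to
illustrate the behaviour described in the last bullet.

```python
def apply_event(world: dict, event: dict) -> None:
    """Update service status; only service_recovered resolves open incidents."""
    service = world["services"].setdefault(event["service"], {})
    if event["type"] == "service_healthy":
        # Periodic heartbeat: mark healthy, leave incidents untouched.
        service["status"] = "healthy"
    elif event["type"] == "service_recovered":
        # Recovery: mark healthy and close any open incident for the service.
        service["status"] = "healthy"
        for incident in world["incidents"]:
            if incident["service"] == event["service"] and incident["status"] == "open":
                incident["status"] = "resolved"
```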
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When ~/.ssh is mounted from the host user oskar into a container that
runs as root, OpenSSH rejects ~/.ssh/config with 'Bad owner or
permissions' because the file's owner UID doesn't match the UID of the
running process.
Add -F /dev/null to the rsync SSH command to skip the config file
entirely. Also add UserKnownHostsFile=/dev/null so no known_hosts
write is attempted into a potentially read-only mounted .ssh dir.
The key itself (/root/.ssh/id_rsa) is still read as an implicit
default identity and is not affected by -F.
Reproduced on chelsty-infra (which has a ~/.ssh/config); the change is
safe for all nodes.
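A sketch of how the rsync invocation might be assembled with these SSH
options. The function names, the -a flag, and the source/destination
arguments are assumptions; only -F /dev/null and
UserKnownHostsFile=/dev/null come from this change.

```python
import shlex


def rsync_ssh_command() -> str:
    """Return the ssh command string passed to rsync via -e."""
    return shlex.join([
        "ssh",
        "-F", "/dev/null",                     # skip ~/.ssh/config entirely
        "-o", "UserKnownHostsFile=/dev/null",  # never write known_hosts
    ])


def build_rsync_args(src: str, dest: str) -> list:
    """Build an argv list suitable for subprocess.run (no shell involved)."""
    return ["rsync", "-a", "-e", rsync_ssh_command(), src, dest]
```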
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>