The _check_control_plane_health() method probes localhost:18180, which
is the control-plane's mapped port. Inside a bridged container, localhost
resolves to the container's own loopback, so the probe always fails.
Host network mode shares the VPS host's network namespace, so
localhost:18180 correctly reaches the control-plane.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- node_agent: emit service_healthy for all running managed containers so
observer populates services.json (previously empty → supervisor flooded
action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
format from observer checkpoint (was reading old last_processed_file key,
always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
without resolving incidents (unlike service_recovered which also resolves)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events. Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/). Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.
New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
the node dir from the path on first load
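
A minimal sketch of the per-node cursor described above, assuming event
paths look like events/<node>/<file> and the checkpoint is a plain dict.
The function names (node_dir_of, migrate_checkpoint, new_events) are
hypothetical and not taken from observer.py.

```python
from pathlib import PurePosixPath


def node_dir_of(path: str) -> str:
    """Return the node directory, e.g. 'piha' for 'events/piha/001.json'.

    Assumes the layout events/<node>/<file>; returns '' otherwise.
    """
    parts = PurePosixPath(path).parts
    return parts[1] if len(parts) >= 3 else ""


def migrate_checkpoint(checkpoint: dict) -> dict:
    """Convert an old single-cursor checkpoint into the per-node format."""
    migrated = dict(checkpoint)
    node_checkpoints = dict(migrated.get("node_checkpoints", {}))
    old = migrated.pop("last_processed_file", None)
    if old:
        node = node_dir_of(old)
        # Only seed the cursor for the node the old path belonged to.
        if node and node not in node_checkpoints:
            node_checkpoints[node] = old
    migrated["node_checkpoints"] = node_checkpoints
    return migrated


def new_events(all_paths, node_checkpoints):
    """Return paths that sort after their own node's cursor."""
    fresh = []
    for path in sorted(all_paths):
        last = node_checkpoints.get(node_dir_of(path), "")
        if path > last:
            fresh.append(path)
    return fresh
```

With a migrated checkpoint that only knows about vps/, every piha/ file
counts as new, because a node without a cursor compares against the
empty string.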
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When ~/.ssh is mounted from the host oskar user into a container that
runs as root, OpenSSH rejects ~/.ssh/config with 'Bad owner or
permissions' because the file UID doesn't match the running process.
Add -F /dev/null to the rsync SSH command to skip the config file
entirely. Also add UserKnownHostsFile=/dev/null so no known_hosts
write is attempted into a potentially read-only mounted .ssh dir.
The key itself (/root/.ssh/id_rsa) is still read as an implicit
default identity and is not affected by -F.
Reproduces on chelsty-infra (has ~/.ssh/config); safe for all nodes.
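
A rough sketch of how the rsync invocation could be assembled with the
two SSH options named above. The function name, its parameters, and the
-a flag are assumptions for illustration; the real code in node_agent
may build the command differently. How host-key checking is configured
is not stated here, so the sketch leaves it out.

```python
import shlex


def build_rsync_command(src_dir: str, remote: str, dest_dir: str) -> list[str]:
    """Build an rsync argv that ships src_dir to remote:dest_dir over SSH.

    remote is a 'user@host' string (hypothetical parameter).
    """
    ssh_cmd = [
        "ssh",
        "-F", "/dev/null",                     # skip ~/.ssh/config entirely
        "-o", "UserKnownHostsFile=/dev/null",  # never write known_hosts
    ]
    return [
        "rsync", "-a",
        "-e", shlex.join(ssh_cmd),       # rsync takes the remote shell as one string
        f"{src_dir.rstrip('/')}/",       # trailing slash: copy directory contents
        f"{remote}:{dest_dir}",
    ]
```

The returned list can be passed to subprocess.run() without shell=True.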
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
100.108.208.3 is piha's Tailscale IP (piha hosts Forgejo+Redis).
VPS's actual Tailscale IP is 100.95.58.48. All three node-agent
overrides were pointing at piha itself, causing containers to SSH
to their own host and fail auth.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Nodes ship events to VPS via rsync+SSH. The container runs as root
and uses the default SSH identity, which must be at /root/.ssh/.
Mount /home/oskar/.ssh from the host read-only so the existing
authorized key is available inside the container.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
Inside a Docker container this returns the 12-char container ID (e.g.
'be17cb6eb0f6'), not the host name. Observer ingested those events and
created ghost entries in world/nodes.json that never expired.
observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
from topology inventory; called on every run_once() cycle (both new-events
and idle paths). Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
so the Dashboard's System Overview cards show real numbers instead of undefined.
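
The pruning rule can be sketched roughly as follows. The world-state
schema used here (a 'node' field on services and incidents, an ISO 8601
'resolved_at' timestamp) is an assumption, not the observer's actual
format, and prune_stale_world is a hypothetical stand-in for
_prune_stale_world().

```python
from datetime import datetime, timedelta, timezone

RESOLVED_MAX_AGE = timedelta(days=7)


def prune_stale_world(world: dict, inventory_nodes, now=None) -> dict:
    """Drop entries for nodes missing from the inventory and old resolved incidents."""
    now = now or datetime.now(timezone.utc)
    known = set(inventory_nodes)

    nodes = {n: v for n, v in world.get("nodes", {}).items() if n in known}
    services = {
        k: v for k, v in world.get("services", {}).items() if v.get("node") in known
    }

    incidents = {}
    for key, inc in world.get("incidents", {}).items():
        if inc.get("node") not in known:
            continue  # incident belongs to a ghost node
        resolved_at = inc.get("resolved_at")
        if resolved_at and now - datetime.fromisoformat(resolved_at) > RESOLVED_MAX_AGE:
            continue  # resolved more than 7 days ago
        incidents[key] = inc

    return {"nodes": nodes, "services": services, "incidents": incidents}
```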
operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
as keyed dicts; the frontend calls .map() which requires an array. All four
functions now convert the dict to a properly-shaped list. Each item has the
fields the Nodes, Services, Topology, Deployments, and Correlation views expect
(hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
default 24). Without this, every event file ever written was returned,
including events from ghost-node deploys.
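
For illustration, the dict-to-list conversion and the synthesised
incident message might look like the sketch below. The field defaults
and the message wording are assumptions; only the field names
(hostname, health, capabilities, node, service, trigger_type) come from
the description above.

```python
def nodes_as_list(nodes: dict) -> list[dict]:
    """Turn {name: data} into a list the frontend can call .map() on."""
    items = []
    for name, data in sorted(nodes.items()):
        items.append({
            "hostname": name,
            "health": data.get("health", "unknown"),
            "capabilities": data.get("capabilities", []),
        })
    return items


def incident_message(incident: dict) -> str:
    """Build a human-readable message from node, service and trigger_type."""
    node = incident.get("node", "unknown node")
    service = incident.get("service", "unknown service")
    trigger = incident.get("trigger_type", "unknown trigger")
    return f"{service} on {node}: {trigger}"
```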
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- piha: NODE_TYPE=sd_card (rate-limited docker prune, once per day)
- solaria: NODE_TYPE=ai_node (dangling+containers+build cache; never -a to preserve Ollama images)
- chelsty-infra: NODE_TYPE=lte_node (NO cleanup, events-only)
- All three: VPS_EVENTS_HOST set for event shipping via rsync+SSH
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
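
A small sketch of the routing and alias normalization, assuming the
alias map has already been parsed into a dict (the NODE_ALIAS_MAP
format is not shown here) and that incidents carry node, service and
trigger_type fields. choose_action and normalize_node are hypothetical
names.

```python
# Triggers that only need the container restarted, per the list above.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def normalize_node(name: str, alias_map: dict) -> str:
    """Map an alias to its canonical node name; unknown names pass through."""
    return alias_map.get(name, name)


def choose_action(incident: dict, alias_map: dict) -> dict:
    """Pick container_restart for restartable triggers, redeploy otherwise."""
    if incident.get("trigger_type") in RESTART_TRIGGERS:
        action_type = "container_restart"
    else:
        action_type = "redeploy"
    return {
        "type": action_type,
        "node": normalize_node(incident["node"], alias_map),
        "service": incident["service"],
    }
```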
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The timestamp in reconcile-{ts}-{node}-{service} made every name unique,
so the dedup guard never fired. Switch to reconcile-{node}-{service} and
check the pending/approved/running states for an existing entry.
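
The guard can be sketched as below, assuming actions are tracked as
sets of names keyed by state. That data shape and the function names
are assumptions for illustration only.

```python
ACTIVE_STATES = ("pending", "approved", "running")


def reconcile_id(node: str, service: str) -> str:
    """Stable name without a timestamp, so repeats collide on purpose."""
    return f"reconcile-{node}-{service}"


def should_enqueue(node: str, service: str, actions_by_state: dict) -> bool:
    """Return False if the same reconcile action is already in flight."""
    action_id = reconcile_id(node, service)
    return not any(
        action_id in actions_by_state.get(state, ()) for state in ACTIVE_STATES
    )
```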
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- split CHELSTY into CHELSTY-INFRA and CHELSTY-HA in node roles table
- correct docker-compose override path to hosts/<node>/runtime/<service>/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list
- orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure
- deploy-node.sh: service discovery falls back to services.yaml if no services.txt
- deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>