Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.
When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously kept retrying on every cycle while the node's
checkpoint stayed pinned — all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.
Complementary to the atomic_write_json fix on state files; this addresses
the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
rsync fails with "No such file or directory" when intermediate dirs
don't exist. /opt/homelab/deploy/ is not created by 20-base.sh.
Add rrun mkdir -p before rsync_dir; pi owns /opt/homelab so no sudo.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Push-based deploy step for LUSTRO (git_control=false): rsync
services/node-agent/ and the host override to /opt/homelab/deploy/node-agent/
on the remote, then docker compose up --build via SSH.
Guard by effect: skip push+build+up if node-agent container already running
(docker ps filter, not command -v). Verify: container running + events appear
in /opt/homelab/events/lustro/ within 90 s (confirms agent write path).
Override (hosts/lustro/runtime/node-agent/docker-compose.override.yml):
- group_add: ["991"] (docker GID on LUSTRO; 999 from base concatenated — harmless)
- mem_limit: 256m (MagicMirror ~1.9 GiB; agent must be bounded)
- /home/pi/.ssh:/root/.ssh:ro (not /home/oskar/.ssh — pi user)
- /opt/homelab/deploy/node-agent:/repo:ro (no repo checkout on push-based node)
- NODE_NAME=lustro, NODE_TYPE=sd_card, VPS_EVENTS_HOST=100.95.58.48
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stary guard porównywał literał konfigu (SIZE=) zamiast sprawdzać efekt.
Ręcznie postawiony zram był pomijany (dpkg -l vs command -v) i config
był nadpisywany niepotrzebnie.
- Guard by effect: sudo swapon --show | grep /dev/zram + dphys nieaktywny
→ cała sekcja skip bez wchodzenia w substages
- Detekcja pakietu przez dpkg -l zram-tools (nie command -v zramswap — PATH)
- Config: PERCENT=50 (skaluje z RAM) zamiast SIZE=; printf '%s\n' | sudo tee
- Wszystkie weryfikacje zram przez sudo swapon --show (nie zramctl)
- Usuń parsowanie hardware.swap.mb (nieużywane po przejściu na PERCENT)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three idempotent stages with guards (probe-before-mutate), rrun() for all
remote mutations, rprobe() for unconditional state queries. Reads
hardware.swap.mb from node.yaml (default 2048 MB). Adds swap.mb: 2048
to hosts/lustro/node.yaml so the value is declarative.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scripts/onboard/README.md (new):
- Tool purpose and --node/--step/--from/--dry-run usage
- Full node.yaml field schema with annotations (ssh_user uid-1000
gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates)
- Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO)
- lib/ architecture: run() dry-run convention, yaml_get fallback caveats
- Gotchas/Learnings table from session
CLAUDE.md:
- Node Onboarding section: onboard.sh commands, pointer to README
- Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature
work must happen in agent.sh worktrees, not the main checkout;
references the 2026-06-08 session that violated this
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On first SSH to a new mesh hostname, OpenSSH emits
"Warning: Permanently added 'lustro' to the list of known hosts"
on stderr. The previous code used 2>&1, merging it into the captured
arch variable, which caused the arch assertion to fail with
arch="Warning:Permanentlyadded...".
Fix:
- Add dedicated _TS_SSH opts array with -o LogLevel=ERROR, which
suppresses INFO-level messages (known-hosts, banner) at source
- Remove 2>&1 — stderr is no longer merged into the captured value
- Run only `uname -m` instead of `echo ok && uname -m`; take the last
non-empty stdout line to be robust against any remaining preamble
- Change arch mismatch from warn to die in live mode (warn in dry-run)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in the grep+sed fallback (triggered when yq is unavailable):
1. Greedy colon match: `s/.*: *//` consumed the *last* `: ` in the line, so
values containing a colon (e.g. `systemd:magicmirror.service`) were
silently truncated to the portion after the last colon.
Fix: `s/^[[:space:]]*[^:]*:[[:space:]]*//' — anchored at line start,
key chars are `[^:]*` (no colons), so only the first `: ` separator is removed.
2. Inline YAML comment not stripped: `first_contact: pi@pimirror2.local # ...`
returned the full tail including `#`, breaking callers like ssh-copy-id.
Fix: add `s/[[:space:]]\+#.*$//` — requires at least one space before `#`
to preserve bare `#` characters inside a value.
Also add leading/trailing whitespace trim as a separate pass.
Both bugs affect any node.yaml field that has an inline comment or a colon
in its value; all ten fields in hosts/lustro/node.yaml now parse correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DRY_RUN now uses 1/0 instead of "true"/"false" across all onboard scripts.
common.sh: add run() — wraps mutations; prints "[dry-run] would: ..." when
DRY_RUN=1. Exported via `export -f run` so child bash processes inherit it.
onboard.sh: remove the `--dry-run → dryrun "Would execute" → continue` bypass.
Steps now always execute; DRY_RUN=1 is exported so each step's own run()
calls handle simulation. The orchestrator no longer needs to know step internals.
remote.sh: update DRY_RUN checks to [ "${DRY_RUN:-0}" = 1 ] for consistency.
00-access.sh: remove all if/else DRY_RUN blocks; replace with:
- Mutations (ssh-copy-id, curl install, tailscale up) wrapped in run()
- Probes (SSH BatchMode test, command -v, _ts_state) run unconditionally
so dry-run reports real current state ("key present → skip" vs "would: ...")
- Stage 3 verify runs always; SSH failure is die in live mode, warn in
dry-run (Tailscale not yet joined is expected on a fresh node)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
00-access.sh implements a 3-stage idempotent access bootstrap:
1. ensure_ssh_key — ssh-copy-id to first_contact (pi@pimirror2.local),
skips if BatchMode key-auth already passes
2. ensure_tailscale — install via install.sh if missing, then tailscale up
--hostname=lustro; prints interactive auth URL to operator, blocks until
authenticated; skips if BackendState already Running
3. verify — SSH over Tailscale to pi@lustro, asserts 'ok' + arch=aarch64
Reads first_contact and tailscale.hostname from node.yaml.
Respects --dry-run. No NOPASSWD or /opt/homelab mutations.
hosts/lustro/node.yaml: fill known hardware facts (arm64, 4096 MB RAM,
zram swap, docker_present, mm_runtime=systemd:magicmirror.service),
add ssh_user=pi, first_contact=pi@pimirror2.local,
services.node-agent.runtime engine=docker mem_limit=256m.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
grep -cv (and grep -v) return exit code 1 when there are zero matches.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.
Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail
was silently exiting after the loop because the failing-test compound
propagated exit code 1 through the function return.
Add ERR trap so future silent fails get diagnosed at the source.
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two root causes for stale "active" incidents on the dashboard:
1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
can be an ISO-8601 string (stability-agent via events.py) or a Unix
int (node-agent). The previous session's auto-resolve did plain
`time.time() - last_occ` which raises TypeError for strings,
silently preventing _save_world() from being called and leaving
incidents perpetually "active" on disk.
Fix: add _parse_ts(ts) -> float that handles int, float, and
ISO-8601 strings uniformly. All timestamp arithmetic now goes through
it; returns 0.0 on None / garbage to keep comparisons safe.
2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
and marks the incident "resolved" in memory, but if incidents.json was
truncated mid-write (pre-atomic-write era), the observer loaded it at
next startup with status="active" and no service entry pointing to it.
No code ever touched these orphans again.
Fix: _prune_stale_world now runs two cleanup passes each cycle:
- Case 1 (healthy-linked): service.status=="healthy" AND incident_id
still set → resolve immediately (service cannot have active incident)
- Case 2 (orphaned): active incident with no service link AND
last_occurrence > 5 min ago → resolve (5-min guard for creation race)
Both cases are wrapped in try/except so a bug here never crashes the
observer loop or blocks _save_world.
Also fixes the 7-day stale-incident prune to use _parse_ts so
ISO-string resolved_at values are handled correctly.
3. Operator UI: current_incidents() now filters to status=="active" only.
Resolved incidents were previously included in the /incidents endpoint,
making the dashboard show a wall of historical records as if active.
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.
Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)
observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage
control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.
Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- node_agent: emit service_healthy for all running managed containers so
observer populates services.json (previously empty → supervisor flooded
action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
format from observer checkpoint (was reading old last_processed_file key,
always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
without resolving incidents (unlike service_recovered which also resolves)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events. Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/). Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.
New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
the node dir from the path on first load
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
Inside a Docker container this returns the 12-char container ID (e.g.
'be17cb6eb0f6'), not the host name. Observer ingested those events and
created ghost entries in world/nodes.json that never expired.
observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
from topology inventory; called on every run_once() cycle (both new-events
and idle paths). Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
so the Dashboard's System Overview cards show real numbers instead of undefined.
operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
as keyed dicts; the frontend calls .map() which requires an array. All four
functions now convert the dict to a properly-shaped list. Each item has the
fields the Nodes, Services, Topology, Deployments, and Correlation views expect
(hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
default 24). Without this, every event file ever written was returned,
including events from ghost-node deploys.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list
- orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure
- deploy-node.sh: service discovery falls back to services.yaml if no services.txt
- deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>