Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.
Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
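
A minimal sketch of the quarantine move described above, assuming a helper named `quarantine_event` and a numeric-suffix collision scheme; both are illustrative and not taken from the observer code.

```python
import logging
import os
from pathlib import Path
from typing import Optional

log = logging.getLogger("observer")


def quarantine_event(event_path: Path, state_dir: Path, node: str) -> Optional[Path]:
    """Move a malformed event file out of the hot path; return its new location."""
    target_dir = state_dir / "observer_failed_events" / node
    try:
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / event_path.name
        suffix = 0
        # Collision handling (assumed scheme): never overwrite an earlier
        # quarantined file, append .1, .2, ... instead.
        while target.exists():
            suffix += 1
            target = target_dir / f"{event_path.name}.{suffix}"
        # os.replace is atomic when source and target share a filesystem.
        os.replace(event_path, target)
        return target
    except OSError as exc:
        # A failed move is logged but must not crash the observer loop.
        log.warning("quarantine failed for %s: %s", event_path, exc)
        return None
```

The caller would advance the checkpoint whether or not the move succeeded, so one bad file cannot block later events.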
Port 8087 is no longer mapped to the host. Operator verify commands
that used curl http://localhost:8087/health now use docker exec with
Python's urllib (the image is python:3.11-slim, no curl binary).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port 8087 conflicted with zigbee2mqtt on piha (8087:8080 mapping active
for 7+ days), preventing ha-diag-agent from starting.
Grep across the full repo confirms no external consumer (no nginx/npm
proxy, no Prometheus scrape, no control-plane reference) uses this port.
The Docker healthcheck runs inside the container network namespace and
does not require a host-side mapping. Internal FastAPI binding on 8087
is unchanged.
Removed: ports section from docker-compose.yml and service.yaml.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- main.py: rename event= to ha_event= in _log.warning() — structlog treats
'event' as a reserved positional arg; the old name caused TypeError when
any check returned unhealthy results (events were still emitted, but the
check was logged as check_error instead of check_unhealthy)
- tests/test_ha_client.py: replace aioresponses with unittest.mock — aioresponses
0.7.8 is incompatible with aiohttp >=3.12 (missing stream_writer kwarg)
- pyproject.toml: remove aioresponses from dev dependencies
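
The keyword collision in main.py can be reproduced without structlog. The sketch below assumes only what the message states, that the logger method takes `event` as its first positional parameter; `warning` and `log_unhealthy` are hypothetical stand-ins, not structlog's implementation.

```python
def warning(event, **fields):
    """Stand-in for a logger method whose first positional parameter is `event`."""
    return {"event": event, **fields}


def log_unhealthy(use_old_name: bool):
    payload = {"type": "check_failed"}
    if use_old_name:
        # Old call: the keyword collides with the positional parameter,
        # so Python raises TypeError before the logger body runs.
        return warning("check_unhealthy", event=payload)
    # Fixed call: a distinct keyword name is passed through as a plain field.
    return warning("check_unhealthy", ha_event=payload)
```

Calling `log_unhealthy(True)` raises `TypeError`, while `log_unhealthy(False)` returns the record with an `ha_event` field.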
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent runs as uid 1000 (homelab) since the base compose sets
user "1000:1000"; ssh in _ship_events_to_vps() has no -i flag and looks
for keys in $HOME/.ssh = /home/homelab/.ssh. The old mount target
/root/.ssh was never consulted, so rsync to VPS failed with
'Permission denied'. uid match (pi=1000 on RPi OS) keeps OpenSSH strict
ownership checks happy.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
rsync fails with "No such file or directory" when intermediate dirs
don't exist. /opt/homelab/deploy/ is not created by 20-base.sh.
Add rrun mkdir -p before rsync_dir; pi owns /opt/homelab so no sudo.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Push-based deploy step for LUSTRO (git_control=false): rsync
services/node-agent/ and the host override to /opt/homelab/deploy/node-agent/
on the remote, then docker compose up --build via SSH.
Guard by effect: skip push+build+up if node-agent container already running
(docker ps filter, not command -v). Verify: container running + events appear
in /opt/homelab/events/lustro/ within 90 s (confirms agent write path).
Override (hosts/lustro/runtime/node-agent/docker-compose.override.yml):
- group_add: ["991"] (docker GID on LUSTRO; 999 from base concatenated — harmless)
- mem_limit: 256m (MagicMirror ~1.9 GiB; agent must be bounded)
- /home/pi/.ssh:/root/.ssh:ro (not /home/oskar/.ssh — pi user)
- /opt/homelab/deploy/node-agent:/repo:ro (no repo checkout on push-based node)
- NODE_NAME=lustro, NODE_TYPE=sd_card, VPS_EVENTS_HOST=100.95.58.48
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old guard compared a config literal (SIZE=) instead of checking the effect.
A manually set-up zram was missed (dpkg -l vs command -v) and the config
was overwritten unnecessarily.
- Guard by effect: sudo swapon --show | grep /dev/zram + dphys inactive
→ skip the whole section without entering substages
- Package detection via dpkg -l zram-tools (not command -v zramswap: PATH issue)
- Config: PERCENT=50 (scales with RAM) instead of SIZE=; printf '%s\n' | sudo tee
- All zram verification via sudo swapon --show (not zramctl)
- Remove parsing of hardware.swap.mb (unused after the switch to PERCENT)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three idempotent stages with guards (probe-before-mutate), rrun() for all
remote mutations, rprobe() for unconditional state queries. Reads
hardware.swap.mb from node.yaml (default 2048 MB). Adds swap.mb: 2048
to hosts/lustro/node.yaml so the value is declarative.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ECC-format skill for the node onboarding workflow. Covers full step
sequence, operational rules, node.yaml key fields, gotchas from LUSTRO
session, and Definition of Done. Marked as living doc — SCAFFOLD sections
to be promoted to PROVEN as steps land on real nodes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
scripts/onboard/README.md (new):
- Tool purpose and --node/--step/--from/--dry-run usage
- Full node.yaml field schema with annotations (ssh_user uid-1000
gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates)
- Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO)
- lib/ architecture: run() dry-run convention, yaml_get fallback caveats
- Gotchas/Learnings table from session
CLAUDE.md:
- Node Onboarding section: onboard.sh commands, pointer to README
- Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature
work must happen in agent.sh worktrees, not the main checkout;
references the 2026-06-08 session that violated this
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Records the onboarding session for LUSTRO (RPi4, KEN site):
node facts from preflight, key decisions (user pi/uid-1000, IP
over mDNS, zram target), 00-access status, tool bugs fixed
(dry-run propagation, yaml_get greedy-colon + inline comment,
ssh known-hosts in verify), open items for next session
(worktree hygiene first, bootstrap-runtime, node-agent, register,
verify, mm-watch).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On first SSH to a new mesh hostname, OpenSSH emits
"Warning: Permanently added 'lustro' to the list of known hosts"
on stderr. The previous code used 2>&1, merging it into the captured
arch variable, which caused the arch assertion to fail with
arch="Warning:Permanentlyadded...".
Fix:
- Add dedicated _TS_SSH opts array with -o LogLevel=ERROR, which
suppresses INFO-level messages (known-hosts, banner) at source
- Remove 2>&1 — stderr is no longer merged into the captured value
- Run only `uname -m` instead of `echo ok && uname -m`; take the last
non-empty stdout line to be robust against any remaining preamble
- Change arch mismatch from warn to die in live mode (warn in dry-run)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in the grep+sed fallback (triggered when yq is unavailable):
1. Greedy colon match: `s/.*: *//` consumed the *last* `: ` in the line, so
values containing a colon (e.g. `systemd:magicmirror.service`) were
silently truncated to the portion after the last colon.
Fix: `s/^[[:space:]]*[^:]*:[[:space:]]*//` — anchored at line start,
key chars are `[^:]*` (no colons), so only the first `: ` separator is removed.
2. Inline YAML comment not stripped: `first_contact: pi@pimirror2.local # ...`
returned the full tail including `#`, breaking callers like ssh-copy-id.
Fix: add `s/[[:space:]]\+#.*$//` — requires at least one space before `#`
to preserve bare `#` characters inside a value.
Also add leading/trailing whitespace trim as a separate pass.
Both bugs affect any node.yaml field that has an inline comment or a colon
in its value; all ten fields in hosts/lustro/node.yaml now parse correctly.
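
For illustration, the two sed fixes translate to Python regular expressions as below. This is a sketch of the same parsing rules, not the shell fallback itself; the function names are hypothetical.

```python
import re


def greedy_value(line: str) -> str:
    """Old behaviour: `.*` is greedy, so everything up to the LAST colon is removed."""
    return re.sub(r".*: *", "", line)


def yaml_scalar_value(line: str) -> str:
    """Fixed behaviour: strip only the key, then an inline comment, then whitespace."""
    # The key may not contain a colon, so only the first separator is consumed.
    value = re.sub(r"^\s*[^:]*:\s*", "", line)
    # Require whitespace before '#' so a bare '#' inside a value survives.
    value = re.sub(r"\s+#.*$", "", value)
    return value.strip()
```

With the line `mm_runtime: systemd:magicmirror.service`, `greedy_value` keeps only `magicmirror.service`, while `yaml_scalar_value` returns the full `systemd:magicmirror.service`.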
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DRY_RUN now uses 1/0 instead of "true"/"false" across all onboard scripts.
common.sh: add run() — wraps mutations; prints "[dry-run] would: ..." when
DRY_RUN=1. Exported via `export -f run` so child bash processes inherit it.
onboard.sh: remove the `--dry-run → dryrun "Would execute" → continue` bypass.
Steps now always execute; DRY_RUN=1 is exported so each step's own run()
calls handle simulation. The orchestrator no longer needs to know step internals.
remote.sh: update DRY_RUN checks to [ "${DRY_RUN:-0}" = 1 ] for consistency.
00-access.sh: remove all if/else DRY_RUN blocks; replace with:
- Mutations (ssh-copy-id, curl install, tailscale up) wrapped in run()
- Probes (SSH BatchMode test, command -v, _ts_state) run unconditionally
so dry-run reports real current state ("key present → skip" vs "would: ...")
- Stage 3 verify runs always; SSH failure is die in live mode, warn in
dry-run (Tailscale not yet joined is expected on a fresh node)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
00-access.sh implements a 3-stage idempotent access bootstrap:
1. ensure_ssh_key — ssh-copy-id to first_contact (pi@pimirror2.local),
skips if BatchMode key-auth already passes
2. ensure_tailscale — install via install.sh if missing, then tailscale up
--hostname=lustro; prints interactive auth URL to operator, blocks until
authenticated; skips if BackendState already Running
3. verify — SSH over Tailscale to pi@lustro, asserts 'ok' + arch=aarch64
Reads first_contact and tailscale.hostname from node.yaml.
Respects --dry-run. No NOPASSWD or /opt/homelab mutations.
hosts/lustro/node.yaml: fill known hardware facts (arm64, 4096 MB RAM,
zram swap, docker_present, mm_runtime=systemd:magicmirror.service),
add ssh_user=pi, first_contact=pi@pimirror2.local,
services.node-agent.runtime engine=docker mem_limit=256m.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stability-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.
- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) to retain docker.sock:ro access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.
- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) to retain docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
false sudo on every deploy.
- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) so executor retains docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
grep -cv (and grep -v) return exit code 1 when there are zero matches.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.
Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
Refactor the [ test ] && prefail pattern to if/then/fi: when the test is
false the && list returns 1, and as the last command in the function that
status became the function's return value, so set -euo pipefail exited
silently after the loop.
Add ERR trap so future silent fails get diagnosed at the source.
Encodes branch hygiene for CC running in task worktrees: commit only to
assigned branch, no push origin master, no touching main checkout, no
git add -A, no worktree management, mandatory final report.
Records session facts (git log, diff --stat, deploys from transcript)
by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative
placeholder. Never touches backlog.md or CLAUDE.md without explicit
instruction. Commits only the session file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instructs CC to always route deploy/redeploy/ship/wdróż requests through
scripts/deploy/deploy.sh, maps exit codes to required actions, and
enforces no-bypass rules for gate and branch checks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.
Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits
When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo. If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.
Smoke-tested in Docker (ubuntu:22.04):
Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓
Case 2 (root-owned subdir): chown triggered ✓
Case 3 (700 dir perms): chmod triggered ✓
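
The probe-before-mutate idea can be sketched in Python as follows. This is an illustrative equivalent of the `find ... -print -quit` checks, not the deploy-local.sh code; both function names are hypothetical, and the walk stops at the first offending path just as `-quit` does.

```python
import os
from typing import Optional


def first_wrong_owner(root: str, uid: int, gid: int) -> Optional[str]:
    """Return the first directory or file under root not owned by uid:gid."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # Every directory is visited once as dirpath, so checking dirpath
        # plus its files covers the tree.
        for path in [dirpath] + [os.path.join(dirpath, n) for n in filenames]:
            st = os.lstat(path)
            if st.st_uid != uid or st.st_gid != gid:
                return path
    return None


def first_dir_missing_bits(root: str, bits: int = 0o775) -> Optional[str]:
    """Return the first directory under root lacking any of the given mode bits."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.lstat(dirpath).st_mode & 0o7777
        if mode & bits != bits:
            return dirpath
    return None
```

A `None` result corresponds to the "already correct, skipping" path; any other result means the sudo remediation is needed.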
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two root causes for stale "active" incidents on the dashboard, plus an operator UI fix:
1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
can be an ISO-8601 string (stability-agent via events.py) or a Unix
int (node-agent). The previous session's auto-resolve did plain
`time.time() - last_occ` which raises TypeError for strings,
silently preventing _save_world() from being called and leaving
incidents perpetually "active" on disk.
Fix: add _parse_ts(ts) -> float that handles int, float, and
ISO-8601 strings uniformly. All timestamp arithmetic now goes through
it; returns 0.0 on None / garbage to keep comparisons safe.
2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
and marks the incident "resolved" in memory, but if incidents.json was
truncated mid-write (pre-atomic-write era), the observer loaded it at
next startup with status="active" and no service entry pointing to it.
No code ever touched these orphans again.
Fix: _prune_stale_world now runs two cleanup passes each cycle:
- Case 1 (healthy-linked): service.status=="healthy" AND incident_id
still set → resolve immediately (service cannot have active incident)
- Case 2 (orphaned): active incident with no service link AND
last_occurrence > 5 min ago → resolve (5-min guard for creation race)
Both cases are wrapped in try/except so a bug here never crashes the
observer loop or blocks _save_world.
Also fixes the 7-day stale-incident prune to use _parse_ts so
ISO-string resolved_at values are handled correctly.
3. Operator UI: current_incidents() now filters to status=="active" only.
Resolved incidents were previously included in the /incidents endpoint,
making the dashboard show a wall of historical records as if active.
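
A minimal sketch of a timestamp normaliser in the spirit of `_parse_ts`, assuming that naive ISO-8601 strings are UTC and that a trailing `Z` should be accepted; neither assumption is confirmed by the message above, and the name `parse_ts` is illustrative.

```python
from datetime import datetime, timezone


def parse_ts(ts) -> float:
    """Normalise a Unix int/float or ISO-8601 string to epoch seconds; 0.0 on garbage."""
    if isinstance(ts, bool):
        # bool is a subclass of int; treat it as garbage rather than 0/1.
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        try:
            parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            return 0.0
        if parsed.tzinfo is None:
            # Assumption: timestamps without an offset are UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.timestamp()
    return 0.0
```

With this in place, `time.time() - parse_ts(last_occ)` is always float arithmetic, so the subtraction cannot raise `TypeError` regardless of which agent wrote the value.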
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action
All 32 tests (11 new + 21 existing) pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:
1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
all bare open('w')+json.dump calls in supervisor and executor, so the
action-file pipeline is never visible in a half-written state.
2. Resilient loader: _load_actual_state now returns False when any world
state file fails to parse (empty or truncated mid-write). reconcile()
skips the entire drift check on False instead of treating {} as "all
services missing". actual_state retains its last-known-good values so
a single bad cycle does not wipe accumulated context.
Before: parse error → raw[key]={} → all desired services missing →
wall of redeploy actions → drift_resolved_auto churn on next cycle.
After: parse error → WARNING logged → cycle skipped → no actions.
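
The skip-cycle semantics can be sketched as below. `StateLoader`, the file names passed to it, and the `redeploy:<service>` action format are assumptions made for the example, not the supervisor's actual structures.

```python
import json
from pathlib import Path


class StateLoader:
    """Loads world-state files; keeps last-known-good values on any parse failure."""

    def __init__(self, state_dir, names):
        self.state_dir = Path(state_dir)
        self.names = list(names)
        self.actual_state = {}

    def load(self) -> bool:
        fresh = {}
        for name in self.names:
            try:
                fresh[name] = json.loads((self.state_dir / name).read_text())
            except (OSError, json.JSONDecodeError):
                # Empty or truncated file: report failure, leave actual_state untouched.
                return False
        self.actual_state = fresh
        return True


def reconcile(loader: StateLoader, desired):
    """Return redeploy actions for missing services, or nothing on a bad cycle."""
    if not loader.load():
        return []  # skip the whole drift check instead of treating {} as "all missing"
    services = loader.actual_state.get("services.json", {})
    return [f"redeploy:{name}" for name in desired if name not in services]
```

Committing `fresh` only after every file has parsed is what keeps a single bad read from wiping accumulated context.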
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.
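
The write, fsync, replace sequence looks roughly like this. It is a sketch under the assumption that the temporary file is simply `<path>.tmp`; the real `_atomic_write_json` may differ in naming and error handling.

```python
import json
import os


def atomic_write_json(path: str, data) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    tmp_path = f"{path}.tmp"  # sibling file, so os.replace stays on one filesystem
    try:
        with open(tmp_path, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Do not leave a stray .tmp behind if serialisation or the rename failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens `path` at any moment gets a fully written document, because the target name only ever points at a completed file.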
Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Lesson from brain-watchdog: code that was never run had a packaging bug
that caused a crash loop in production. New rule: docker build + short
smoke-run + pytest before any commit or deploy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7 cases: package importable, fresh ok, stale, unreachable, HTTP error,
missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src
so tests run without PYTHONPATH set in the environment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WORKDIR is /app but the package lives under src/; without PYTHONPATH set
`python -m brain_watchdog.main` raised ModuleNotFoundError on startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Polls /summary on VPS over Tailscale every 60s; computes freshness
locally from last_update epoch (never trusts self-reported status).
Alerts via Telegram Bot API directly after 3 consecutive failures;
sends recovery message on heal. State (fail_count, alerted) persisted
to volume so debounce survives restarts.
- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries
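
The freshness check and alert debounce described above can be sketched as a small state machine. The 300 s staleness limit and the names `WatchState`, `is_fresh` and `step` are assumptions for the example; only the 3-failure threshold and the `fail_count`/`alerted` fields come from the message.

```python
from dataclasses import dataclass
from typing import Optional

FAIL_THRESHOLD = 3   # consecutive failures before alerting (from the message above)
MAX_AGE_S = 300      # assumed staleness limit, not taken from the service


@dataclass
class WatchState:
    fail_count: int = 0
    alerted: bool = False


def is_fresh(summary: dict, now: float, max_age_s: float = MAX_AGE_S) -> bool:
    """Judge freshness locally from last_update; ignore any self-reported status."""
    try:
        last_update = float(summary["last_update"])
    except (KeyError, TypeError, ValueError):
        return False
    return now - last_update <= max_age_s


def step(state: WatchState, ok: bool) -> Optional[str]:
    """Advance the debounce state; return 'alert', 'recovery', or None."""
    if ok:
        message = "recovery" if state.alerted else None
        state.fail_count = 0
        state.alerted = False
        return message
    state.fail_count += 1
    if state.fail_count >= FAIL_THRESHOLD and not state.alerted:
        state.alerted = True
        return "alert"
    return None
```

Persisting `WatchState` to the volume after each `step` is what lets the debounce survive a container restart.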
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>