Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
a needless sudo handoff on every deploy.
- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) so executor retains docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
grep -cv (and grep -v) exit with status 1 when no lines are selected.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.
Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail
was silently exiting after the loop because the failing-test compound
propagated exit code 1 through the function return.
Add ERR trap so future silent fails get diagnosed at the source.
Encodes branch hygiene for CC running in task worktrees: commit only to
assigned branch, no push origin master, no touching main checkout, no
git add -A, no worktree management, mandatory final report.
Records session facts (git log, diff --stat, deploys from transcript)
by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative
placeholder. Never touches backlog.md or CLAUDE.md without explicit
instruction. Commits only the session file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instructs CC to always route deploy/redeploy/ship/wdróż requests through
scripts/deploy/deploy.sh, maps exit codes to required actions, and
enforces no-bypass rules for gate and branch checks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.
Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits
When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo. If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.
Smoke-tested in Docker (ubuntu:22.04):
Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓
Case 2 (root-owned subdir): chown triggered ✓
Case 3 (700 dir perms): chmod triggered ✓
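
For illustration only, the same "check first, escalate only if needed" logic
can be sketched in Python. deploy-local.sh itself uses `find ... -print -quit`;
the helper names below (iter_entries, needs_chown, needs_chmod) are hypothetical
and not part of the repo.

```python
import os
import stat


def iter_entries(root):
    """Yield root itself and every file/directory below it."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def needs_chown(root, uid=1000, gid=1000):
    """True as soon as one entry is not owned by uid:gid (early exit)."""
    for path in iter_entries(root):
        st = os.lstat(path)
        if st.st_uid != uid or st.st_gid != gid:
            return True
    return False


def needs_chmod(root, required=0o775):
    """True as soon as one directory lacks any of the required mode bits."""
    for path in iter_entries(root):
        st = os.lstat(path)
        if stat.S_ISDIR(st.st_mode) and (st.st_mode & required) != required:
            return True
    return False
```

A caller would invoke the privileged chown/chmod only when the matching check
returns True, and log "already correct, skipping" otherwise.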
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two root causes for stale "active" incidents on the dashboard, plus a related UI fix (item 3):
1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
can be an ISO-8601 string (stability-agent via events.py) or a Unix
int (node-agent). The previous session's auto-resolve did plain
`time.time() - last_occ` which raises TypeError for strings,
silently preventing _save_world() from being called and leaving
incidents perpetually "active" on disk.
Fix: add _parse_ts(ts) -> float that handles int, float, and
ISO-8601 strings uniformly. All timestamp arithmetic now goes through
it; returns 0.0 on None / garbage to keep comparisons safe.
2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
and marks the incident "resolved" in memory, but if incidents.json was
truncated mid-write (pre-atomic-write era), the observer loaded it at
next startup with status="active" and no service entry pointing to it.
No code ever touched these orphans again.
Fix: _prune_stale_world now runs two cleanup passes each cycle:
- Case 1 (healthy-linked): service.status=="healthy" AND incident_id
still set → resolve immediately (service cannot have active incident)
- Case 2 (orphaned): active incident with no service link AND
last_occurrence > 5 min ago → resolve (5-min guard for creation race)
Both cases are wrapped in try/except so a bug here never crashes the
observer loop or blocks _save_world.
Also fixes the 7-day stale-incident prune to use _parse_ts so
ISO-string resolved_at values are handled correctly.
3. Operator UI: current_incidents() now filters to status=="active" only.
Resolved incidents were previously included in the /incidents endpoint,
making the dashboard show a wall of historical records as if active.
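
A minimal sketch of the timestamp normalisation from item 1 and the two cleanup
passes from item 2. The dict shapes, the function names without the leading
underscore, and the UTC assumption for naive ISO strings are illustrative
guesses, not the observer's actual code; the try/except wrapping is omitted.

```python
from datetime import datetime, timezone

ORPHAN_GRACE_SECONDS = 5 * 60  # 5-min guard against the creation race


def parse_ts(ts) -> float:
    """Normalise int, float, or ISO-8601 string to epoch seconds; 0.0 on None/garbage."""
    if isinstance(ts, bool):  # bool is an int subclass; treat it as garbage
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        text = ts.strip()
        if text.endswith("Z"):  # fromisoformat() rejects "Z" before Python 3.11
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return dt.timestamp()
    return 0.0


def resolve_stale_incidents(services: dict, incidents: dict, now: float) -> list:
    """Resolve healthy-linked and orphaned active incidents; return their ids."""
    resolved, linked = [], set()

    def _resolve(inc_id):
        inc = incidents.get(inc_id)
        if inc and inc.get("status") == "active":
            inc["status"] = "resolved"
            inc["resolved_at"] = now
            resolved.append(inc_id)

    # Case 1: a healthy service must not keep pointing at an incident.
    for svc in services.values():
        inc_id = svc.get("incident_id")
        if not inc_id:
            continue
        if svc.get("status") == "healthy":
            svc["incident_id"] = None
            _resolve(inc_id)
        else:
            linked.add(inc_id)

    # Case 2: active incident with no service link, older than the grace period.
    for inc_id, inc in incidents.items():
        if inc.get("status") != "active" or inc_id in linked:
            continue
        if now - parse_ts(inc.get("last_occurrence")) > ORPHAN_GRACE_SECONDS:
            _resolve(inc_id)
    return resolved
```

Because parse_ts returns 0.0 for unusable values, an orphan with a missing or
garbled last_occurrence is treated as very old and gets resolved.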
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action
All 32 tests (11 new + 21 existing) pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:
1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
all bare open('w')+json.dump calls in supervisor and executor, so the
action-file pipeline is never visible in a half-written state.
2. Resilient loader: _load_actual_state now returns False when any world
state file fails to parse (empty or truncated mid-write). reconcile()
skips the entire drift check on False instead of treating {} as "all
services missing". actual_state retains its last-known-good values so
a single bad cycle does not wipe accumulated context.
Before: parse error → raw[key]={} → all desired services missing →
wall of redeploy actions → drift_resolved_auto churn on next cycle.
After: parse error → WARNING logged → cycle skipped → no actions.
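
The skip-cycle behaviour could be sketched roughly as follows. The WorldState
class, the paths mapping, and the shape of the returned actions are
assumptions made for the example; only the "return False, keep last-known-good,
skip the drift check" semantics come from this change.

```python
import json
import logging

log = logging.getLogger("supervisor")


class WorldState:
    def __init__(self, paths: dict):
        self.paths = paths          # e.g. {"services": ".../services.json"}
        self.actual_state = {}      # last-known-good values

    def load(self) -> bool:
        """Reload every file; return False if any of them fails to parse."""
        fresh = {}
        for key, path in self.paths.items():
            try:
                with open(path, encoding="utf-8") as fh:
                    fresh[key] = json.load(fh)
            except ValueError as exc:  # empty or truncated mid-write
                log.warning("skipping cycle, cannot parse %s: %s", path, exc)
                return False           # actual_state is left untouched
        self.actual_state = fresh
        return True


def reconcile(state: WorldState, desired: list) -> list:
    """Return redeploy actions for missing services, or nothing on a bad read."""
    if not state.load():
        return []  # skip the whole drift check instead of treating {} as truth
    services = state.actual_state.get("services", {})
    return [{"type": "redeploy", "service": name}
            for name in desired if name not in services]
```

The new state is swapped in only after every file has parsed, so one bad file
cannot partially overwrite the accumulated context.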
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.
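
A minimal sketch of the write, fsync, os.replace sequence, assuming the .tmp
sibling is simply the target path with ".tmp" appended. The real
_atomic_write_json may differ in naming and error handling.

```python
import contextlib
import json
import os


def atomic_write_json(path: str, data) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    tmp = f"{path}.tmp"
    try:
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())   # force the bytes to disk before the rename
        os.replace(tmp, path)       # atomic when tmp and path share a filesystem
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)          # do not leave a stray .tmp behind
        raise
```

The temporary file has to live in the same directory as the target, since
os.replace is only atomic within one filesystem.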
Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Lesson from brain-watchdog: code that was never run had a packaging bug
that caused a crash loop in production. New rule: docker build + short
smoke-run + pytest before any commit or deploy.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7 cases: package importable, fresh ok, stale, unreachable, HTTP error,
missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src
so tests run without PYTHONPATH set in the environment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WORKDIR is /app but the package lives under src/; without PYTHONPATH set
`python -m brain_watchdog.main` raised ModuleNotFoundError on startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Polls /summary on VPS over Tailscale every 60s; computes freshness
locally from last_update epoch (never trusts self-reported status).
Alerts via Telegram Bot API directly after 3 consecutive failures;
sends recovery message on heal. State (fail_count, alerted) persisted
to volume so debounce survives restarts.
- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries
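
The debounce described above can be sketched as a small state machine. The
180-second freshness limit, the step() function, and the "alert"/"recovered"
return values are assumptions for the example; only the 3-failure threshold
and the fail_count/alerted state fields come from this change.

```python
from dataclasses import dataclass

FAIL_THRESHOLD = 3        # consecutive failures before alerting
MAX_AGE_SECONDS = 180     # assumption: the real freshness limit is not stated


@dataclass
class WatchdogState:
    fail_count: int = 0
    alerted: bool = False


def is_fresh(last_update, now: float, max_age: float = MAX_AGE_SECONDS) -> bool:
    """Compute freshness locally from the epoch value, ignoring reported status."""
    if isinstance(last_update, bool) or not isinstance(last_update, (int, float)):
        return False
    return now - last_update <= max_age


def step(state: WatchdogState, ok: bool):
    """Advance the state by one poll; return "alert", "recovered", or None."""
    if ok:
        message = "recovered" if state.alerted else None
        state.fail_count = 0
        state.alerted = False
        return message
    state.fail_count += 1
    if state.fail_count >= FAIL_THRESHOLD and not state.alerted:
        state.alerted = True
        return "alert"
    return None
```

Persisting the two fields after every step (for example as JSON on the volume)
is what lets the debounce survive a restart.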
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- read risk_level with risk fallback (was: risk only → "unknown" for
all actions written by supervisor which uses risk_level key)
- include description field in alert format (was: alert_only payloads'
substance was invisible — description carried the full message)
- extract _format_pending_action() pure helper to enable unit testing
without a live Telegram connection
- 8 tests: risk_level present, risk fallback, both absent, description
shown/absent, truncation, full HA alert_only shape, no-description no-crash
- flagged during Phase 5 review of ha-diag-agent supervisor routing
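
A rough sketch of what such a pure helper might look like. The output layout,
the 200-character limit, and the "type" key are assumptions; only the
risk_level-then-risk fallback and the optional description come from this
change.

```python
MAX_DESCRIPTION_CHARS = 200  # assumption: the real truncation limit is not stated


def format_pending_action(action: dict) -> str:
    """Build the alert text for one pending action without touching Telegram."""
    # Supervisor writes risk_level; older payloads may only carry risk.
    risk = action.get("risk_level") or action.get("risk") or "unknown"
    lines = [
        f"Pending action: {action.get('type', 'unknown')}",
        f"Risk: {risk}",
    ]
    description = action.get("description")
    if description:
        text = str(description)
        if len(text) > MAX_DESCRIPTION_CHARS:
            text = text[: MAX_DESCRIPTION_CHARS - 3] + "..."
        lines.append(f"Description: {text}")
    return "\n".join(lines)
```

Keeping the formatter free of I/O is what allows the unit tests to run without
a live Telegram connection.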
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- persistent WS connection to HA with auth + state_changed subscription
- watchdog detects silence > 5min → emits ha_websocket_dead
- immediate ha_websocket_dead on disconnect, exponential reconnect with jitter
- cooldown prevents alert spam (10min repeat window while HA stays down)
- ha_websocket_recovered emitted on reconnect after a dead alert (allows
supervisor to clear active incidents in Phase 5)
- new monitors/ subpackage for long-running tasks (vs interval checks/)
- /health endpoint now includes ws_connected field
- 26 unit tests, 3 integration tests (real HA + container stop/restart)
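
The reconnect delay and alert cooldown could be sketched like this. The
"full jitter" variant, the 1s base, and the 60s cap are assumptions; the
10-minute repeat window is taken from the list above.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    exponent = min(max(attempt, 0), 30)   # keep 2**exponent bounded
    ceiling = min(cap, base * (2 ** exponent))
    return rng() * ceiling


class AlertCooldown:
    """Suppress repeat alerts inside a fixed window while HA stays down."""

    def __init__(self, window: float = 600.0):
        self.window = window
        self._last = None

    def should_emit(self, now: float) -> bool:
        if self._last is not None and now - self._last < self.window:
            return False
        self._last = now
        return True

    def reset(self) -> None:
        """Call on recovery so the next outage alerts immediately."""
        self._last = None
```

Passing rng as a parameter keeps the delay deterministic in unit tests.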
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dockerized ken + chelsty HA test instances with template fixtures
- snapshot/reset/wait scripts for fixture management
- integration test infrastructure with separate marker
- location_tag promoted from metadata to event payload (Phase 1 flag #3)
- chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- new per-host service, follows node-agent pattern
- 7 new HA event types defined (routing in supervisor — Phase 5)
- HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
- service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
- test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
litellm.acompletion() has base_url as a named param; api_base only works
via **kwargs fallback path. Switching to base_url ensures the value lands
correctly in completion_kwargs and reaches the ollama provider.
print() added (not logger) so base_url is always visible in docker logs
regardless of log level.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME,
COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local
/opt/homelab/config/planner-agent/.env via env_file.
Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret
injected at runtime by the operator when needed).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
services/agent-system/runtime-materializer/materializer.py:
- Add materialize_from_api() that fetches all world-state endpoints
from the control-plane HTTP API (CONTROL_PLANE_URL env var)
- When CONTROL_PLANE_URL is set, use API as source of truth instead of Redis
- Redis path preserved as fallback for backward compat
hosts/piha/runtime/agent-system/docker-compose.override.yml (new):
- Inject CONTROL_PLANE_URL=http://100.95.58.48:18180 for runtime-materializer
- piha webui /snapshot now mirrors VPS observer output (clean, ghost-free)
Root cause: materializer read from Redis which held 80 stale service entries
with hash-prefixed ghost keys (e.g. 0ccb8a88e079_control-plane-supervisor).
Redis is never updated by the current observer pipeline; the control-plane API
is the single authoritative world-state source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)
observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage
control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.
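
A sketch of the naming rule and the ghost-key safety net, written as plain
functions over dicts rather than Docker SDK objects. The function signatures
are assumptions; the label name and the <node>/<12hex>_<name> key shape come
from this change.

```python
import re

COMPOSE_SERVICE_LABEL = "com.docker.compose.service"
# Twelve lowercase hex characters, an underscore, then the real name.
HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_(?P<name>.+)$")


def canonical_container_name(name: str, labels: dict) -> str:
    """Prefer the compose service label; else strip a hash prefix from the name."""
    label = (labels or {}).get(COMPOSE_SERVICE_LABEL)
    if label:
        return label
    match = HASH_PREFIX.match(name)
    return match.group("name") if match else name


def prune_ghost_keys(services: dict) -> list:
    """Delete keys shaped like <node>/<12hex>_<name>; return what was removed."""
    ghosts = [
        key for key in services
        if "/" in key and HASH_PREFIX.match(key.split("/", 1)[1])
    ]
    for key in ghosts:
        del services[key]
    return ghosts
```

For example, a container named 0ccb8a88e079_control-plane-supervisor without a
compose label resolves to control-plane-supervisor.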
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.
Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.
Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
z2m migrates configuration.yaml on startup and needs write access.
Remove the separate :ro config mount; rely on the base compose's
/opt/homelab/data/zigbee2mqtt/data:/app/data read-write mount instead.
configuration.yaml must exist at that path on the node before first run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker-compose v1 cannot clear the ports list from the base compose with
ports: [] in an override, so network_mode: host caused InvalidArgument.
Use extra_hosts with host-gateway instead: maps 'mosquitto' hostname to the
Docker bridge gateway IP so mqtt://mosquitto:1883 reaches the host-networked
mosquitto process from within the bridge-networked z2m container.
Requires Docker 20.10+ (present on chelsty-infra).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker-compose v1 (1.29.2 on chelsty-infra) raises InvalidArgument when
network_mode: host is combined with port_bindings from the base compose file.
Add ports: [] in the override to clear the base ports list.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.
_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator
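
A simplified sketch of the cleanup pass. The directory layout under
actions_dir, the svc_key and status fields, and the function signature are
assumptions about the action schema, not the supervisor's actual code.

```python
import json
from pathlib import Path

CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def cancel_resolved_pending_actions(actions_dir: str, actual_services: dict) -> list:
    """Move pending actions for now-healthy services to cancelled/; return file names."""
    pending = Path(actions_dir) / "pending"
    cancelled = Path(actions_dir) / "cancelled"
    cancelled.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(pending.glob("*.json")):
        try:
            action = json.loads(path.read_text(encoding="utf-8"))
        except ValueError:
            continue  # unreadable file: leave it for the operator
        if action.get("type") not in CANCELLABLE_TYPES:
            continue
        service = actual_services.get(action.get("svc_key"), {})
        if service.get("status") != "healthy":
            continue
        action["status"] = "cancelled"
        action["reason"] = "drift_resolved_auto"
        (cancelled / path.name).write_text(json.dumps(action), encoding="utf-8")
        path.unlink()
        moved.append(path.name)
    return moved
```

Only the pending/ directory is scanned, so approved and running actions are
never touched by this pass.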
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>