Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.
When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously kept retrying on every cycle while the node's
checkpoint stayed pinned — all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.
Complementary to the atomic_write_json fix on state files; this addresses
the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
Port 8087 is no longer mapped to the host. Operator verify commands
that used curl http://localhost:8087/health now use docker exec with
Python's urllib (the image is python:3.11-slim, no curl binary).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port 8087 conflicted with zigbee2mqtt on piha (8087:8080 mapping active
for 7+ days), preventing ha-diag-agent from starting.
Grep across the full repo confirms no external consumer (no nginx/npm
proxy, no Prometheus scrape, no control-plane reference) uses this port.
The Docker healthcheck runs inside the container network namespace and
does not require a host-side mapping. Internal FastAPI binding on 8087
is unchanged.
Removed: ports section from docker-compose.yml and service.yaml.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- main.py: rename event= to ha_event= in _log.warning() — structlog treats
'event' as a reserved positional arg; the old name caused TypeError when
any check returned unhealthy results (events were still emitted, but the
check was logged as check_error instead of check_unhealthy)
- tests/test_ha_client.py: replace aioresponses with unittest.mock — aioresponses
0.7.8 is incompatible with aiohttp >=3.12 (missing stream_writer kwarg)
- pyproject.toml: remove aioresponses from dev dependencies
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stability-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.
- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) to retain docker.sock:ro access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.
- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) to retain docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
false sudo on every deploy.
- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) so executor retains docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.
Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits
When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo. If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.
Smoke-tested in Docker (ubuntu:22.04):
Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓
Case 2 (root-owned subdir): chown triggered ✓
Case 3 (700 dir perms): chmod triggered ✓
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two root causes for stale "active" incidents on the dashboard:
1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
can be an ISO-8601 string (stability-agent via events.py) or a Unix
int (node-agent). The previous session's auto-resolve did plain
`time.time() - last_occ` which raises TypeError for strings,
silently preventing _save_world() from being called and leaving
incidents perpetually "active" on disk.
Fix: add _parse_ts(ts) -> float that handles int, float, and
ISO-8601 strings uniformly. All timestamp arithmetic now goes through
it; returns 0.0 on None / garbage to keep comparisons safe.
2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
and marks the incident "resolved" in memory, but if incidents.json was
truncated mid-write (pre-atomic-write era), the observer loaded it at
next startup with status="active" and no service entry pointing to it.
No code ever touched these orphans again.
Fix: _prune_stale_world now runs two cleanup passes each cycle:
- Case 1 (healthy-linked): service.status=="healthy" AND incident_id
still set → resolve immediately (service cannot have active incident)
- Case 2 (orphaned): active incident with no service link AND
last_occurrence > 5 min ago → resolve (5-min guard for creation race)
Both cases are wrapped in try/except so a bug here never crashes the
observer loop or blocks _save_world.
Also fixes the 7-day stale-incident prune to use _parse_ts so
ISO-string resolved_at values are handled correctly.
3. Operator UI: current_incidents() now filters to status=="active" only.
Resolved incidents were previously included in the /incidents endpoint,
making the dashboard show a wall of historical records as if active.
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action
All 32 tests (11 new + 21 existing) pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:
1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
all bare open('w')+json.dump calls in supervisor and executor, so the
action-file pipeline is never visible in a half-written state.
2. Resilient loader: _load_actual_state now returns False when any world
state file fails to parse (empty or truncated mid-write). reconcile()
skips the entire drift check on False instead of treating {} as "all
services missing". actual_state retains its last-known-good values so
a single bad cycle does not wipe accumulated context.
Before: parse error → raw[key]={} → all desired services missing →
wall of redeploy actions → drift_resolved_auto churn on next cycle.
After: parse error → WARNING logged → cycle skipped → no actions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7 cases: package importable, fresh ok, stale, unreachable, HTTP error,
missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src
so tests run without PYTHONPATH set in the environment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WORKDIR is /app but the package lives under src/; without PYTHONPATH set
`python -m brain_watchdog.main` raised ModuleNotFoundError on startup.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Polls /summary on VPS over Tailscale every 60s; computes freshness
locally from last_update epoch (never trusts self-reported status).
Alerts via Telegram Bot API directly after 3 consecutive failures;
sends recovery message on heal. State (fail_count, alerted) persisted
to volume so debounce survives restarts.
- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- read risk_level with risk fallback (was: risk only → "unknown" for
all actions written by supervisor which uses risk_level key)
- include description field in alert format (was: alert_only payloads'
substance was invisible — description carried the full message)
- extract _format_pending_action() pure helper to enable unit testing
without a live Telegram connection
- 8 tests: risk_level present, risk fallback, both absent, description
shown/absent, truncation, full HA alert_only shape, no-description no-crash
- flagged during Phase 5 review of ha-diag-agent supervisor routing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- persistent WS connection to HA with auth + state_changed subscription
- watchdog detects silence > 5min → emits ha_websocket_dead
- immediate ha_websocket_dead on disconnect, exponential reconnect with jitter
- cooldown prevents alert spam (10min repeat window while HA stays down)
- ha_websocket_recovered emitted on reconnect after a dead alert (allows
supervisor to clear active incidents in Phase 5)
- new monitors/ subpackage for long-running tasks (vs interval checks/)
- /health endpoint now includes ws_connected field
- 26 unit tests, 3 integration tests (real HA + container stop/restart)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dockerized ken + chelsty HA test instances with template fixtures
- snapshot/reset/wait scripts for fixture management
- integration test infrastructure with separate marker
- location_tag promoted from metadata to event payload (Phase 1 flag #3)
- chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- new per-host service, follows node-agent pattern
- 7 new HA event types defined (routing in supervisor — Phase 5)
- HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
- service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
- test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
litellm.acompletion() has base_url as a named param; api_base only works
via **kwargs fallback path. Switching to base_url ensures the value lands
correctly in completion_kwargs and reaches the ollama provider.
Print() added (not logger) so base_url is always visible in docker logs
regardless of log level.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME,
COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local
/opt/homelab/config/planner-agent/.env via env_file.
Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret
injected at runtime by the operator when needed).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
services/agent-system/runtime-materializer/materializer.py:
- Add materialize_from_api() that fetches all world-state endpoints
from the control-plane HTTP API (CONTROL_PLANE_URL env var)
- When CONTROL_PLANE_URL is set, use API as source of truth instead of Redis
- Redis path preserved as fallback for backward compat
hosts/piha/runtime/agent-system/docker-compose.override.yml (new):
- Inject CONTROL_PLANE_URL=http://100.95.58.48:18180 for runtime-materializer
- piha webui /snapshot now mirrors VPS observer output (clean, ghost-free)
Root cause: materializer read from Redis which held 80 stale service entries
with hash-prefixed ghost keys (e.g. 0ccb8a88e079_control-plane-supervisor).
Redis is never updated by the current observer pipeline; the control-plane API
is the single authoritative world-state source.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)
observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage
control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.
Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.
_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Multiple service_healthy (or containers_not_running) events emitted in the
same second for different containers shared the same filename pattern
evt-{node}-{ts}-{type}.json — the second write silently overwrote the first,
so the observer only ever saw the last container checked per event type per cycle.
Fix: include a sanitized service name slug in the ID so every event gets a
unique file, e.g. evt-vps-1234-service_healthy-node-agent.json.
Also adds import re (required for re.sub in the slug generation).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- node_agent: emit service_healthy for all running managed containers so
observer populates services.json (previously empty → supervisor flooded
action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
format from observer checkpoint (was reading old last_processed_file key,
always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
without resolving incidents (unlike service_recovered which also resolves)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When ~/.ssh is mounted from the host oskar user into a container that
runs as root, OpenSSH rejects ~/.ssh/config with 'Bad owner or
permissions' because the file UID doesn't match the running process.
Add -F /dev/null to the rsync SSH command to skip the config file
entirely. Also add UserKnownHostsFile=/dev/null so no known_hosts
write is attempted into a potentially read-only mounted .ssh dir.
The key itself (/root/.ssh/id_rsa) is still read as an implicit
default identity and is not affected by -F.
Reproduces on chelsty-infra (has ~/.ssh/config); safe for all nodes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
Inside a Docker container this returns the 12-char container ID (e.g.
'be17cb6eb0f6'), not the host name. Observer ingested those events and
created ghost entries in world/nodes.json that never expired.
observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
from topology inventory; called on every run_once() cycle (both new-events
and idle paths). Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
so the Dashboard's System Overview cards show real numbers instead of undefined.
operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
as keyed dicts; the frontend calls .map() which requires an array. All four
functions now convert the dict to a properly-shaped list. Each item has the
fields the Nodes, Services, Topology, Deployments, and Correlation views expect
(hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
default 24). Without this, every event file ever written was returned,
including events from ghost-node deploys.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>