Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.
When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously kept retrying on every cycle while the node's
checkpoint stayed pinned — all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.
Complementary to the atomic_write_json fix on state files; this addresses
the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
false sudo on every deploy.
- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
(GID 999 = docker group on VPS) so executor retains docker.sock access
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.
Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits
When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo. If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.
Smoke-tested in Docker (ubuntu:22.04):
Case 1 (1000:1000 + 775): chown skipped, chmod skipped ✓
Case 2 (root-owned subdir): chown triggered ✓
Case 3 (700 dir perms): chmod triggered ✓
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two root causes for stale "active" incidents on the dashboard:
1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
can be an ISO-8601 string (stability-agent via events.py) or a Unix
int (node-agent). The previous session's auto-resolve did plain
`time.time() - last_occ` which raises TypeError for strings,
silently preventing _save_world() from being called and leaving
incidents perpetually "active" on disk.
Fix: add _parse_ts(ts) -> float that handles int, float, and
ISO-8601 strings uniformly. All timestamp arithmetic now goes through
it; returns 0.0 on None / garbage to keep comparisons safe.
2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
and marks the incident "resolved" in memory, but if incidents.json was
truncated mid-write (pre-atomic-write era), the observer loaded it at
next startup with status="active" and no service entry pointing to it.
No code ever touched these orphans again.
Fix: _prune_stale_world now runs two cleanup passes each cycle:
- Case 1 (healthy-linked): service.status=="healthy" AND incident_id
still set → resolve immediately (service cannot have active incident)
- Case 2 (orphaned): active incident with no service link AND
last_occurrence > 5 min ago → resolve (5-min guard for creation race)
Both cases are wrapped in try/except so a bug here never crashes the
observer loop or blocks _save_world.
Also fixes the 7-day stale-incident prune to use _parse_ts so
ISO-string resolved_at values are handled correctly.
3. Operator UI: current_incidents() now filters to status=="active" only.
Resolved incidents were previously included in the /incidents endpoint,
making the dashboard show a wall of historical records as if active.
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action
All 32 tests (11 new + 21 existing) pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:
1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
all bare open('w')+json.dump calls in supervisor and executor, so the
action-file pipeline is never visible in a half-written state.
2. Resilient loader: _load_actual_state now returns False when any world
state file fails to parse (empty or truncated mid-write). reconcile()
skips the entire drift check on False instead of treating {} as "all
services missing". actual_state retains its last-known-good values so
a single bad cycle does not wipe accumulated context.
Before: parse error → raw[key]={} → all desired services missing →
wall of redeploy actions → drift_resolved_auto churn on next cycle.
After: parse error → WARNING logged → cycle skipped → no actions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.
Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.
_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
Inside a Docker container this returns the 12-char container ID (e.g.
'be17cb6eb0f6'), not the host name. Observer ingested those events and
created ghost entries in world/nodes.json that never expired.
observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
from topology inventory; called on every run_once() cycle (both new-events
and idle paths). Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
so the Dashboard's System Overview cards show real numbers instead of undefined.
operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
as keyed dicts; the frontend calls .map() which requires an array. All four
functions now convert the dict to a properly-shaped list. Each item has the
fields the Nodes, Services, Topology, Deployments, and Correlation views expect
(hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
default 24). Without this, every event file ever written was returned,
including events from ghost-node deploys.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>