homelab-codex-ws

Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve

Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be
   an ISO-8601 string (stability-agent via events.py) or a Unix int
   (node-agent). The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings, silently
   preventing _save_world() from being called and leaving incidents
   perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601
   strings uniformly. All timestamp arithmetic now goes through it; returns
   0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears
   service["incident_id"] and marks the incident "resolved" in memory, but if
   incidents.json was truncated mid-write (pre-atomic-write era), the
   observer loaded it at next startup with status="active" and no service
   entry pointing to it. No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string
   resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush); the
supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-06-03 14:29:12 +02:00
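The timestamp normalisation and the two cleanup cases described in the commit message above could look roughly like the sketch below. This is not the repository's code: the function names parse_ts and incidents_to_resolve, the dict shapes for services and incidents, and the choice to treat timezone-less ISO strings as UTC are all assumptions made for illustration. Only the behaviour (int, float, or ISO-8601 in; float out; 0.0 on None or garbage; healthy-linked and orphaned cases; 5-minute guard) comes from the commit message.

```python
import time
from datetime import datetime, timezone

ORPHAN_GRACE_SECONDS = 5 * 60  # 5-minute guard against the creation race


def parse_ts(ts) -> float:
    """Normalise an int, float, or ISO-8601 string to Unix seconds.

    Returns 0.0 for None or unparseable input so comparisons stay safe.
    """
    if isinstance(ts, bool):  # bool is a subclass of int; treat it as garbage
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        text = ts.strip()
        if text.endswith("Z"):  # older Pythons reject a trailing "Z"
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return dt.timestamp()
    return 0.0


def incidents_to_resolve(services: dict, incidents: dict, now=None) -> set:
    """Return the ids of incidents that the two cleanup passes would resolve.

    services:  hypothetical {name: {"status": ..., "incident_id": ...}}
    incidents: hypothetical {id: {"status": ..., "last_occurrence": ...}}
    """
    now = time.time() if now is None else now
    linked, resolve = set(), set()

    # Case 1 (healthy-linked): a healthy service must not keep an incident.
    for svc in services.values():
        inc_id = svc.get("incident_id")
        if not inc_id:
            continue
        linked.add(inc_id)
        if svc.get("status") == "healthy":
            resolve.add(inc_id)

    # Case 2 (orphaned): active, no service link, quiet for over 5 minutes.
    for inc_id, inc in incidents.items():
        if inc.get("status") != "active" or inc_id in linked:
            continue
        if now - parse_ts(inc.get("last_occurrence")) > ORPHAN_GRACE_SECONDS:
            resolve.add(inc_id)

    return resolve
```

Because parse_ts maps unparseable values to 0.0, an orphaned incident with a corrupt last_occurrence counts as very old in this sketch and is resolved rather than kept. Returning a set of ids instead of mutating state keeps the decision easy to test separately from whatever persists the result.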
agent-system	fix(dashboard): read last_update from JSON content, not file mtime	2026-05-31 22:10:50 +02:00
brain-watchdog	test(brain-watchdog): add pytest suite covering import and check() logic	2026-06-01 20:38:24 +02:00
control-plane	fix(observer): robust incident lifecycle + orphan auto-resolve	2026-06-03 14:29:12 +02:00
forgejo	Add node capability model	2026-05-11 20:46:50 +02:00
ha-diag-agent	feat(control-plane): shadow_mode for HA event auto-actions + deploy docs	2026-05-29 17:12:33 +02:00
mosquitto	Implement filesystem-first runtime event system	2026-05-12 13:38:25 +02:00
node-agent	Fix ghost service keys from hash-prefixed Docker container names	2026-05-27 15:41:13 +02:00
node_exporter	Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring	2026-05-27 15:10:48 +02:00
npm	Add node capability model	2026-05-11 20:46:50 +02:00
ollama	Add node capability model	2026-05-11 20:46:50 +02:00
planner-agent	fix+debug(planner-agent): use base_url (not api_base) for litellm.acompletion, add print [TEMP]	2026-05-28 13:07:58 +02:00
stability-agent	Fix stability agent fleet deploy scripts	2026-05-17 21:09:06 +02:00
zigbee2mqtt	docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs	2026-05-29 14:17:23 +02:00
.gitkeep	Add infrastructure standards and deployment conventions	2026-05-07 21:16:03 +02:00