homelab-codex-ws/services
Oskar Kapala c9ee8eb06d fix(observer): quarantine malformed event files to prevent processing wedge
Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.

When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously retried it on every cycle while the node's
checkpoint stayed pinned, so all subsequent good events for that node were lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.
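
A minimal sketch of the quarantine step described above, not the actual observer code. The function names (quarantine_event, load_event), the logger name, and the ".N" collision suffix are assumptions for illustration; only the os.replace call and the STATE_DIR/observer_failed_events/<node>/ layout come from the commit message.

```python
import json
import logging
import os
from pathlib import Path
from typing import Optional

log = logging.getLogger("observer")  # hypothetical logger name


def quarantine_event(event_path: Path, state_dir: Path, node: str) -> Optional[Path]:
    """Move a malformed event file out of the hot path.

    Returns the quarantine path, or None if the move failed.
    """
    dest_dir = state_dir / "observer_failed_events" / node
    try:
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / event_path.name
        suffix = 1
        # Collision handling (assumed scheme): never overwrite an earlier
        # quarantined file; append .1, .2, ... before the extension instead.
        while dest.exists():
            dest = dest_dir / f"{event_path.stem}.{suffix}{event_path.suffix}"
            suffix += 1
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(event_path, dest)
        return dest
    except OSError as exc:
        # Log and carry on; a failed move must not crash the observer loop.
        log.warning("could not quarantine %s: %s", event_path, exc)
        return None


def load_event(event_path: Path, state_dir: Path, node: str) -> Optional[dict]:
    """Parse one event file; quarantine it on the first parse failure."""
    try:
        with open(event_path, encoding="utf-8") as fh:
            return json.load(fh)
    except ValueError as exc:  # covers JSONDecodeError and UnicodeDecodeError
        log.warning("malformed event %s: %s", event_path, exc)
        quarantine_event(event_path, state_dir, node)
        return None
```

In this sketch a caller would treat a None result as "skip this file and keep going", which is what lets the checkpoint advance past the bad event.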

Complementary to the atomic_write_json fix on state files; this addresses
the same race pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
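
The assertions listed above could be exercised by a test shaped roughly like the following. This is a self-contained toy, not the repository's regression test: run_cycle, the state dict, the "status" field, and the file-name-ordered checkpoint are all assumptions made so the example can run on its own.

```python
import json
import os
import tempfile
from pathlib import Path


def run_cycle(events_dir: Path, failed_dir: Path, state: dict) -> None:
    """Toy observer cycle: process event files in name order."""
    for path in sorted(events_dir.glob("*.json")):
        if path.name <= state.get("checkpoint", ""):
            continue  # already processed in an earlier cycle
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
        except ValueError:
            # Quarantine the poison file and move on instead of retrying it.
            failed_dir.mkdir(parents=True, exist_ok=True)
            os.replace(path, failed_dir / path.name)
            continue
        state["online"] = event.get("status") == "online"
        state["checkpoint"] = path.name


def test_bad_event_does_not_wedge_node() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        events = root / "events" / "node1"
        failed = root / "observer_failed_events" / "node1"
        events.mkdir(parents=True)
        (events / "0001.json").write_text("{truncated", encoding="utf-8")
        (events / "0002.json").write_text('{"status": "online"}', encoding="utf-8")

        state: dict = {}
        run_cycle(events, failed, state)

        assert (failed / "0001.json").exists()       # bad event quarantined
        assert not (events / "0001.json").exists()   # removed from hot path
        assert state["online"] is True               # good event processed
        assert state["checkpoint"] == "0002.json"    # checkpoint on good event
```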
2026-06-12 13:11:15 +02:00
Name | Last commit | Date
agent-system | fix(dashboard): read last_update from JSON content, not file mtime | 2026-05-31 22:10:50 +02:00
brain-watchdog | test(brain-watchdog): add pytest suite covering import and check() logic | 2026-06-01 20:38:24 +02:00
control-plane | fix(observer): quarantine malformed event files to prevent processing wedge | 2026-06-12 13:11:15 +02:00
forgejo | Add node capability model | 2026-05-11 20:46:50 +02:00
ha-diag-agent | docs(ha-diag-agent): replace curl verify commands with docker exec | 2026-06-11 19:46:33 +02:00
mosquitto | Implement filesystem-first runtime event system | 2026-05-12 13:38:25 +02:00
node-agent | fix(node-agent): run as uid 1000 with docker group access | 2026-06-03 18:20:31 +02:00
node_exporter | Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring | 2026-05-27 15:10:48 +02:00
npm | Add node capability model | 2026-05-11 20:46:50 +02:00
ollama | Add node capability model | 2026-05-11 20:46:50 +02:00
planner-agent | fix+debug(planner-agent): use base_url (not api_base) for litellm.acompletion, add print [TEMP] | 2026-05-28 13:07:58 +02:00
stability-agent | fix(stability-agent): run as uid 1000 with docker group access | 2026-06-03 18:20:54 +02:00
zigbee2mqtt | docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs | 2026-05-29 14:17:23 +02:00
.gitkeep | Add infrastructure standards and deployment conventions | 2026-05-07 21:16:03 +02:00