homelab-codex-ws/docs/sessions
Oskar Kapala c255a021d1 fix(observer): quarantine malformed event files to prevent processing wedge
Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.

Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
2026-06-12 11:22:56 +02:00
..
2026-05-27-planner-agent.md docs: add planner-agent docs and session summary 2026-05-27 2026-05-27 22:35:59 +02:00
2026-05-27.md docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs 2026-05-27 16:18:31 +02:00
2026-06-08-lustro-onboarding.md docs(backlog): observer staleness — dead node shows NOMINAL (heartbeat TTL) 2026-06-09 12:16:59 +02:00
2026-06-09-flota-recovery-lustro-register.md docs: session 2026-06-09 + skill/backlog update 2026-06-09 20:38:35 +02:00