Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.
Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
Port 8087 is no longer mapped to the host. Operator verify commands
that used curl http://localhost:8087/health now use docker exec with
Python's urllib (the image is python:3.11-slim, no curl binary).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>