Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.
Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
Port 8087 conflicted with zigbee2mqtt on piha (8087:8080 mapping active
for 7+ days), preventing ha-diag-agent from starting.
Grep across the full repo confirms no external consumer (no nginx/npm
proxy, no Prometheus scrape, no control-plane reference) uses this port.
The Docker healthcheck runs inside the container network namespace and
does not require a host-side mapping. Internal FastAPI binding on 8087
is unchanged.
Removed: ports section from docker-compose.yml and service.yaml.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- new per-host service, follows node-agent pattern
- 7 new HA event types defined (routing in supervisor — Phase 5)
- HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
- service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
- test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>