Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.
Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
Port 8087 is no longer mapped to the host. Operator verify commands
that used curl http://localhost:8087/health now use docker exec with
Python's urllib (the image is python:3.11-slim, no curl binary).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port 8087 conflicted with zigbee2mqtt on piha (8087:8080 mapping active
for 7+ days), preventing ha-diag-agent from starting.
Grep across the full repo confirms no external consumer (no nginx/npm
proxy, no Prometheus scrape, no control-plane reference) uses this port.
The Docker healthcheck runs inside the container network namespace and
does not require a host-side mapping. Internal FastAPI binding on 8087
is unchanged.
Removed: ports section from docker-compose.yml and service.yaml.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- main.py: rename event= to ha_event= in _log.warning() — structlog treats
'event' as a reserved positional arg; the old name caused TypeError when
any check returned unhealthy results (events were still emitted, but the
check was logged as check_error instead of check_unhealthy)
- tests/test_ha_client.py: replace aioresponses with unittest.mock — aioresponses
0.7.8 is incompatible with aiohttp >=3.12 (missing stream_writer kwarg)
- pyproject.toml: remove aioresponses from dev dependencies
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- persistent WS connection to HA with auth + state_changed subscription
- watchdog detects silence > 5min → emits ha_websocket_dead
- immediate ha_websocket_dead on disconnect, exponential reconnect with jitter
- cooldown prevents alert spam (10min repeat window while HA stays down)
- ha_websocket_recovered emitted on reconnect after a dead alert (allows
supervisor to clear active incidents in Phase 5)
- new monitors/ subpackage for long-running tasks (vs interval checks/)
- /health endpoint now includes ws_connected field
- 26 unit tests, 3 integration tests (real HA + container stop/restart)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dockerized ken + chelsty HA test instances with template fixtures
- snapshot/reset/wait scripts for fixture management
- integration test infrastructure with separate marker
- location_tag promoted from metadata to event payload (Phase 1 flag #3)
- chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- new per-host service, follows node-agent pattern
- 7 new HA event types defined (routing in supervisor — Phase 5)
- HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
- service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
- test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>