homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	c255a021d1	fix(observer): quarantine malformed event files to prevent processing wedge. Was: a malformed event (bad JSON, truncated, or corrupted bytes) wedged the node's checkpoint forever; every cycle retried, logged, and never advanced past the bad file, so all subsequent good events for that node were lost. Now: the first parse failure triggers an atomic os.replace to STATE_DIR/observer_failed_events/<node>/ with collision handling. The checkpoint advances and downstream events flow. Move failures themselves are logged but don't crash the loop. Complementary to yesterday's atomic_write_json fix (state files); this addresses the same race pattern on event files instead. Regression test asserts: bad event quarantined to failed_events dir, removed from hot path, subsequent good event processed (node online), checkpoint moves to good event.	2026-06-12 11:22:56 +02:00
Oskar Kapala	fa59625aa6	docs(ha-diag-agent): replace curl verify commands with docker exec. Port 8087 is no longer mapped to the host. Operator verify commands that used curl http://localhost:8087/health now use docker exec with Python's urllib (the image is python:3.11-slim, no curl binary). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 19:46:33 +02:00
Oskar Kapala	d7e0d3162f	fix(ha-diag-agent): remove host port mapping for 8087. Port 8087 conflicted with zigbee2mqtt on piha (8087:8080 mapping active for 7+ days), preventing ha-diag-agent from starting. Grep across the full repo confirms no external consumer (no nginx/npm proxy, no Prometheus scrape, no control-plane reference) uses this port. The Docker healthcheck runs inside the container network namespace and does not require a host-side mapping. Internal FastAPI binding on 8087 is unchanged. Removed: ports section from docker-compose.yml and service.yaml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 19:46:28 +02:00
Oskar Kapala	5e9db5c106	fix(ha-diag-agent): structlog event kwarg collision + replace aioresponses. main.py: rename event= to ha_event= in _log.warning(); structlog treats 'event' as a reserved positional arg, and the old name caused a TypeError when any check returned unhealthy results (events were still emitted, but the check was logged as check_error instead of check_unhealthy). tests/test_ha_client.py: replace aioresponses with unittest.mock; aioresponses 0.7.8 is incompatible with aiohttp >=3.12 (missing stream_writer kwarg). pyproject.toml: remove aioresponses from dev dependencies. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 14:10:06 +02:00
Oskar Kapala	d60b28a949	feat(ha-diag-agent): add piha deploy config. hosts/piha/runtime/ha-diag-agent/docker-compose.override.yml: mem_limit 128m, hardcoded events volume (/opt/homelab/events/piha:/events) to avoid ${NODE_NAME} shell-expansion issue in deploy-node.sh. services/ha-diag-agent/env.example: per-host HA_URL comments (piha vs chelsty-infra tailscale), HA_TOKEN source note. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 14:10:06 +02:00
Oskar Kapala	52607a7cdd	feat(control-plane): shadow_mode for HA event auto-actions + deploy docs. HA_DIAG_SHADOW_MODE env flag in supervisor (default true). shadow_mode downgrades container_restart actions to alert_only with a [SHADOW MODE] note; the same action_id and 30-min cooldown apply. alert_only events are unaffected (always routed normally). 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected. DEPLOY.md with token gen, per-host config, verification, 48h observation, production-mode enablement, rollback. README.md updated with shadow mode flag summary and DEPLOY.md link. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 17:12:33 +02:00
Oskar Kapala	31b48d162a	feat(ha-diag-agent): WebSocketMonitor for real-time HA liveness. Persistent WS connection to HA with auth + state_changed subscription. Watchdog detects silence > 5min → emits ha_websocket_dead. Immediate ha_websocket_dead on disconnect, exponential reconnect with jitter. Cooldown prevents alert spam (10min repeat window while HA stays down). ha_websocket_recovered emitted on reconnect after a dead alert (allows supervisor to clear active incidents in Phase 5). New monitors/ subpackage for long-running tasks (vs interval checks/). /health endpoint now includes ws_connected field. 26 unit tests, 3 integration tests (real HA + container stop/restart). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 15:00:18 +02:00
Oskar Kapala	3499b2f280	feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes. New checks: SystemHealthCheck (15min interval): detects newly-failing HA integrations via /api/system_health snapshot diff; transition-based dedup (ok→error fires, sustained error silent, error→ok clears alert). UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available events with 7-day dedup; release notes truncated at 2000 chars. UpdatesDigestCheck (Sunday cron 09:00): single digest event with all pending updates; weekly ISO-week dedup, independent of daily dedup key. AutomationFailuresCheck (30min interval): detects automations with N consecutive failures (default 3) via /api/trace/automation/<id>; 6h cooldown per automation. Phase 3 flag fixes: Flag #1 (since field): UnavailableEntitiesCheck now uses min(state.last_changed, baseline.first_seen) as effective "since", giving accurate duration when agent was offline at entity's first failure. Flag #3 (registry cache): HAClient.get_entity_registry() caches response in-process with configurable TTL (default 300s); avoids repeated API calls across concurrent check cycles; invalidate_registry_cache() for manual invalidation. Storage: system_health_snapshot table (component, last_status, last_seen_at, payload) created automatically on next Storage.open() call. Config additions (all with defaults): entity_registry_cache_ttl=300, system_health_check_interval=900, automation_check_interval=1800, automation_failure_threshold=3, updates_check_hour=9, updates_check_minute=0, updates_cooldown_days=7. Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new); 3 skipped (live-HA token not set in CI). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 14:43:10 +02:00
Oskar Kapala	20f6761a67	feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup. Shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed): make_session() factory, session injected at startup, closed on shutdown. Check.run() → list[CheckResult]: clean multi-event interface. First real diagnostic check: entity unavailable > 24h (INSERT OR IGNORE baseline preserves first-seen timestamp). Root cause grouping: emit ha_integration_failed instead of N entity events when ≥50% of integration's entities are unavailable (≥3 min). Alert deduplication via SQLite cooldown window (default 6h). Recovery clears baseline + dedup for immediate re-alert. Configurable thresholds: duration, integration %, cooldown. 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 13:41:55 +02:00
Oskar Kapala	07bd498fd6	feat(ha-diag-agent): test environment with dual HA Docker instances. Dockerized ken + chelsty HA test instances with template fixtures. Snapshot/reset/wait scripts for fixture management. Integration test infrastructure with separate marker. location_tag promoted from metadata to event payload (Phase 1 flag #3). chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:56:13 +02:00
Oskar Kapala	90c8e77bf7	chore: gitignore *.egg-info, remove committed egg-info. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:57 +02:00
Oskar Kapala	ab8895d28b	feat(ha-diag-agent): scaffold service with HA REST client and event emitter. New per-host service, follows node-agent pattern. 7 new HA event types defined (routing in supervisor: Phase 5). HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead). service.yaml + host configs for piha (ken) and chelsty-infra (chelsty). Test scaffolding with aiohttp/aiosqlite mocks (15/15 passing). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-29 12:26:34 +02:00
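
Commit 3499b2f280 describes transition-based dedup for SystemHealthCheck: ok→error fires, a sustained error stays silent, and error→ok clears the alert. A minimal sketch of that rule follows, assuming each snapshot is a plain mapping of component name to an "ok"/"error" status. The function name diff_health and the dict shape are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Tuple


def diff_health(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare two health snapshots and return (fired, cleared) component names.

    Hypothetical helper: the real check diffs /api/system_health snapshots
    stored in SQLite; here both snapshots are plain dicts.
    """
    # Fire only on a transition into "error"; a component that was already
    # in error in the previous snapshot stays silent.
    fired = sorted(
        name
        for name, status in current.items()
        if status == "error" and previous.get(name) != "error"
    )
    # Clear only on a transition out of "error".
    cleared = sorted(
        name
        for name, status in current.items()
        if status != "error" and previous.get(name) == "error"
    )
    return fired, cleared
```

With previous = {"mqtt": "ok", "zha": "error", "hue": "error"} and current = {"mqtt": "error", "zha": "error", "hue": "ok"}, only mqtt fires and only hue clears; zha stays silent because its error is sustained.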

12 commits
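
The quarantine behaviour described in commit c255a021d1 can be sketched roughly as below. This is an illustration under stated assumptions, not the observer's actual code: the function names quarantine and process_events, the *.json glob, and the numeric-suffix collision scheme are all hypothetical, and it assumes a single observer process and a quarantine directory on the same filesystem as the events (so os.replace is atomic).

```python
import json
import os
from pathlib import Path
from typing import List, Optional


def quarantine(event_path: Path, failed_dir: Path) -> Optional[Path]:
    """Move a bad event file out of the hot path; return its new location."""
    try:
        failed_dir.mkdir(parents=True, exist_ok=True)
        target = failed_dir / event_path.name
        suffix = 0
        # Collision handling (assumed scheme): append .1, .2, ... so an
        # earlier quarantined file with the same name is not overwritten.
        while target.exists():
            suffix += 1
            target = failed_dir / f"{event_path.name}.{suffix}"
        os.replace(event_path, target)
        return target
    except OSError as exc:
        # A failed move is reported but must not crash the processing loop.
        print(f"quarantine failed for {event_path}: {exc}")
        return None


def process_events(event_dir: Path, failed_dir: Path) -> List[dict]:
    """Parse events in name order, quarantining any file that fails to parse."""
    parsed = []
    for path in sorted(event_dir.glob("*.json")):
        try:
            parsed.append(json.loads(path.read_text(encoding="utf-8")))
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Bad JSON, truncated content, or corrupted bytes: move the file
            # aside and continue so later good events are still processed.
            quarantine(path, failed_dir)
    return parsed
```

Because the bad file leaves the event directory on the first failure, the next cycle never sees it again, which is what lets the checkpoint advance to later events.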