homelab-codex-ws/services/ha-diag-agent/service.yaml
Oskar Kapala c255a021d1 fix(observer): quarantine malformed event files to prevent processing wedge
Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.

Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
2026-06-12 11:22:56 +02:00

43 lines
1.4 KiB
YAML

service:
name: ha-diag-agent
# Deployed per-host: piha (site: ken) and chelsty-infra (site: chelsty)
owner_node: per-host
exposure: local-only
monitor: true
dependencies:
- homeassistant
ports:
- 8087
healthcheck:
type: http
path: /health
interval: 30s
timeout: 10s
retries: 3
start_period: 20s
restart_policy: unless-stopped
persistence:
paths:
- /opt/homelab/events
- /var/lib/ha-diag-agent
runtime:
env_vars:
- HA_TOKEN # long-lived HA access token (required)
- HA_URL # http://homeassistant.local:8123
- NODE_NAME # canonical node name: piha, chelsty-infra
- LOCATION_TAG # human site label: ken, chelsty
- CHECK_INTERVAL # heartbeat interval seconds (default: 60)
- CHECK_INTERVAL_UNAVAILABLE # entity check interval seconds (default: 3600)
- UNAVAILABLE_THRESHOLD_HOURS # alert threshold (default: 24)
- INTEGRATION_FAILURE_THRESHOLD_PCT # fraction threshold (default: 0.5)
- INTEGRATION_FAILURE_MIN_ENTITIES # min count for integration event (default: 3)
- ALERT_COOLDOWN_HOURS # re-alert suppression (default: 6)
- PORT # FastAPI port (default: 8087)
- LOG_LEVEL # default: info