homelab-codex-ws/services/ha-diag-agent/env.example
Oskar Kapala c255a021d1 fix(observer): quarantine malformed event files to prevent processing wedge
Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.

Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
2026-06-12 11:22:56 +02:00

28 lines
825 B
Plaintext

# ha-diag-agent environment variables
# Copy to /opt/homelab/config/ha-diag-agent/.env on the target node
# Home Assistant connection (required)
HA_URL=http://homeassistant.local:8123
HA_TOKEN=your-long-lived-token-here
HA_TIMEOUT=10.0
# Node identity
NODE_NAME=piha
LOCATION_TAG=ken
# Check intervals (seconds)
CHECK_INTERVAL=60 # heartbeat check
CHECK_INTERVAL_UNAVAILABLE=3600 # entity availability check (1h)
# Unavailable entities thresholds
UNAVAILABLE_THRESHOLD_HOURS=24 # alert after N hours unavailable
INTEGRATION_FAILURE_THRESHOLD_PCT=0.5 # fraction of integration entities
INTEGRATION_FAILURE_MIN_ENTITIES=3 # minimum count for integration event
ALERT_COOLDOWN_HOURS=6 # suppress re-alert within N hours
# API server
PORT=8087
# Logging: debug, info, warning, error
LOG_LEVEL=info