Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the node's checkpoint forever — every cycle re-tried, logged, never advanced past the bad file; all subsequent good events for that node lost. Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/ with collision handling. Checkpoint advances, downstream events flow. Move failures themselves are logged but don't crash the loop. Complementary to yesterday's atomic_write_json fix (state files); this addresses the same race-pattern on event files instead. Regression test asserts: bad event quarantined to failed_events dir, removed from hot path, subsequent good event processed (node online), checkpoint moves to good event.
31 lines
790 B
YAML
31 lines
790 B
YAML
services:
|
|
ha-diag-agent:
|
|
build: .
|
|
container_name: ha-diag-agent
|
|
restart: unless-stopped
|
|
|
|
env_file:
|
|
- /opt/homelab/config/ha-diag-agent/.env
|
|
|
|
ports:
|
|
- "8087:8087"
|
|
|
|
volumes:
|
|
# Events dir: host path includes node name; inside container always /events
|
|
- /opt/homelab/events/${NODE_NAME:-ha-diag}:/events
|
|
# SQLite baseline cache and check history
|
|
- /var/lib/ha-diag-agent:/data
|
|
# Optional YAML config (read-only)
|
|
- /opt/homelab/config/ha-diag-agent:/config:ro
|
|
|
|
healthcheck:
|
|
test:
|
|
- "CMD"
|
|
- "python"
|
|
- "-c"
|
|
- "import urllib.request; urllib.request.urlopen('http://localhost:8087/health', timeout=5)"
|
|
interval: 30s
|
|
timeout: 10s
|
|
retries: 3
|
|
start_period: 20s
|