homelab-codex-ws/services/ha-diag-agent/docker-compose.yml
Oskar Kapala c255a021d1 fix(observer): quarantine malformed event files to prevent processing wedge
Was: malformed event (bad JSON / truncated / corrupted bytes) wedged the
node's checkpoint forever — every cycle re-tried, logged, never advanced
past the bad file; all subsequent good events for that node lost.

Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures themselves are logged but don't crash the loop.

Complementary to yesterday's atomic_write_json fix (state files);
this addresses the same race-pattern on event files instead.

Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
2026-06-12 11:22:56 +02:00

31 lines
790 B
YAML

services:
ha-diag-agent:
build: .
container_name: ha-diag-agent
restart: unless-stopped
env_file:
- /opt/homelab/config/ha-diag-agent/.env
ports:
- "8087:8087"
volumes:
# Events dir: host path includes node name; inside container always /events
- /opt/homelab/events/${NODE_NAME:-ha-diag}:/events
# SQLite baseline cache and check history
- /var/lib/ha-diag-agent:/data
# Optional YAML config (read-only)
- /opt/homelab/config/ha-diag-agent:/config:ro
healthcheck:
test:
- "CMD"
- "python"
- "-c"
- "import urllib.request; urllib.request.urlopen('http://localhost:8087/health', timeout=5)"
interval: 30s
timeout: 10s
retries: 3
start_period: 20s