homelab-codex-ws/services/brain-watchdog/service.yaml
Oskar Kapala 039f9f7247 feat(piha): brain-watchdog — external watchdog for control-plane
Polls /summary on VPS over Tailscale every 60s; computes freshness
locally from last_update epoch (never trusts self-reported status).
Alerts via Telegram Bot API directly after 3 consecutive failures;
sends recovery message on heal. State (fail_count, alerted) persisted
to volume so debounce survives restarts.
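
The freshness and debounce rules described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `State` dataclass, the `is_fresh` and `step` helpers, and the assumption that `/summary` returns a JSON object with a top-level numeric `last_update` field are all hypothetical. Only the field names `fail_count` and `alerted` and the defaults (600 s, 3 failures) come from the commit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class State:
    # Hypothetical in-memory mirror of the persisted state.json fields.
    fail_count: int = 0
    alerted: bool = False


def is_fresh(summary: dict, now: float, stale_threshold: int = 600) -> bool:
    """Decide freshness from the last_update epoch alone.

    Any self-reported field such as "status" is deliberately ignored.
    Assumes (not confirmed by the source) that last_update is a top-level
    number of seconds since the Unix epoch.
    """
    last_update = summary.get("last_update")
    if isinstance(last_update, bool) or not isinstance(last_update, (int, float)):
        return False  # a missing or malformed timestamp counts as a failure
    return (now - last_update) <= stale_threshold


def step(state: State, ok: bool, fails_before_alert: int = 3) -> Tuple[State, Optional[str]]:
    """Advance the debounce state by one poll.

    Returns the new state and an action: "alert", "recovery", or None.
    """
    if ok:
        # Only announce recovery if an alert was actually sent earlier.
        action = "recovery" if state.alerted else None
        return State(fail_count=0, alerted=False), action
    fail_count = state.fail_count + 1
    if fail_count >= fails_before_alert and not state.alerted:
        return State(fail_count=fail_count, alerted=True), "alert"
    # Below the threshold, or already alerted: stay quiet.
    return State(fail_count=fail_count, alerted=state.alerted), None
```

Keeping `step` a pure function means the caller can write the returned state to the volume after every poll, which is what lets the debounce counter survive a container restart.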

- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 17:54:36 +02:00


service:
  name: brain-watchdog
  owner_node: piha
  exposure: private
  description: >
    External watchdog for the control-plane on VPS. Queries /summary over
    Tailscale and alerts via Telegram Bot API directly — no dependency on the
    control-plane itself. Freshness is computed locally from last_update epoch.
  dependencies:
    - control-plane  # external — on VPS; deliberately untrusted for liveness
  healthcheck:
    type: docker
    interval: 60s
    timeout: 10s
    retries: 3
    start_period: 30s
  restart_policy: unless-stopped
  persistence:
    paths:
      - /data  # state.json: fail_count, alerted, last_ok
  runtime:
    env_vars:
      - CONTROL_PLANE_URL   # Tailscale IP + port of operator-ui (required)
      - STALE_THRESHOLD     # seconds before brain is considered stale (default: 600)
      - INTERVAL            # poll interval seconds (default: 60)
      - FAILS_BEFORE_ALERT  # consecutive failures before Telegram alert (default: 3)
      - TG_TOKEN            # Telegram Bot API token (required)
      - TG_CHAT_ID          # Telegram chat/user ID (required)
      - HEALTHCHECKS_URL    # optional healthchecks.io ping URL
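
The environment variables listed in the manifest could be read along these lines. This is a hedged sketch only: the `Config` tuple and `load_config` helper are hypothetical names, and the real service may parse its settings differently. The variable names, the required/optional split, and the defaults are taken from the comments in the manifest.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Config(NamedTuple):
    # Hypothetical container for the settings named in service.yaml.
    control_plane_url: str
    stale_threshold: int
    interval: int
    fails_before_alert: int
    tg_token: str
    tg_chat_id: str
    healthchecks_url: Optional[str]


REQUIRED = ("CONTROL_PLANE_URL", "TG_TOKEN", "TG_CHAT_ID")


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config from env vars, failing fast if a required one is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required env vars: " + ", ".join(missing))
    return Config(
        control_plane_url=env["CONTROL_PLANE_URL"],
        stale_threshold=int(env.get("STALE_THRESHOLD", "600")),
        interval=int(env.get("INTERVAL", "60")),
        fails_before_alert=int(env.get("FAILS_BEFORE_ALERT", "3")),
        tg_token=env["TG_TOKEN"],
        tg_chat_id=env["TG_CHAT_ID"],
        # An empty string is treated the same as unset.
        healthchecks_url=env.get("HEALTHCHECKS_URL") or None,
    )
```

Passing the environment in as a mapping keeps the loader testable without touching the real process environment.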