Polls /summary on VPS over Tailscale every 60s; computes freshness locally from last_update epoch (never trusts self-reported status). Alerts via Telegram Bot API directly after 3 consecutive failures; sends recovery message on heal. State (fail_count, alerted) persisted to volume so debounce survives restarts.

- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
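The freshness check and alert debounce described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the names `State`, `is_fresh`, and `step` are hypothetical, and the 600-second threshold is assumed from the manifest's documented default.

```python
import time
from dataclasses import dataclass
from typing import Optional

STALE_THRESHOLD = 600   # seconds; assumed from the manifest's documented default
FAILS_BEFORE_ALERT = 3  # consecutive failures before an alert is sent


@dataclass
class State:
    """Debounce state; the commit message says this is persisted to a volume."""
    fail_count: int = 0
    alerted: bool = False


def is_fresh(last_update: float, now: Optional[float] = None,
             threshold: float = STALE_THRESHOLD) -> bool:
    """Judge freshness from the epoch timestamp alone, ignoring any status field."""
    if now is None:
        now = time.time()
    return (now - last_update) <= threshold


def step(state: State, ok: bool,
         fails_before_alert: int = FAILS_BEFORE_ALERT) -> Optional[str]:
    """Advance the state by one poll and return 'alert', 'recovery', or None."""
    if ok:
        recovered = state.alerted
        state.fail_count = 0
        state.alerted = False
        # Only announce recovery if an alert was actually sent earlier.
        return "recovery" if recovered else None
    state.fail_count += 1
    if state.fail_count >= fails_before_alert and not state.alerted:
        state.alerted = True
        return "alert"
    return None
```

In this sketch a failed poll is either an unreachable endpoint or a stale `last_update`; the caller would pass `ok=False` in both cases, so an alert fires once on the third consecutive failure and is not repeated until after a recovery.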
35 lines
1.2 KiB
YAML
service:
  name: brain-watchdog
  owner_node: piha
  exposure: private
  description: >
    External watchdog for the control-plane on VPS. Queries /summary over
    Tailscale and alerts via Telegram Bot API directly, with no dependency on the
    control-plane itself. Freshness is computed locally from last_update epoch.

dependencies:
  - control-plane  # external (on VPS); deliberately untrusted for liveness

healthcheck:
  type: docker
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 30s

restart_policy: unless-stopped

persistence:
  paths:
    - /data  # state.json: fail_count, alerted, last_ok

runtime:
  env_vars:
    - CONTROL_PLANE_URL   # Tailscale IP + port of operator-ui (required)
    - STALE_THRESHOLD     # seconds before brain is considered stale (default: 600)
    - INTERVAL            # poll interval seconds (default: 60)
    - FAILS_BEFORE_ALERT  # consecutive failures before Telegram alert (default: 3)
    - TG_TOKEN            # Telegram Bot API token (required)
    - TG_CHAT_ID          # Telegram chat/user ID (required)
    - HEALTHCHECKS_URL    # optional healthchecks.io ping URL
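One way the environment variables listed in the manifest might be read at startup is sketched below. The `Config` tuple and `load_config` function are hypothetical names for illustration only; the defaults and the required/optional split are taken from the manifest comments, and the real service may parse them differently.

```python
import os
from typing import Mapping, NamedTuple, Optional

REQUIRED = ("CONTROL_PLANE_URL", "TG_TOKEN", "TG_CHAT_ID")


class Config(NamedTuple):
    control_plane_url: str
    tg_token: str
    tg_chat_id: str
    stale_threshold: int
    interval: int
    fails_before_alert: int
    healthchecks_url: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config from environment variables, failing fast on missing ones."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise ValueError("missing required env vars: " + ", ".join(missing))
    return Config(
        control_plane_url=env["CONTROL_PLANE_URL"],
        tg_token=env["TG_TOKEN"],
        tg_chat_id=env["TG_CHAT_ID"],
        # Defaults mirror the values documented in the manifest comments.
        stale_threshold=int(env.get("STALE_THRESHOLD", "600")),
        interval=int(env.get("INTERVAL", "60")),
        fails_before_alert=int(env.get("FAILS_BEFORE_ALERT", "3")),
        # An unset or empty value means the optional ping is disabled.
        healthchecks_url=env.get("HEALTHCHECKS_URL") or None,
    )
```

Passing the environment as a mapping keeps the function testable without touching the process environment.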