ha-diag-agent Deployment Guide
Section 1: Prerequisites
HA long-lived access token
The agent authenticates to Home Assistant with a long-lived token issued by a dedicated service account. Do not use a personal admin token.
- In HA: Settings → People → Add Person
  - Name: diag_agent
  - Do not add to any group (no admin rights needed)
- Log in to HA as diag_agent
  - Go to Profile → Long-Lived Access Tokens → Create token
  - Name: ha-diag-agent
  - Copy the token; it is shown only once
- Store the token in the node's .env file (see Section 2)
Tailnet reachability check (chelsty-infra only)
chelsty-infra reaches Home Assistant on chelsty-ha over Tailscale.
Verify before deploying:
curl -sf http://100.70.180.90:8123/api/ \
-H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}
If the request times out, check that both nodes are on the Tailscale mesh
(tailscale status) and that chelsty-ha is powered on.
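The manual check above can be wrapped in a small helper so it fails loudly in a script. This is a minimal sketch: ha_api_ok and check_ha_reachable are hypothetical names, not part of ha-diag-agent, and the 10-second timeout is an arbitrary choice.

```shell
# Sketch only: helper names are illustrative, not part of ha-diag-agent.

# ha_api_ok BODY: succeed only if the JSON body reports "API running."
ha_api_ok() {
  printf '%s' "$1" |
    grep -Eq '"message"[[:space:]]*:[[:space:]]*"API running\."'
}

# check_ha_reachable URL TOKEN: fetch /api/ with a timeout, then validate the body.
# curl -f turns HTTP errors (for example a rejected token) into a non-zero exit.
check_ha_reachable() {
  ha_body=$(curl -sf --max-time 10 "$1/api/" \
    -H "Authorization: Bearer $2") || return 1
  ha_api_ok "$ha_body"
}
```

For example, check_ha_reachable http://100.70.180.90:8123 "$HA_TOKEN" returns non-zero on a timeout, an HTTP error, or an unexpected body.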
Section 2: Per-host config
Create /opt/homelab/config/ha-diag-agent/.env on each target node:
piha
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
chelsty-infra
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
If chelsty-ha gets a new Tailscale IP, update HA_URL in this file and restart the container.
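Before deploying, it may be worth checking that the .env file is complete. The sketch below assumes the nine keys shown in the examples above are all required, which this guide does not state explicitly; validate_env is a hypothetical helper name.

```shell
# Sketch only: validate_env is illustrative, and treating every key as
# mandatory is an assumption based on the example files above.
validate_env() {
  env_file=$1
  rc=0
  [ -f "$env_file" ] || { echo "missing file: $env_file" >&2; return 1; }
  for key in HA_URL HA_TOKEN NODE_NAME LOCATION_TAG CHECK_INTERVAL \
             CHECK_INTERVAL_UNAVAILABLE UNAVAILABLE_THRESHOLD_HOURS \
             ALERT_COOLDOWN_HOURS LOG_LEVEL; do
    # Require KEY=<non-empty value> at the start of a line.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "missing or empty: $key" >&2
      rc=1
    fi
  done
  # Catch a token placeholder that was never filled in.
  if grep -Eq '^HA_TOKEN=<.*>$' "$env_file"; then
    echo "HA_TOKEN still holds a placeholder" >&2
    rc=1
  fi
  # The file holds a secret, so expect exactly mode 600 (as set by chmod above).
  if ! find "$env_file" -perm 600 | grep -q .; then
    echo "permissions are not 600: $env_file" >&2
    rc=1
  fi
  return "$rc"
}
```

Running validate_env /opt/homelab/config/ha-diag-agent/.env prints one line per problem on stderr and returns non-zero if any check fails.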
Section 3: Deploy procedure
From SATURN (standard flow)
# 1. Commit and push changes from SATURN
git push
# 2. SSH to target node
ssh oskar@piha # or chelsty-infra
# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
chelsty-infra (docker-compose v1)
chelsty-infra runs docker-compose v1 (1.29.2). The deploy script calls
docker-compose (hyphenated), which is correct. If you need to run manually:
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
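For manual runs on nodes that may have either Compose flavour, a wrapper can pick whichever is installed. This is a sketch under the assumption that preferring the v2 plugin is acceptable; compose_cmd is a hypothetical helper and is not what scripts/deploy/deploy.sh does.

```shell
# Sketch only: compose_cmd is illustrative and independent of deploy.sh.
# Prints "docker compose" when the v2 plugin responds, otherwise
# "docker-compose" when the v1 binary is on PATH; fails if neither exists.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "no docker compose implementation found" >&2
    return 1
  fi
}
```

On chelsty-infra this should print docker-compose, so a manual run becomes $(compose_cmd) up -d --build from the service directory.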
Section 4: Verification
# Container is up
docker ps | grep ha-diag-agent
# Last 50 log lines
docker logs ha-diag-agent --tail 50
# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}
# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds
# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
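Two of the checks above lend themselves to scripting: the health response and the presence of fresh event files. The helpers below are a sketch; health_ok and recent_events are hypothetical names, and the 5-minute default window is an assumption rather than a documented threshold.

```shell
# Sketch only: helper names and the default window are illustrative.

# health_ok BODY: succeed if /health reports status "ok" and a live websocket.
health_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' &&
    printf '%s' "$1" | grep -Eq '"ws_connected"[[:space:]]*:[[:space:]]*true'
}

# recent_events DIR [MINUTES]: list ha_*.json files in DIR modified within
# the last MINUTES minutes (default 5).
recent_events() {
  find "$1" -maxdepth 1 -type f -name 'ha_*.json' -mmin "-${2:-5}"
}
```

For example, health_ok "$(curl -s http://localhost:8087/health)" checks the agent, and recent_events /opt/homelab/events/<node-name> printing nothing suggests events are not being written.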
Section 5: First-48h observation (shadow mode)
The supervisor starts with HA_DIAG_SHADOW_MODE=true (default). During this
window, the container_restart action that a ha_websocket_dead event would
otherwise produce is downgraded to an alert_only action tagged [SHADOW MODE].
The same action_id and 30-minute cooldown still apply.
Watch for these signals in Telegram:
- [SHADOW MODE] would have triggered container_restart for homeassistant: confirms the detection path works end-to-end
- ha_entity_unavailable_long / ha_integration_failed / etc.: these are always alert_only regardless of shadow mode; verify descriptions look accurate and thresholds are reasonable
Things to evaluate:
| Question | Good sign |
|---|---|
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
| Are there false positives? | No alerts during known-good uptime |
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
| Are integration-failed alerts genuine? | Yes, not noise from startup |
Note any false positives or noisy thresholds before enabling production mode.
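The "≤ 1 per 30 min per node" criterion can be checked mechanically once alert times are collected. The sketch below assumes you have extracted one epoch timestamp per line for a single node's shadow alerts; how you extract them (Telegram export, supervisor log) is not covered here, and too_frequent is a hypothetical helper.

```shell
# Sketch only: the input format (one epoch timestamp per line, one node)
# is an assumption; this guide does not define an alert log format.

# too_frequent [WINDOW_SECONDS]: read timestamps on stdin, print each pair of
# consecutive alerts closer together than WINDOW_SECONDS (default 1800), and
# return non-zero if any such pair exists.
too_frequent() {
  sort -n | awk -v window="${1:-1800}" '
    NR > 1 && ($1 - prev) < window { print prev, $1; bad = 1 }
    { prev = $1 }
    END { exit bad }
  '
}
```

Any printed pair indicates two alerts inside one cooldown window, which is worth investigating before production mode is enabled.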
Section 6: Enabling production mode
HA_DIAG_SHADOW_MODE is an environment variable read by the supervisor
container. The VPS supervisor env vars live in the version-controlled
override file at hosts/vps/runtime/control-plane/docker-compose.override.yml
(not in a runtime .env file — the supervisor has no env_file: directive).
When the 48h observation period looks clean:
1. Edit the override file on SATURN:
# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
  supervisor:
    environment:
      - NODE_ALIAS_MAP={"node-2":"chelsty"}
      - HA_DIAG_SHADOW_MODE=false # add this line
2. Commit and push from SATURN:
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push
3. Apply on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
-f services/control-plane/docker-compose.yml \
-f hosts/vps/runtime/control-plane/docker-compose.override.yml \
up -d supervisor
4. Confirm:
docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled
From this point, the next ha_websocket_dead event will generate a
container_restart action in the approval queue. The 30-minute cooldown
still applies after each restart.
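Before applying the change on the VPS, it can help to confirm what the override file actually sets. This sketch assumes the flag is written in the list form shown in step 1 (- KEY=value); shadow_mode_setting is a hypothetical helper and does not handle the mapping form (KEY: value) or quoted values.

```shell
# Sketch only: handles the "- KEY=value" list form used in this guide.

# shadow_mode_setting FILE: print the HA_DIAG_SHADOW_MODE value from a compose
# override file, or "true" (the supervisor default) when the flag is absent.
shadow_mode_setting() {
  sm_value=$(sed -n \
    's/^[[:space:]]*-[[:space:]]*HA_DIAG_SHADOW_MODE=\([^[:space:]#]*\).*$/\1/p' \
    "$1" | tail -n 1)
  echo "${sm_value:-true}"
}
```

Running shadow_mode_setting hosts/vps/runtime/control-plane/docker-compose.override.yml should print false after step 1. To see the fully merged result, docker compose with the same two -f arguments and the config subcommand prints the effective configuration.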
Section 7: Rollback
If production mode causes unexpected behaviour:
# Option A — re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
-f services/control-plane/docker-compose.yml \
-f hosts/vps/runtime/control-plane/docker-compose.override.yml \
up -d supervisor
# Option B — stop ha-diag-agent entirely on affected nodes
ssh oskar@<node>
docker stop ha-diag-agent
# Events written before rollback remain in /opt/homelab/events/<node>/
# and are historical only — no automated action will be taken on them
# unless the supervisor re-processes them, which it won't (already in
# _ha_processed_event_ids).
Any container_restart actions still in pending/ after rollback can be
manually rejected via the Telegram bot or by deleting the action files from
/opt/homelab/actions/pending/ on the VPS.
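To find which pending files to reject or delete, a simple text search may be enough. This sketch assumes each pending action is a plain-text file (for example JSON) that contains the literal string container_restart; the action file format is not described in this guide, and pending_restarts is a hypothetical helper.

```shell
# Sketch only: assumes action files are text and mention container_restart.

# pending_restarts DIR: print the paths of files under DIR that contain the
# string container_restart. Prints nothing if DIR is missing or has no match.
pending_restarts() {
  [ -d "$1" ] || return 0
  find "$1" -type f -exec grep -l 'container_restart' {} + 2>/dev/null || true
}
```

Review the output of pending_restarts /opt/homelab/actions/pending before deleting anything; the helper only lists files and never removes them.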