Oskar Kapala 52607a7cdd feat(control-plane): shadow_mode for HA event auto-actions + deploy docs

- HA_DIAG_SHADOW_MODE env flag in supervisor (default true)
- shadow_mode downgrades container_restart actions to alert_only with
  [SHADOW MODE] note; same action_id and 30-min cooldown apply
- alert_only events unaffected (always routed normally)
- 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected
- DEPLOY.md with token gen, per-host config, verification, 48h observation,
  production-mode enablement, rollback
- README.md updated with shadow mode flag summary and DEPLOY.md link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-29 17:12:33 +02:00

6.6 KiB

Raw Blame History

ha-diag-agent Deployment Guide

Section 1: Prerequisites

HA long-lived access token

The agent authenticates to Home Assistant with a long-lived token issued by a dedicated service account. Do not use a personal admin token.

In HA: Settings → People → Add Person
- Name: diag_agent
- Do not add to any group (no admin rights needed)
Log in to HA as diag_agent
Go to Profile → Long-Lived Access Tokens → Create token
- Name: ha-diag-agent
- Copy the token — it is shown only once
Store the token in the node's .env file (see Section 2)

Tailnet reachability check (chelsty-infra only)

chelsty-infra reaches Home Assistant on chelsty-ha over Tailscale. Verify before deploying:

curl -sf http://100.70.180.90:8123/api/ \
  -H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}

If the request times out, check that both nodes are on the Tailscale mesh (tailscale status) and that chelsty-ha is powered on.

Section 2: Per-host config

Create /opt/homelab/config/ha-diag-agent/.env on each target node:

piha

mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env

chelsty-infra

mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env

If chelsty-ha gets a new Tailscale IP, update HA_URL in this file and restart the container.

Section 3: Deploy procedure

From SATURN (standard flow)

# 1. Commit and push changes from SATURN
git push

# 2. SSH to target node
ssh oskar@piha            # or chelsty-infra

# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent

chelsty-infra (docker-compose v1)

chelsty-infra runs docker-compose v1 (1.29.2). The deploy script calls docker-compose (hyphenated), which is correct. If you need to run manually:

cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build

Section 4: Verification

# Container is up
docker ps | grep ha-diag-agent

# Last 50 log lines
docker logs ha-diag-agent --tail 50

# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}

# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds

# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_

Section 5: First-48h observation (shadow mode)

The supervisor starts with HA_DIAG_SHADOW_MODE=true (default). During this window, ha_websocket_dead events are downgraded to alert_only actions tagged [SHADOW MODE] rather than triggering an automatic restart.

Watch for these signals in Telegram:

[SHADOW MODE] would have triggered container_restart for homeassistant — confirms the detection path works end-to-end
ha_entity_unavailable_long / ha_integration_failed / etc. — these are always alert_only regardless of shadow mode; verify descriptions look accurate and thresholds are reasonable

Things to evaluate:

Question	Good sign
Are shadow alerts firing at reasonable frequency?	≤ 1 per 30 min per node
Are there false positives?	No alerts during known-good uptime
Are entity-unavailable alerts describing real entities?	Yes, names match HA UI
Are integration-failed alerts genuine?	Yes, not noise from startup

Note any false positives or noisy thresholds before enabling production mode.

Section 6: Enabling production mode

HA_DIAG_SHADOW_MODE is an environment variable read by the supervisor container. The VPS supervisor env vars live in the version-controlled override file at hosts/vps/runtime/control-plane/docker-compose.override.yml (not in a runtime .env file — the supervisor has no env_file: directive).

When the 48h observation period looks clean:

1. Edit the override file on SATURN:

# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
  supervisor:
    environment:
      - NODE_ALIAS_MAP={"node-2":"chelsty"}
      - HA_DIAG_SHADOW_MODE=false      # add this line

2. Commit and push from SATURN:

git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push

3. Apply on VPS:

ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor

4. Confirm:

docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled

From this point, the next ha_websocket_dead event will generate a container_restart action in the approval queue. The 30-minute cooldown still applies after each restart.

Section 7: Rollback

If production mode causes unexpected behaviour:

# Option A — re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor

# Option B — stop ha-diag-agent entirely on affected nodes
ssh oskar@<node>
docker stop ha-diag-agent

# Events written before rollback remain in /opt/homelab/events/<node>/
# and are historical only — no automated action will be taken on them
# unless the supervisor re-processes them, which it won't (already in
# _ha_processed_event_ids).

Any container_restart actions still in pending/ after rollback can be manually rejected via the Telegram bot or by deleting the action files from /opt/homelab/actions/pending/ on the VPS.

6.6 KiB Raw Blame History