homelab-codex-ws/services/ha-diag-agent/DEPLOY.md

240 lines
6.6 KiB
Markdown
Raw Permalink Normal View History

# ha-diag-agent Deployment Guide
## Section 1: Prerequisites
### HA long-lived access token
The agent authenticates to Home Assistant with a long-lived token issued by a
dedicated service account. Do not use a personal admin token.
1. In HA: **Settings → People → Add Person**
- Name: `diag_agent`
- Do **not** add to any group (no admin rights needed)
2. Log in to HA as `diag_agent`
3. Go to **Profile → Long-Lived Access Tokens → Create token**
- Name: `ha-diag-agent`
- Copy the token — it is shown only once
4. Store the token in the node's `.env` file (see Section 2)
### Tailnet reachability check (chelsty-infra only)
`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
Verify before deploying:
```bash
curl -sf http://100.70.180.90:8123/api/ \
-H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}
```
If the request times out, check that both nodes are on the Tailscale mesh
(`tailscale status`) and that `chelsty-ha` is powered on.
---
## Section 2: Per-host config
Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:
### piha
```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```
### chelsty-infra
```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```
> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
> restart the container.
---
## Section 3: Deploy procedure
### From SATURN (standard flow)
```bash
# 1. Commit and push changes from SATURN
git push
# 2. SSH to target node
ssh oskar@piha # or chelsty-infra
# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
```
### chelsty-infra (docker-compose v1)
`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
`docker-compose` (hyphenated), which is correct. If you need to run manually:
```bash
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
```
---
## Section 4: Verification
```bash
# Container is up
docker ps | grep ha-diag-agent
# Last 50 log lines
docker logs ha-diag-agent --tail 50
# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}
# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds
# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
```
---
## Section 5: First-48h observation (shadow mode)
The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
window, `ha_websocket_dead` events are downgraded to `alert_only` actions
tagged `[SHADOW MODE]` rather than triggering an automatic restart.
Watch for these signals in Telegram:
- `[SHADOW MODE] would have triggered container_restart for homeassistant`
confirms the detection path works end-to-end
- `ha_entity_unavailable_long` / `ha_integration_failed` / etc. — these are
always `alert_only` regardless of shadow mode; verify descriptions look
accurate and thresholds are reasonable
Things to evaluate:
| Question | Good sign |
|----------|-----------|
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
| Are there false positives? | No alerts during known-good uptime |
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
| Are integration-failed alerts genuine? | Yes, not noise from startup |
Note any false positives or noisy thresholds before enabling production mode.
---
## Section 6: Enabling production mode
`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
container. The VPS supervisor env vars live in the version-controlled
override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
(not in a runtime `.env` file — the supervisor has no `env_file:` directive).
When the 48h observation period looks clean:
**1. Edit the override file on SATURN:**
```yaml
# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
supervisor:
environment:
- NODE_ALIAS_MAP={"node-2":"chelsty"}
- HA_DIAG_SHADOW_MODE=false # add this line
```
**2. Commit and push from SATURN:**
```bash
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push
```
**3. Apply on VPS:**
```bash
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
-f services/control-plane/docker-compose.yml \
-f hosts/vps/runtime/control-plane/docker-compose.override.yml \
up -d supervisor
```
**4. Confirm:**
```bash
docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled
```
From this point, the next `ha_websocket_dead` event will generate a
`container_restart` action in the approval queue. The 30-minute cooldown
still applies after each restart.
---
## Section 7: Rollback
If production mode causes unexpected behaviour:
```bash
# Option A — re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
-f services/control-plane/docker-compose.yml \
-f hosts/vps/runtime/control-plane/docker-compose.override.yml \
up -d supervisor
# Option B — stop ha-diag-agent entirely on affected nodes
ssh oskar@<node>
docker stop ha-diag-agent
# Events written before rollback remain in /opt/homelab/events/<node>/
# and are historical only — no automated action will be taken on them
# unless the supervisor re-processes them, which it won't (already in
# _ha_processed_event_ids).
```
Any `container_restart` actions still in `pending/` after rollback can be
manually rejected via the Telegram bot or by deleting the action files from
`/opt/homelab/actions/pending/` on the VPS.