# ha-diag-agent Deployment Guide

## Section 1: Prerequisites

### HA long-lived access token

The agent authenticates to Home Assistant with a long-lived token issued by a
dedicated service account. Do not use a personal admin token.

1. In HA: **Settings → People → Add Person**
   - Name: `diag_agent`
   - Do **not** add to any group (no admin rights needed)
2. Log in to HA as `diag_agent`
3. Go to **Profile → Long-Lived Access Tokens → Create token**
   - Name: `ha-diag-agent`
   - Copy the token immediately; it is shown only once
4. Store the token in the node's `.env` file (see Section 2)

### Tailnet reachability check (chelsty-infra only)

`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
Verify before deploying:

```bash
curl -sf http://100.70.180.90:8123/api/ \
  -H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}
```
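When the check runs from a script rather than by hand, it helps to fail on anything other than the expected message and to bound the wait. The following is a minimal sketch, not part of the agent: the function name `ha_api_running`, the `HA_TOKEN` shell variable, and the 10-second timeout are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ha-diag-agent): succeeds only when the
# JSON read from stdin carries the message Home Assistant returns from /api/.
ha_api_running() {
    # Tolerate optional whitespace after the colon.
    grep -Eq '"message":[[:space:]]*"API running\."'
}

# Usage (assumes the token is exported as HA_TOKEN; --max-time 10 is an
# arbitrary bound so an unreachable tailnet peer fails fast):
#   curl -sf --max-time 10 http://100.70.180.90:8123/api/ \
#       -H "Authorization: Bearer ${HA_TOKEN}" | ha_api_running \
#       && echo "HA reachable" || echo "HA not reachable"
```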

If the request times out, check that both nodes are on the Tailscale mesh
(`tailscale status`) and that `chelsty-ha` is powered on.

---

## Section 2: Per-host config

Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:

### piha

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

### chelsty-infra

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
> restart the container.
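Before starting the container, a quick check that the file defines every variable shown above and is not group or world readable can catch copy-paste slips. This is a sketch only: `check_env_file` is a hypothetical helper, the key list is taken from the examples in this section, and `stat -c` assumes GNU coreutils (Linux). It could be called as `check_env_file /opt/homelab/config/ha-diag-agent/.env`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify an ha-diag-agent .env file.
# Returns 0 when every expected key is present and the mode is 600.
check_env_file() {
    local file="$1" key mode status=0
    [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }

    # Keys taken from the example files in this section.
    for key in HA_URL HA_TOKEN NODE_NAME LOCATION_TAG CHECK_INTERVAL \
               CHECK_INTERVAL_UNAVAILABLE UNAVAILABLE_THRESHOLD_HOURS \
               ALERT_COOLDOWN_HOURS LOG_LEVEL; do
        # Require KEY=<non-empty value> at the start of a line.
        if ! grep -Eq "^${key}=.+" "$file"; then
            echo "unset or empty: $key" >&2
            status=1
        fi
    done

    # The guide sets chmod 600; flag anything else (GNU stat syntax).
    mode=$(stat -c '%a' "$file")
    if [ "$mode" != "600" ]; then
        echo "mode is $mode, expected 600" >&2
        status=1
    fi
    return "$status"
}
```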

---

## Section 3: Deploy procedure

### From SATURN (standard flow)

```bash
# 1. Commit and push changes from SATURN
git push

# 2. SSH to target node
ssh oskar@piha   # or chelsty-infra

# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
```

### chelsty-infra (docker-compose v1)

`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
`docker-compose` (hyphenated), which is the correct binary for v1. To run the
deploy manually:

```bash
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
```

---

## Section 4: Verification

```bash
# Container is up
docker ps | grep ha-diag-agent

# Last 50 log lines
docker logs ha-diag-agent --tail 50

# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}

# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds

# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
```
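The health check can also be scripted, for example as a post-deploy gate. This is a minimal sketch, assuming the `/health` response is a JSON object with the `status` and `ws_connected` fields shown above; `health_ok` is a hypothetical name, and `python3` is used only to parse the JSON. On a node it could be run as `curl -sf http://localhost:8087/health | health_ok && echo healthy`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read the /health JSON from stdin and succeed only
# when status is "ok" and ws_connected is true.
health_ok() {
    python3 -c '
import json, sys
try:
    body = json.load(sys.stdin)
except ValueError:
    sys.exit(1)  # empty or non-JSON response
healthy = (
    isinstance(body, dict)
    and body.get("status") == "ok"
    and body.get("ws_connected") is True
)
sys.exit(0 if healthy else 1)
'
}
```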

---

## Section 5: First-48h observation (shadow mode)

The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
window, `ha_websocket_dead` events are downgraded to `alert_only` actions
tagged `[SHADOW MODE]` rather than triggering an automatic restart.

Watch for these signals in Telegram:

- `[SHADOW MODE] would have triggered container_restart for homeassistant`:
  confirms the detection path works end-to-end
- `ha_entity_unavailable_long` / `ha_integration_failed` / etc.: these are
  always `alert_only` regardless of shadow mode; verify descriptions look
  accurate and thresholds are reasonable

Things to evaluate:

| Question | Good sign |
|----------|-----------|
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
| Are there false positives? | No alerts during known-good uptime |
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
| Are integration-failed alerts genuine? | Yes, not noise from startup |

Note any false positives or noisy thresholds before enabling production mode.

---

## Section 6: Enabling production mode

`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
container. The VPS supervisor env vars live in the version-controlled
override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
(not in a runtime `.env` file; the supervisor has no `env_file:` directive).

When the 48h observation period looks clean:

**1. Edit the override file on SATURN:**

```yaml
# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
  supervisor:
    environment:
      - NODE_ALIAS_MAP={"node-2":"chelsty"}
      - HA_DIAG_SHADOW_MODE=false   # add this line
```

**2. Commit and push from SATURN:**

```bash
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push
```

**3. Apply on VPS:**

```bash
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor
```

**4. Confirm:**

```bash
docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled
```
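To avoid eyeballing the log, the expected line can be matched directly. This is a small sketch, assuming the supervisor logs the literal string `shadow_mode=False` at startup as shown above; `shadow_mode_disabled` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when the log text on stdin contains the
# startup line reporting that shadow mode is off.
shadow_mode_disabled() {
    grep -q 'shadow_mode=False'
}

# Usage on the VPS (docker logs may write to stderr, hence 2>&1):
#   docker logs control-plane-supervisor --tail 50 2>&1 | shadow_mode_disabled \
#       && echo "production mode active" || echo "still in shadow mode"
```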

From this point, the next `ha_websocket_dead` event will generate a
`container_restart` action in the approval queue. The 30-minute cooldown
still applies after each restart.

---

## Section 7: Rollback

If production mode causes unexpected behaviour:

```bash
# Option A: re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line; the default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor

# Option B: stop ha-diag-agent entirely on affected nodes
ssh oskar@<node>
docker stop ha-diag-agent

# Events written before rollback remain in /opt/homelab/events/<node>/
# as historical records. The supervisor will not act on them again because
# their IDs are already in _ha_processed_event_ids.
```

Any `container_restart` actions still in `pending/` after rollback can be
manually rejected via the Telegram bot or by deleting the action files from
`/opt/homelab/actions/pending/` on the VPS.
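Before deleting anything by hand, it is safer to list which pending files would be affected. This is a sketch under a loud assumption: the on-disk format of action files is not described in this guide, so the helper simply looks for the literal string `container_restart` inside each file. `list_pending_restarts` is a hypothetical name, and `xargs -r` assumes GNU findutils.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print pending action files that mention
# container_restart. Assumes action files are plain text (for example JSON)
# containing the action type as a literal string.
list_pending_restarts() {
    local dir="${1:-/opt/homelab/actions/pending}"
    [ -d "$dir" ] || return 0   # nothing to list
    # -l prints matching file names only; -r descends into subdirectories.
    grep -rl 'container_restart' "$dir" || true
}

# Review the list first, then remove the files it names, for example:
#   list_pending_restarts
#   list_pending_restarts | xargs -r rm -v
```

Rejecting through the Telegram bot remains the cleaner route where it is available, since it presumably goes through the supervisor's own bookkeeping; direct deletion is a fallback.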