homelab-codex-ws/services/ha-diag-agent/DEPLOY.md

# ha-diag-agent Deployment Guide

## Section 1: Prerequisites

### HA long-lived access token

The agent authenticates to Home Assistant with a long-lived token issued by a
dedicated service account. Do not use a personal admin token.

1. In HA: **Settings → People → Add Person**
   - Name: `diag_agent`
   - Do **not** add to any group (no admin rights needed)
2. Log in to HA as `diag_agent`
3. Go to **Profile → Long-Lived Access Tokens → Create token**
   - Name: `ha-diag-agent`
   - Copy the token — it is shown only once
4. Store the token in the node's `.env` file (see Section 2)

### Tailnet reachability check (chelsty-infra only)

`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
Verify before deploying:

```bash
curl -sf http://100.70.180.90:8123/api/ \
  -H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}
```

If the request times out, check that both nodes are on the Tailscale mesh
(`tailscale status`) and that `chelsty-ha` is powered on.

---

## Section 2: Per-host config

Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:

### piha

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

### chelsty-infra

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
> restart the container.

---

## Section 3: Deploy procedure

### From SATURN (standard flow)

```bash
# 1. Commit and push changes from SATURN
git push

# 2. SSH to target node
ssh oskar@piha            # or chelsty-infra

# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
```

### chelsty-infra (docker-compose v1)

`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
`docker-compose` (hyphenated), which is correct. If you need to run manually:

```bash
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
```

---

## Section 4: Verification

```bash
# Container is up
docker ps | grep ha-diag-agent

# Last 50 log lines
docker logs ha-diag-agent --tail 50

# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}

# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds

# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
```

---

## Section 5: First-48h observation (shadow mode)

The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
window, `ha_websocket_dead` events are downgraded to `alert_only` actions
tagged `[SHADOW MODE]` rather than triggering an automatic restart.

Watch for these signals in Telegram:

- `[SHADOW MODE] would have triggered container_restart for homeassistant` —
  confirms the detection path works end-to-end
- `ha_entity_unavailable_long` / `ha_integration_failed` / etc. — these are
  always `alert_only` regardless of shadow mode; verify descriptions look
  accurate and thresholds are reasonable

Things to evaluate:

| Question | Good sign |
|----------|-----------|
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
| Are there false positives? | No alerts during known-good uptime |
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
| Are integration-failed alerts genuine? | Yes, not noise from startup |

Note any false positives or noisy thresholds before enabling production mode.

---

## Section 6: Enabling production mode

`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
container. The VPS supervisor env vars live in the version-controlled
override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
(not in a runtime `.env` file — the supervisor has no `env_file:` directive).

When the 48h observation period looks clean:

**1. Edit the override file on SATURN:**

```yaml
# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
  supervisor:
    environment:
      - NODE_ALIAS_MAP={"node-2":"chelsty"}
      - HA_DIAG_SHADOW_MODE=false      # add this line
```

**2. Commit and push from SATURN:**

```bash
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push
```

**3. Apply on VPS:**

```bash
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor
```

**4. Confirm:**

```bash
docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled
```

From this point, the next `ha_websocket_dead` event will generate a
`container_restart` action in the approval queue. The 30-minute cooldown
still applies after each restart.

---

## Section 7: Rollback

If production mode causes unexpected behaviour:

```bash
# Option A — re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor

# Option B — stop ha-diag-agent entirely on affected nodes
ssh oskar@<node>
docker stop ha-diag-agent

# Events written before rollback remain in /opt/homelab/events/<node>/
# and are historical only — no automated action will be taken on them
# unless the supervisor re-processes them, which it won't (already in
# _ha_processed_event_ids).
```

Any `container_restart` actions still in `pending/` after rollback can be
manually rejected via the Telegram bot or by deleting the action files from
`/opt/homelab/actions/pending/` on the VPS.
feat(control-plane): shadow_mode for HA event auto-actions + deploy docs - HA_DIAG_SHADOW_MODE env flag in supervisor (default true) - shadow_mode downgrades container_restart actions to alert_only with [SHADOW MODE] note; same action_id and 30-min cooldown apply - alert_only events unaffected (always routed normally) - 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected - DEPLOY.md with token gen, per-host config, verification, 48h observation, production-mode enablement, rollback - README.md updated with shadow mode flag summary and DEPLOY.md link Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> 2026-05-29 17:04:39 +02:00			`# ha-diag-agent Deployment Guide`

			`## Section 1: Prerequisites`

			`### HA long-lived access token`

			`The agent authenticates to Home Assistant with a long-lived token issued by a`
			`dedicated service account. Do not use a personal admin token.`

			`1. In HA: Settings → People → Add Person`
			- Name: `diag_agent`
			`- Do not add to any group (no admin rights needed)`
			2. Log in to HA as `diag_agent`
			`3. Go to Profile → Long-Lived Access Tokens → Create token`
			- Name: `ha-diag-agent`
			`- Copy the token — it is shown only once`
			4. Store the token in the node's `.env` file (see Section 2)

			`### Tailnet reachability check (chelsty-infra only)`

			`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
			`Verify before deploying:`

			```bash
			`curl -sf http://100.70.180.90:8123/api/ \`
			`-H "Authorization: Bearer <token>" \| python3 -m json.tool`
			`# Expect: {"message": "API running."}`
			```

			`If the request times out, check that both nodes are on the Tailscale mesh`
			(`tailscale status`) and that `chelsty-ha` is powered on.

			`---`

			`## Section 2: Per-host config`

			Create `/opt/homelab/config/ha-diag-agent/.env` on each target node:

			`### piha`

			```bash
			`mkdir -p /opt/homelab/config/ha-diag-agent`
			`cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'`
			`HA_URL=http://localhost:8123`
			`HA_TOKEN=<long-lived-token-for-piha>`
			`NODE_NAME=piha`
			`LOCATION_TAG=ken`
			`CHECK_INTERVAL=60`
			`CHECK_INTERVAL_UNAVAILABLE=3600`
			`UNAVAILABLE_THRESHOLD_HOURS=24`
			`ALERT_COOLDOWN_HOURS=6`
			`LOG_LEVEL=info`
			`EOF`
			`chmod 600 /opt/homelab/config/ha-diag-agent/.env`
			```

			`### chelsty-infra`

			```bash
			`mkdir -p /opt/homelab/config/ha-diag-agent`
			`cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'`
			`HA_URL=http://100.70.180.90:8123`
			`HA_TOKEN=<long-lived-token-for-chelsty-ha>`
			`NODE_NAME=chelsty-infra`
			`LOCATION_TAG=chelsty`
			`CHECK_INTERVAL=60`
			`CHECK_INTERVAL_UNAVAILABLE=3600`
			`UNAVAILABLE_THRESHOLD_HOURS=24`
			`ALERT_COOLDOWN_HOURS=6`
			`LOG_LEVEL=info`
			`EOF`
			`chmod 600 /opt/homelab/config/ha-diag-agent/.env`
			```

			> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
			`> restart the container.`

			`---`

			`## Section 3: Deploy procedure`

			`### From SATURN (standard flow)`

			```bash
			`# 1. Commit and push changes from SATURN`
			`git push`

			`# 2. SSH to target node`
			`ssh oskar@piha # or chelsty-infra`

			`# 3. Pull latest and deploy`
			`cd ~/homelab-codex-ws`
			`git pull`
			`scripts/deploy/deploy.sh --service ha-diag-agent`
			```

			`### chelsty-infra (docker-compose v1)`

			`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
			`docker-compose` (hyphenated), which is correct. If you need to run manually:

			```bash
			`cd ~/homelab-codex-ws/services/ha-diag-agent`
			`docker-compose up -d --build`
			```

			`---`

			`## Section 4: Verification`

			```bash
			`# Container is up`
			`docker ps \| grep ha-diag-agent`

			`# Last 50 log lines`
			`docker logs ha-diag-agent --tail 50`

			`# FastAPI health endpoint`
			`curl http://localhost:8087/health`
			`# Expect: {"status": "ok", "ws_connected": true, ...}`

			`# Events are being written`
			`ls /opt/homelab/events/<node-name>/`
			`# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds`

			`# Supervisor is picking up events (check on VPS / control-plane)`
			`tail -f /opt/homelab/logs/supervisor.log \| grep ha_`
			```

			`---`

			`## Section 5: First-48h observation (shadow mode)`

			The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
			window, `ha_websocket_dead` events are downgraded to `alert_only` actions
			tagged `[SHADOW MODE]` rather than triggering an automatic restart.

			`Watch for these signals in Telegram:`

			- `[SHADOW MODE] would have triggered container_restart for homeassistant` —
			`confirms the detection path works end-to-end`
			- `ha_entity_unavailable_long` / `ha_integration_failed` / etc. — these are
			always `alert_only` regardless of shadow mode; verify descriptions look
			`accurate and thresholds are reasonable`

			`Things to evaluate:`

			`\| Question \| Good sign \|`
			`\|----------\|-----------\|`
			`\| Are shadow alerts firing at reasonable frequency? \| ≤ 1 per 30 min per node \|`
			`\| Are there false positives? \| No alerts during known-good uptime \|`
			`\| Are entity-unavailable alerts describing real entities? \| Yes, names match HA UI \|`
			`\| Are integration-failed alerts genuine? \| Yes, not noise from startup \|`

			`Note any false positives or noisy thresholds before enabling production mode.`

			`---`

			`## Section 6: Enabling production mode`

			`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
			`container. The VPS supervisor env vars live in the version-controlled`
			override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
			(not in a runtime `.env` file — the supervisor has no `env_file:` directive).

			`When the 48h observation period looks clean:`

			`1. Edit the override file on SATURN:`

			```yaml
			`# hosts/vps/runtime/control-plane/docker-compose.override.yml`
			`services:`
			`supervisor:`
			`environment:`
			`- NODE_ALIAS_MAP={"node-2":"chelsty"}`
			`- HA_DIAG_SHADOW_MODE=false # add this line`
			```

			`2. Commit and push from SATURN:`

			```bash
			`git add hosts/vps/runtime/control-plane/docker-compose.override.yml`
			`git commit -m "feat(control-plane): disable HA shadow mode — production ready"`
			`git push`
			```

			`3. Apply on VPS:`

			```bash
			`ssh oskar@100.95.58.48`
			`cd ~/homelab-codex-ws && git pull`
			`docker compose \`
			`-f services/control-plane/docker-compose.yml \`
			`-f hosts/vps/runtime/control-plane/docker-compose.override.yml \`
			`up -d supervisor`
			```

			`4. Confirm:`

			```bash
			`docker logs control-plane-supervisor --tail 5`
			`# Expect: shadow_mode=False — HA container_restart actions enabled`
			```

			From this point, the next `ha_websocket_dead` event will generate a
			`container_restart` action in the approval queue. The 30-minute cooldown
			`still applies after each restart.`

			`---`

			`## Section 7: Rollback`

			`If production mode causes unexpected behaviour:`

			```bash
			`# Option A — re-enable shadow mode`
			`# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml`
			`# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)`
			`# Commit, push, then on VPS:`
			`ssh oskar@100.95.58.48`
			`cd ~/homelab-codex-ws && git pull`
			`docker compose \`
			`-f services/control-plane/docker-compose.yml \`
			`-f hosts/vps/runtime/control-plane/docker-compose.override.yml \`
			`up -d supervisor`

			`# Option B — stop ha-diag-agent entirely on affected nodes`
			`ssh oskar@<node>`
			`docker stop ha-diag-agent`

			`# Events written before rollback remain in /opt/homelab/events/<node>/`
			`# and are historical only — no automated action will be taken on them`
			`# unless the supervisor re-processes them, which it won't (already in`
			`# _ha_processed_event_ids).`
			```

			Any `container_restart` actions still in `pending/` after rollback can be
			`manually rejected via the Telegram bot or by deleting the action files from`
			`/opt/homelab/actions/pending/` on the VPS.