# ha-diag-agent Deployment Guide

## Section 1: Prerequisites

### HA long-lived access token

The agent authenticates to Home Assistant with a long-lived token issued by a
dedicated service account. Do not use a personal admin token.

1. In HA: **Settings → People → Add Person**
   - Name: `diag_agent`
   - Do **not** add to any group (no admin rights needed)
2. Log in to HA as `diag_agent`
3. Go to **Profile → Long-Lived Access Tokens → Create token**
   - Name: `ha-diag-agent`
   - Copy the token — it is shown only once
4. Store the token in the node's `.env` file (see Section 2)

### Tailnet reachability check (chelsty-infra only)

`chelsty-infra` reaches Home Assistant on `chelsty-ha` over Tailscale.
Verify before deploying:

```bash
curl -sf http://100.70.180.90:8123/api/ \
  -H "Authorization: Bearer <token>" | python3 -m json.tool
# Expect: {"message": "API running."}
```

If the request times out, check that both nodes are on the Tailscale mesh
(`tailscale status`) and that `chelsty-ha` is powered on.

---

## Section 2: Per-host config

Create `/opt/homelab/config/ha-diag-agent/.env` on **each target node**:

### piha

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://localhost:8123
HA_TOKEN=<long-lived-token-for-piha>
NODE_NAME=piha
LOCATION_TAG=ken
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

### chelsty-infra

```bash
mkdir -p /opt/homelab/config/ha-diag-agent
cat > /opt/homelab/config/ha-diag-agent/.env << 'EOF'
HA_URL=http://100.70.180.90:8123
HA_TOKEN=<long-lived-token-for-chelsty-ha>
NODE_NAME=chelsty-infra
LOCATION_TAG=chelsty
CHECK_INTERVAL=60
CHECK_INTERVAL_UNAVAILABLE=3600
UNAVAILABLE_THRESHOLD_HOURS=24
ALERT_COOLDOWN_HOURS=6
LOG_LEVEL=info
EOF
chmod 600 /opt/homelab/config/ha-diag-agent/.env
```

> If `chelsty-ha` gets a new Tailscale IP, update `HA_URL` in this file and
> restart the container.

---

## Section 3: Deploy procedure

### From SATURN (standard flow)

```bash
# 1. Commit and push changes from SATURN
git push

# 2. SSH to target node
ssh oskar@piha            # or chelsty-infra

# 3. Pull latest and deploy
cd ~/homelab-codex-ws
git pull
scripts/deploy/deploy.sh --service ha-diag-agent
```

### chelsty-infra (docker-compose v1)

`chelsty-infra` runs docker-compose v1 (1.29.2). The deploy script calls
`docker-compose` (hyphenated), which is correct. If you need to run manually:

```bash
cd ~/homelab-codex-ws/services/ha-diag-agent
docker-compose up -d --build
```

---

## Section 4: Verification

```bash
# Container is up
docker ps | grep ha-diag-agent

# Last 50 log lines
docker logs ha-diag-agent --tail 50

# FastAPI health endpoint
curl http://localhost:8087/health
# Expect: {"status": "ok", "ws_connected": true, ...}

# Events are being written
ls /opt/homelab/events/<node-name>/
# Expect: ha_*.json files appearing within the first CHECK_INTERVAL seconds

# Supervisor is picking up events (check on VPS / control-plane)
tail -f /opt/homelab/logs/supervisor.log | grep ha_
```

---

## Section 5: First-48h observation (shadow mode)

The supervisor starts with `HA_DIAG_SHADOW_MODE=true` (default). During this
window, `ha_websocket_dead` events are downgraded to `alert_only` actions
tagged `[SHADOW MODE]` rather than triggering an automatic restart.

Watch for these signals in Telegram:

- `[SHADOW MODE] would have triggered container_restart for homeassistant` —
  confirms the detection path works end-to-end
- `ha_entity_unavailable_long` / `ha_integration_failed` / etc. — these are
  always `alert_only` regardless of shadow mode; verify descriptions look
  accurate and thresholds are reasonable

Things to evaluate:

| Question | Good sign |
|----------|-----------|
| Are shadow alerts firing at reasonable frequency? | ≤ 1 per 30 min per node |
| Are there false positives? | No alerts during known-good uptime |
| Are entity-unavailable alerts describing real entities? | Yes, names match HA UI |
| Are integration-failed alerts genuine? | Yes, not noise from startup |

Note any false positives or noisy thresholds before enabling production mode.

---

## Section 6: Enabling production mode

`HA_DIAG_SHADOW_MODE` is an environment variable read by the supervisor
container. The VPS supervisor env vars live in the version-controlled
override file at `hosts/vps/runtime/control-plane/docker-compose.override.yml`
(not in a runtime `.env` file — the supervisor has no `env_file:` directive).

When the 48h observation period looks clean:

**1. Edit the override file on SATURN:**

```yaml
# hosts/vps/runtime/control-plane/docker-compose.override.yml
services:
  supervisor:
    environment:
      - NODE_ALIAS_MAP={"node-2":"chelsty"}
      - HA_DIAG_SHADOW_MODE=false      # add this line
```

**2. Commit and push from SATURN:**

```bash
git add hosts/vps/runtime/control-plane/docker-compose.override.yml
git commit -m "feat(control-plane): disable HA shadow mode — production ready"
git push
```

**3. Apply on VPS:**

```bash
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor
```

**4. Confirm:**

```bash
docker logs control-plane-supervisor --tail 5
# Expect: shadow_mode=False — HA container_restart actions enabled
```

From this point, the next `ha_websocket_dead` event will generate a
`container_restart` action in the approval queue. The 30-minute cooldown
still applies after each restart.

---

## Section 7: Rollback

If production mode causes unexpected behaviour:

```bash
# Option A — re-enable shadow mode
# On SATURN: edit hosts/vps/runtime/control-plane/docker-compose.override.yml
# Set HA_DIAG_SHADOW_MODE=true (or remove the line — default is true)
# Commit, push, then on VPS:
ssh oskar@100.95.58.48
cd ~/homelab-codex-ws && git pull
docker compose \
  -f services/control-plane/docker-compose.yml \
  -f hosts/vps/runtime/control-plane/docker-compose.override.yml \
  up -d supervisor

# Option B — stop ha-diag-agent entirely on affected nodes
ssh oskar@<node>
docker stop ha-diag-agent

# Events written before rollback remain in /opt/homelab/events/<node>/
# and are historical only — no automated action will be taken on them
# unless the supervisor re-processes them, which it won't (already in
# _ha_processed_event_ids).
```

Any `container_restart` actions still in `pending/` after rollback can be
manually rejected via the Telegram bot or by deleting the action files from
`/opt/homelab/actions/pending/` on the VPS.