# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Repo Is
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
## Node Roles
| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
## Deployment
```bash
./scripts/deploy/deploy.sh                        # fresh deploy on current node
./scripts/deploy/deploy.sh --resume               # resume after interruption
./scripts/deploy/deploy.sh --stage verify         # specific stage only
./scripts/deploy/deploy.sh --service mosquitto    # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh    # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra     # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh               # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh            # CHELSTY-specific bootstrap
```
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/`.
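
How `--resume` might pick the stage to continue from can be illustrated with a minimal Python sketch. It assumes each finished stage leaves a marker file named `<stage>.done` in the state directory; that naming and the `next_stage` helper are hypothetical and not taken from `deploy.sh`.

```python
from pathlib import Path
from typing import Optional

# Main-line stages in pipeline order. "diagnose" runs only on failure,
# so it is left out of the resume sequence.
STAGES = ["prepare", "validate", "deploy", "verify", "complete"]


def next_stage(state_dir: Path) -> Optional[str]:
    """Return the first stage that has no marker file, or None if all are done."""
    for stage in STAGES:
        # Hypothetical marker layout: <state_dir>/<stage>.done
        if not (state_dir / f"{stage}.done").exists():
            return stage
    return None
```

Calling `next_stage(Path("/opt/homelab/state/deploy"))` on a node would return the stage a resumed run should start from, provided the marker assumption holds.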
## Service Structure
Every service must follow this layout:
```
services/<service>/
├── docker-compose.yml
├── service.yaml # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example # Template — never commit actual secrets
└── healthcheck.sh # Returns 0 (healthy) or 1 (unhealthy)
```
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
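
A quick completeness check for the contract fields listed above could look like the following sketch. It works on an already-parsed mapping (for example, the result of loading `service.yaml` with a YAML library); the `missing_fields` helper is hypothetical and not part of the repo.

```python
from typing import List

# Field names taken from the contract description; dots mark nested keys.
REQUIRED_FIELDS = [
    "owner_node",
    "exposure",
    "dependencies",
    "healthcheck",
    "restart_policy",
    "persistence.paths",
    "runtime.env_vars",
]


def missing_fields(contract: dict) -> List[str]:
    """Return the required fields that are absent from a parsed service.yaml."""
    missing = []
    for dotted in REQUIRED_FIELDS:
        node = contract
        for key in dotted.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(dotted)
                break
            node = node[key]
    return missing
```

An empty result means every field is present; the check says nothing about whether the values themselves are valid.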
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
## Agent System Architecture
The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
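
The Supervisor's drift check can be pictured as a comparison between desired and observed services. The sketch below is an illustration under assumptions: the `running` status value, the shape of the action dict, and the `detect_drift` name are hypothetical, and the real Supervisor reads `hosts/*/services.yaml` and the Observer's JSON output rather than in-memory values.

```python
from typing import Dict, Iterable, List


def detect_drift(desired: Iterable[str], actual: Dict[str, str]) -> List[dict]:
    """Propose a pending action for every desired service that is not running."""
    actions = []
    for service in sorted(desired):
        status = actual.get(service, "missing")
        if status != "running":  # assumed status value
            actions.append({
                "service": service,
                "observed": status,
                "type": "redeploy",   # action type borrowed from the routing table
                "status": "pending",  # nothing runs until an operator approves
            })
    return actions
```

The function only proposes actions; consistent with the human-in-the-loop model, it never executes anything itself.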
### Action approval flow
```
Agent → /opt/homelab/actions/pending/<id>.json
→ Telegram notification → Operator approves
→ /opt/homelab/actions/approved/<id>.json
→ Executor runs → completed / failed
```
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
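
The gate described above can be sketched in a few lines of Python. This is a minimal illustration, assuming approval is represented by moving `<id>.json` from `pending/` to `approved/`; the `approve` and `may_execute` helpers are hypothetical and not the Executor's actual code.

```python
from pathlib import Path


def approve(actions_dir: Path, action_id: str) -> Path:
    """Move an action file from pending/ to approved/ (operator decision)."""
    src = actions_dir / "pending" / f"{action_id}.json"
    dst = actions_dir / "approved" / f"{action_id}.json"
    dst.parent.mkdir(parents=True, exist_ok=True)
    src.rename(dst)  # a rename within one filesystem is atomic on POSIX
    return dst


def may_execute(actions_dir: Path, action_id: str) -> bool:
    """An action may run only if its approved file exists."""
    return (actions_dir / "approved" / f"{action_id}.json").is_file()
```

An agent would call `may_execute(Path("/opt/homelab/actions"), action_id)` before any restart, deploy, or config change and do nothing when it returns `False`.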
## Event System
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
Normalized event types: `deployment_started`, `deployment_completed`, `deployment_failed`, `service_unhealthy`, `service_recovered`, `node_offline`, `node_online`, `healthcheck_failed`, `remediation_started`, `remediation_completed`.
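
To make the on-disk format concrete, here is a minimal sketch of appending one event as a JSON line under the path layout above. It does not reproduce the interface of `scripts/lib/events.py`; the `append_event` name and the record fields (`ts`, `node`, `type`, `payload`) are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_event(events_root: Path, node: str, event_type: str,
                 payload: dict, now: Optional[datetime] = None) -> Path:
    """Append one JSON line to <events_root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    path = events_root / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical record shape; the real schema lives in the events library.
    record = {"ts": now.isoformat(), "node": node, "type": event_type, "payload": payload}
    with path.open("a", encoding="utf-8") as fh:  # append-only: never rewrite old lines
        fh.write(json.dumps(record) + "\n")
    return path
```

In real scripts, prefer the repo's own `events.sh` or `events.py` so that event records stay consistent across nodes.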
### Supervisor event routing table
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |

The supervisor routes HA events directly from the events directory (not via the world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes, because a planned restart or update is then likely in progress.
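
The 5-minute suppression rule and the 1-hour `alert_only` cooldown reduce to simple timestamp comparisons. The following sketch shows one way to express them; the function names are hypothetical, and treating the 5-minute boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

SUPPRESSION_WINDOW = timedelta(minutes=5)  # containers_not_running look-back
ALERT_COOLDOWN = timedelta(hours=1)        # cooldown of the alert_only rows


def is_suppressed(event_time: datetime, last_not_running: Optional[datetime]) -> bool:
    """True if homeassistant had a containers_not_running incident in the window."""
    if last_not_running is None:
        return False
    age = event_time - last_not_running
    return timedelta(0) <= age <= SUPPRESSION_WINDOW


def in_cooldown(event_time: datetime, last_action: Optional[datetime],
                cooldown: timedelta = ALERT_COOLDOWN) -> bool:
    """True if a matching action was generated less than `cooldown` ago."""
    if last_action is None:
        return False
    return event_time - last_action < cooldown
```

An HA event would produce an action only when both checks return `False`.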
## Discovery Entry Points for Agents
When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service
## VPS-Specific Rules
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
### Memory limit convention
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
```yaml
services:
  myservice:
    mem_limit: 256m       # cgroup ceiling; the container is OOM-killed on breach and restarted per its restart policy
    oom_score_adj: -900   # makes the host kernel OOM killer very unlikely to pick this container
```
Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
- `mem_limit` still applies even with `oom_score_adj: -900`. The cgroup OOM killer is independent of the host OOM killer: a container that exceeds its limit is killed, and Docker then restarts it according to its restart policy.
- Budget: OS+Docker reserves ~800 MiB; the sum of all `mem_limit` values must stay ≤ 3200 MiB (≈ 3.1 GiB).
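
The budget rule is plain arithmetic and easy to check before a deploy. The sketch below sums `mem_limit` strings and compares the total with the 3200 MiB ceiling. The helper names are hypothetical, and only the `k`, `m`, and `g` suffixes are handled; other forms Compose accepts (such as `b` or `mb`) are rejected.

```python
BUDGET_MIB = 3200  # ceiling for the sum of all mem_limit values on VPS

# Multipliers to MiB for the suffixes this sketch understands.
_TO_MIB = {"k": 1 / 1024, "m": 1, "g": 1024}


def to_mib(limit: str) -> float:
    """Convert a size such as '256m' or '1g' to MiB."""
    number, suffix = limit[:-1], limit[-1].lower()
    if suffix not in _TO_MIB:
        raise ValueError(f"unsupported mem_limit value: {limit!r}")
    return float(number) * _TO_MIB[suffix]


def total_mib(limits: dict) -> float:
    """Sum the limits of a {service: mem_limit} mapping in MiB."""
    return sum(to_mib(value) for value in limits.values())


def within_budget(limits: dict) -> bool:
    return total_mib(limits) <= BUDGET_MIB
```

For example, hypothetical limits of `256m`, `1g`, and `512m` total 1792 MiB and fit the budget, while `3g` plus `256m` (3328 MiB) does not.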
### Repo-managed services on VPS
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |

**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.
## CHELSTY-Specific Rules
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
## Runtime Path Conventions
`/opt/homelab/` layout on each node:
- `data/<service>/` — persistent volumes
- `config/<service>/` — secrets and host-local overrides (not in Git)
- `logs/<service>/` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed
## Definition of Done (services)
Before any new or changed service is considered ready:
1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
## Naming Conventions
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always set `restart: unless-stopped` unless `service.yaml` says otherwise
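
These conventions are simple enough to lint automatically. The following sketch checks host, service, and container names. The exact character sets (letters and digits in hyphen-separated groups) are an assumption inferred from names such as `CHELSTY-INFRA` and `zigbee2mqtt`, and the helper names are hypothetical.

```python
import re

HOST_RE = re.compile(r"[A-Z0-9]+(?:-[A-Z0-9]+)*")     # e.g. SATURN, CHELSTY-INFRA
SERVICE_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")  # e.g. stability-agent


def is_valid_host(name: str) -> bool:
    """Hosts are written in ALL CAPS, optionally hyphen-separated."""
    return HOST_RE.fullmatch(name) is not None


def is_valid_service(name: str) -> bool:
    """Services use kebab-case."""
    return SERVICE_RE.fullmatch(name) is not None


def container_name_matches(service: str, container_name: str) -> bool:
    """Container names must match service names exactly."""
    return service == container_name
```

A check like this could run in the validate stage or in a pre-commit hook.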