From f3810232061959d639a1a604323a6266f858aaa5 Mon Sep 17 00:00:00 2001
From: Oskar Kapala
Date: Mon, 1 Jun 2026 20:38:39 +0200
Subject: [PATCH] docs(claude): add Definition of Done for services (smoke test + pytest)

Lesson from brain-watchdog: code that was never run had a packaging bug that caused a crash loop in production.
New rule: docker build + short smoke-run + pytest before any commit or deploy.

Co-Authored-By: Claude Sonnet 4.6
---
 CLAUDE.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index ae31a58..50c8825 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -106,6 +106,48 @@ When exploring the system, use these files in order:
 3. `hosts//services.yaml` — desired services and exposure classes for that host
 4. `services//service.yaml` — operational contract for a service
 
+## VPS-Specific Rules
+
+VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime//docker-compose.override.yml`.
+
+### Memory limit convention
+
+Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
+
+```yaml
+services:
+  myservice:
+    mem_limit: 256m        # cgroup ceiling; container is OOM-killed on breach
+    oom_score_adj: -900    # host kernel OOM-killer is very unlikely to pick this container
+```
+
+Rules:
+- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
+- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and kills the container when the limit is exceeded; Docker then restarts it if a restart policy is set.
+- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
+
+### Repo-managed services on VPS
+
+All VPS services are now GitOps-managed. Service definitions live in `services//docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime//docker-compose.override.yml`.
+
+| Service | Compose stack | Data path |
+|---|---|---|
+| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
+| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
+| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
+| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
+
+**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
+
+**Cutover checklist** (before running `docker compose up` for any migrated service):
+1. `git pull` on VPS
+2. Populate `/opt/homelab/config//.env` from the `env.example` template
+3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
+4. For mosquitto: config stays at old bind path until explicitly migrated
+5. Verify named volumes exist: `docker volume ls | grep `
+
+**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
+
 ## CHELSTY-Specific Rules
 
 - Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
@@ -124,6 +166,14 @@ When exploring the system, use these files in order:
 - `world/` — Observer output (synthesized state)
 - `actions/` — pending / approved / running / completed / failed
 
+## Definition of Done (services)
+
+Before any new or changed service is considered ready:
+
+1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
+2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services//tests/`.
+3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
+
 ## Naming Conventions
 
 - Hosts: ALL CAPS (`SATURN`, `PIHA`)