Compare commits

2 commits

Author SHA1 Message Date
Oskar Kapala f381023206 docs(claude): add Definition of Done for services (smoke test + pytest)
Lesson from brain-watchdog: code that was never run had a packaging bug
that caused a crash loop in production. New rule: docker build + short
smoke-run + pytest before any commit or deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:38:39 +02:00
Oskar Kapala cb4ae756ab test(brain-watchdog): add pytest suite covering import and check() logic
7 cases: package importable, fresh ok, stale, unreachable, HTTP error,
missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src
so tests run without PYTHONPATH set in the environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:38:24 +02:00
4 changed files with 119 additions and 0 deletions

@ -106,6 +106,48 @@ When exploring the system, use these files in order:
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service
## VPS-Specific Rules
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
### Memory limit convention
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
```yaml
services:
  myservice:
    mem_limit: 256m        # cgroup ceiling; the container is OOM-killed on breach (restarted only if a restart policy is set)
    oom_score_adj: -900    # makes the host kernel OOM killer very unlikely to pick this container
```
Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and kills the container when the limit is exceeded. Docker then restarts it only if the service has a restart policy.
- Budget: the OS and Docker reserve ~800 MiB; the sum of all `mem_limit` values must stay ≤ 3200 MiB (≈3.1 GiB).
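The budget rule can be checked mechanically. The following is a minimal sketch, not part of the repo: it assumes the `mem_limit` values have already been collected into a dict, and the service names and limits in the usage comment are hypothetical. Reading the override files is left out to avoid a YAML dependency.

```python
import re

BUDGET_MIB = 3200  # ceiling stated in the budget rule

# Multipliers that convert a Compose byte-value unit to MiB.
_UNIT_TO_MIB = {"b": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}


def parse_mem_limit(value: str) -> float:
    """Convert a Compose byte value such as '256m' or '1gb' to MiB."""
    match = re.fullmatch(r"(\d+)\s*([bkmg])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised mem_limit: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MIB[unit]


def check_budget(limits: dict[str, str], budget_mib: float = BUDGET_MIB) -> tuple[bool, float]:
    """Return (within_budget, total_mib) for a mapping of service -> mem_limit."""
    total = float(sum(parse_mem_limit(v) for v in limits.values()))
    return total <= budget_mib, total


# Usage (hypothetical services and limits):
#   check_budget({"npm": "256m", "outline": "1g"})
```

A check like this could run in CI or before a deploy, so that a new service cannot silently push the total past the ceiling.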
### Repo-managed services on VPS
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.
## CHELSTY-Specific Rules
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
@ -124,6 +166,14 @@ When exploring the system, use these files in order:
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed
## Definition of Done (services)
Before any new or changed service is considered ready:
1. **docker build + smoke run**: build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging and import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest**: run the service's test suite. If no tests exist yet, add a minimal suite that checks the package imports and covers core logic with at least one case. Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
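One way to automate the smoke run in step 1 is sketched below. It is an illustration, not the repo's tooling: the helper name `smoke_run`, the five-second window, and the image tag in the usage comment are all assumptions. The idea is that a long-running service which exits on its own within the window has crashed.

```python
import subprocess


def smoke_run(cmd: list[str], seconds: float = 5.0) -> bool:
    """Start cmd and return True if it is still running after `seconds`."""
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # Still alive after the window: treat as a passing smoke run, then stop it.
        proc.terminate()
        try:
            proc.wait(timeout=10)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        return True
    # The process exited by itself inside the window, which for a
    # long-running service means it crashed (or never entered its main loop).
    return False


# Usage (hypothetical image tag; build it first with `docker build`):
#   smoke_run(["docker", "run", "--rm", "brain-watchdog:smoke"], seconds=5.0)
```

This only shows that the process survives startup. It does not replace the pytest step, which exercises the logic itself.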
## Naming Conventions
- Hosts: ALL CAPS (`SATURN`, `PIHA`)

@ -0,0 +1,3 @@
[pytest]
pythonpath = src
testpaths = tests

@ -0,0 +1,66 @@
"""
Tests for brain_watchdog.main.

Module-level env vars are required at import time; set them before the first
import of the module so tests can run without a real control-plane.
"""
import importlib.util
import os
import time
from unittest.mock import patch

# Must be set before brain_watchdog.main is imported (see module docstring).
os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
os.environ.setdefault("TG_TOKEN", "test_token")
os.environ.setdefault("TG_CHAT_ID", "12345")

import brain_watchdog.main as bwm


def test_package_importable():
    spec = importlib.util.find_spec("brain_watchdog")
    assert spec is not None


def test_check_ok_fresh():
    now = time.time()
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
        ok, reason = bwm.check()
    assert ok
    assert "ok" in reason


def test_check_fail_stale():
    now = time.time()
    stale_ts = now - (bwm.STALE_THRESHOLD + 120)
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
        ok, reason = bwm.check()
    assert not ok
    assert "stale" in reason


def test_check_fail_unreachable():
    with patch.object(bwm, "http_get", return_value=(None, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "unreachable" in reason


def test_check_fail_http_error():
    with patch.object(bwm, "http_get", return_value=(503, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "503" in reason


def test_check_fail_missing_last_update():
    with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
        ok, reason = bwm.check()
    assert not ok
    assert "last_update" in reason


def test_check_fail_unparseable_timestamp():
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
        ok, reason = bwm.check()
    assert not ok
    assert "parseable" in reason
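
The implementation of `check()` is not part of this diff. For readers of the tests, here is a minimal sketch of logic that would satisfy the seven cases above. It is an assumption, not the actual `brain_watchdog.main`: the `evaluate` helper, the `STALE_THRESHOLD` value, the `/status` path, and the exact reason strings are all hypothetical.

```python
import json
import os
import time
import urllib.error
import urllib.request

CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "http://test-cp:8080")
STALE_THRESHOLD = 300  # seconds; assumed value, the real one is not shown


def http_get(url, timeout=5.0):
    """Return (status, parsed_json); (None, None) if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read().decode("utf-8")
            status = resp.status
    except urllib.error.HTTPError as exc:  # server answered with an error status
        return exc.code, None
    except OSError:  # URLError, timeouts, refused connections
        return None, None
    try:
        return status, json.loads(raw)
    except ValueError:  # body was not valid JSON
        return status, None


def evaluate(status, body, now):
    """Pure decision logic: map an HTTP result to (ok, reason)."""
    if status is None:
        return False, "control-plane unreachable"
    if status != 200:
        return False, f"HTTP {status}"
    if not isinstance(body, dict) or "last_update" not in body:
        return False, "last_update field missing"
    try:
        last_update = float(body["last_update"])
    except (TypeError, ValueError):
        return False, "last_update not parseable"
    age = now - last_update
    if age > STALE_THRESHOLD:
        return False, f"stale: last update {age:.0f}s ago"
    return True, f"ok: last update {age:.0f}s ago"


def check():
    """Fetch the control-plane status (hypothetical path) and evaluate it."""
    status, body = http_get(f"{CONTROL_PLANE_URL}/status")
    return evaluate(status, body, time.time())
```

Keeping the decision logic in a pure function means each failure mode can be tested without patching, while `check()` still looks up `http_get` at call time, so `patch.object(bwm, "http_get", ...)` as used in the tests would continue to work.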