Compare commits
2 commits
cfe5e02372...f381023206
| Author | SHA1 | Date | |
|---|---|---|---|
| | f381023206 | | |
| | cb4ae756ab | | |
50
CLAUDE.md
@@ -106,6 +106,48 @@ When exploring the system, use these files in order:
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service

## VPS-Specific Rules

VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.

### Memory limit convention

Use top-level Compose properties (not `deploy.resources.limits`, which legacy `docker-compose` v1 ignores outside Swarm mode unless run with `--compatibility`):

```yaml
services:
  myservice:
    mem_limit: 256m        # hard cgroup memory ceiling; the container is OOM-killed on breach
    oom_score_adj: -900    # makes the host kernel OOM killer very unlikely to pick this container
```

Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` so that they are among the last candidates for the host OOM killer.
- `mem_limit` still applies even with `oom_score_adj: -900`: the cgroup limit is enforced independently of the host OOM killer. When a container exceeds its limit, the kernel kills a process inside that cgroup, and Docker restarts the container only if it has a restart policy (for example `restart: unless-stopped`).
- Budget: OS+Docker reserves ~800 MiB; the sum of all `mem_limit` values must stay ≤ 3200 MiB (about 3.1 GiB).
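
The budget rule above lends itself to a mechanical check. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and it assumes the `mem_limit` values have already been collected from the override files into a plain dictionary.

```python
import re

# Budget taken from the rule above: the sum of all mem_limit values must stay <= 3200 MiB.
BUDGET_MIB = 3200

# Conversion factors to MiB for the size suffixes b, k, m, g (a bare number means bytes).
_UNIT_TO_MIB = {"b": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}


def parse_mem_limit_mib(value: str) -> float:
    """Convert a Compose-style size string such as '256m' or '1g' to MiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([bkmg]?)b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised mem_limit: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_MIB[unit or "b"]


def check_budget(limits: dict[str, str], budget_mib: float = BUDGET_MIB) -> tuple[float, bool]:
    """Return (total MiB, whether the total fits within the budget)."""
    total = sum(parse_mem_limit_mib(v) for v in limits.values())
    return total, total <= budget_mib


if __name__ == "__main__":
    # Hypothetical limits, for illustration only; real values live in the override files.
    example = {"npm": "256m", "outline": "1g", "joplin": "512m"}
    total, ok = check_budget(example)
    print(f"total={total:.0f} MiB, within budget: {ok}")
```

Running a check like this before merging a new override would surface an over-committed budget early, but it only sees the limits it is given; containers without a `mem_limit` are invisible to it.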

### Repo-managed services on VPS

All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (`mem_limit`, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.

| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |

**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.

**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
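
Step 5 of the checklist can also be scripted so that a missing volume fails loudly instead of being overlooked in `grep` output. This is a rough sketch under assumptions: the helper names are hypothetical, and the expected volume names are copied from the services table above rather than read from the Compose files.

```python
import subprocess

# Named volumes per project, copied from the services table above.
EXPECTED_VOLUMES = {
    "outline": ["outline_outline_storage", "outline_postgres_data", "outline_redis_data"],
    "joplin": ["joplin_postgres_data"],
}


def missing_volumes(project: str, existing: set[str]) -> list[str]:
    """Return the expected volumes for a project that are absent from `existing`."""
    return [name for name in EXPECTED_VOLUMES.get(project, []) if name not in existing]


def list_docker_volumes() -> set[str]:
    """Ask the local Docker daemon for the names of all volumes."""
    result = subprocess.run(
        ["docker", "volume", "ls", "--format", "{{.Name}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}
```

On the VPS, `missing_volumes("outline", list_docker_volumes())` should return an empty list before `docker compose up` is run; anything else suggests the stack would start against fresh, empty volumes.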

**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.

## CHELSTY-Specific Rules
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
@@ -124,6 +166,14 @@ When exploring the system, use these files in order:
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed

## Definition of Done (services)

Before any new or changed service is considered ready:

1. **docker build + smoke run**: build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest**: run the service's test suite. If no tests exist yet, add a minimal one (at minimum, the import succeeds and the core logic has at least one test case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.

## Naming Conventions
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
3
services/brain-watchdog/pytest.ini
Normal file
@@ -0,0 +1,3 @@
[pytest]
pythonpath = src
testpaths = tests
0
services/brain-watchdog/tests/__init__.py
Normal file
66
services/brain-watchdog/tests/test_main.py
Normal file
@@ -0,0 +1,66 @@
"""
Tests for brain_watchdog.main.

Module-level env vars are required at import time; set them before the first
import of the module so tests can run without a real control-plane.
"""
import importlib.util
import os
import time
from unittest.mock import patch

os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
os.environ.setdefault("TG_TOKEN", "test_token")
os.environ.setdefault("TG_CHAT_ID", "12345")

import brain_watchdog.main as bwm


def test_package_importable():
    spec = importlib.util.find_spec("brain_watchdog")
    assert spec is not None


def test_check_ok_fresh():
    now = time.time()
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
        ok, reason = bwm.check()
    assert ok
    assert "ok" in reason


def test_check_fail_stale():
    now = time.time()
    stale_ts = now - (bwm.STALE_THRESHOLD + 120)
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
        ok, reason = bwm.check()
    assert not ok
    assert "stale" in reason


def test_check_fail_unreachable():
    with patch.object(bwm, "http_get", return_value=(None, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "unreachable" in reason


def test_check_fail_http_error():
    with patch.object(bwm, "http_get", return_value=(503, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "503" in reason


def test_check_fail_missing_last_update():
    with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
        ok, reason = bwm.check()
    assert not ok
    assert "last_update" in reason


def test_check_fail_unparseable_timestamp():
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
        ok, reason = bwm.check()
    assert not ok
    assert "parseable" in reason
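
The tests above pin down the contract of `check()` without showing it; `brain_watchdog/main.py` is not part of this diff. As a reading aid, here is a hypothetical sketch of a function that would satisfy the same six cases. The threshold value, the `/status` path and the injected `http_get` parameter are assumptions made for illustration; the real module evidently exposes `http_get` and `STALE_THRESHOLD` at module level, since the tests patch and read them there.

```python
import time
from typing import Any, Callable, Optional, Tuple

# Assumption: the real threshold is defined in brain_watchdog.main and is not shown in the diff.
STALE_THRESHOLD = 300  # seconds

# http_get returns (status_code, parsed_json), or (None, None) when the request fails.
HttpGet = Callable[[str], Tuple[Optional[int], Optional[Any]]]


def check(
    http_get: HttpGet,
    url: str = "http://test-cp:8080/status",  # hypothetical endpoint path
    now: Optional[float] = None,
) -> Tuple[bool, str]:
    """Return (ok, reason) describing whether the control-plane looks alive and fresh."""
    now = time.time() if now is None else now
    status, body = http_get(url)
    if status is None:
        return False, "control-plane unreachable"
    if status != 200:
        return False, f"unexpected HTTP status {status}"
    if not isinstance(body, dict) or "last_update" not in body:
        return False, "response has no last_update field"
    try:
        last_update = float(body["last_update"])
    except (TypeError, ValueError):
        return False, "last_update is not a parseable timestamp"
    age = now - last_update
    if age > STALE_THRESHOLD:
        return False, f"stale: last update {age:.0f}s ago"
    return True, f"ok: last update {age:.0f}s ago"
```

Passing `http_get` as an argument keeps the sketch self-contained; the tests in this diff reach the same effect by replacing the module attribute with `patch.object`.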