Compare commits
2 commits
cfe5e02372
...
f381023206
| Author | SHA1 | Date | |
|---|---|---|---|
| | f381023206 | | |
| | cb4ae756ab | | |
50
CLAUDE.md
@@ -106,6 +106,48 @@ When exploring the system, use these files in order:
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service

## VPS-Specific Rules

VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.

### Memory limit convention

Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):

```yaml
services:
  myservice:
    mem_limit: 256m        # cgroup ceiling; container is OOM-killed on breach (restarted only if a restart policy is set)
    oom_score_adj: -900    # host kernel OOM killer is very unlikely to pick this container
```

Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and kills the container when the limit is exceeded, after which Docker restarts it if a restart policy is set.
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
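
The budget rule above can be checked mechanically. The following is a minimal sketch, assuming the `mem_limit` strings have already been collected from the override files into a plain dict. The names `parse_mem_limit` and `check_budget` and the sample values are hypothetical and not part of this repo.

```python
import re

BUDGET_MIB = 3200  # ceiling from the rule above (4 GiB minus ~800 MiB for OS + Docker)

# Size suffixes mapped to MiB. Assumption: only b/k/m/g suffixes (or bare bytes) appear.
_UNIT_TO_MIB = {
    "": 1 / (1024 * 1024),
    "b": 1 / (1024 * 1024),
    "k": 1 / 1024,
    "m": 1,
    "g": 1024,
}


def parse_mem_limit(value: str) -> float:
    """Convert a Compose mem_limit string such as '256m' or '1g' to MiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([bkmg]?)b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised mem_limit: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_MIB[unit]


def check_budget(limits: dict[str, str], budget_mib: float = BUDGET_MIB) -> tuple[float, bool]:
    """Return (total MiB, True if the total stays within the budget)."""
    total = sum(parse_mem_limit(v) for v in limits.values())
    return total, total <= budget_mib


if __name__ == "__main__":
    # Hypothetical limits, for illustration only.
    sample = {"npm": "256m", "outline": "1g", "joplin": "512m"}
    total, ok = check_budget(sample)
    print(f"total={total:.0f} MiB, within budget: {ok}")
```

Keeping the parser separate from the sum makes it easy to feed in values read from the real override files later, for example from a CI step.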

### Repo-managed services on VPS

All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (`mem_limit`, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.

| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |

**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.

**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
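
Steps 2 and 5 of the checklist above lend themselves to a small pre-flight check. The sketch below is illustrative only: the helpers `env_file_ready` and `missing_volumes` are hypothetical, and the volume listing is assumed to come from `docker volume ls -q` (one name per line), passed in as text so the logic can be tested without Docker.

```python
from pathlib import Path


def env_file_ready(config_root: Path, service: str) -> bool:
    """True if <config_root>/<service>/.env exists and is not empty."""
    env_path = config_root / service / ".env"
    return env_path.is_file() and env_path.stat().st_size > 0


def missing_volumes(expected: list[str], volume_ls_output: str) -> list[str]:
    """Return the expected volume names that are absent from the listing."""
    present = set(volume_ls_output.split())
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Hypothetical listing, standing in for the output of `docker volume ls -q`.
    listing = "outline_outline_storage\noutline_postgres_data\n"
    expected = ["outline_outline_storage", "outline_postgres_data", "outline_redis_data"]
    print("missing volumes:", missing_volumes(expected, listing))
    print(".env ready:", env_file_ready(Path("/opt/homelab/config"), "outline"))
```

A non-empty result from `missing_volumes` would be a reason to stop before `docker compose up`, since Compose would otherwise create fresh, empty volumes.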

**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.

## CHELSTY-Specific Rules

- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.

@@ -124,6 +166,14 @@ When exploring the system, use these files in order:
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed

## Definition of Done (services)

Before any new or changed service is considered ready:

1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.

## Naming Conventions

- Hosts: ALL CAPS (`SATURN`, `PIHA`)

3
services/brain-watchdog/pytest.ini
Normal file
@@ -0,0 +1,3 @@
[pytest]
pythonpath = src
testpaths = tests

0
services/brain-watchdog/tests/__init__.py
Normal file
66
services/brain-watchdog/tests/test_main.py
Normal file
@@ -0,0 +1,66 @@
"""
Tests for brain_watchdog.main.

Module-level env vars are required at import time; set them before the first
import of the module so tests can run without a real control-plane.
"""
import importlib.util
import os
import time
from unittest.mock import patch

os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
os.environ.setdefault("TG_TOKEN", "test_token")
os.environ.setdefault("TG_CHAT_ID", "12345")

import brain_watchdog.main as bwm


def test_package_importable():
    spec = importlib.util.find_spec("brain_watchdog")
    assert spec is not None


def test_check_ok_fresh():
    now = time.time()
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
        ok, reason = bwm.check()
    assert ok
    assert "ok" in reason


def test_check_fail_stale():
    now = time.time()
    stale_ts = now - (bwm.STALE_THRESHOLD + 120)
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
        ok, reason = bwm.check()
    assert not ok
    assert "stale" in reason


def test_check_fail_unreachable():
    with patch.object(bwm, "http_get", return_value=(None, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "unreachable" in reason


def test_check_fail_http_error():
    with patch.object(bwm, "http_get", return_value=(503, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "503" in reason


def test_check_fail_missing_last_update():
    with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
        ok, reason = bwm.check()
    assert not ok
    assert "last_update" in reason


def test_check_fail_unparseable_timestamp():
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
        ok, reason = bwm.check()
    assert not ok
    assert "parseable" in reason
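
The tests above pin down a contract for `check()` without showing its body, which is not part of this diff. As a reading aid only, here is a hypothetical sketch of a function that would satisfy the same contract. It is not the actual `brain_watchdog.main` code: `http_get` is passed in as a parameter instead of being a module-level function, the `now` parameter is added for determinism, and the `STALE_THRESHOLD` value is an assumption.

```python
import time
from typing import Callable, Optional, Tuple

STALE_THRESHOLD = 300  # seconds; assumed value, the real one is not shown in the diff

HttpResult = Tuple[Optional[int], Optional[dict]]


def check(http_get: Callable[[], HttpResult], now: Optional[float] = None) -> Tuple[bool, str]:
    """Return (ok, reason) describing the freshness of the control-plane state."""
    status, body = http_get()
    if status is None:
        # No response at all (connection error or timeout in the caller).
        return False, "control-plane unreachable"
    if status != 200:
        return False, f"HTTP {status}"
    if not isinstance(body, dict) or "last_update" not in body:
        return False, "response missing last_update"
    try:
        last_update = float(body["last_update"])
    except (TypeError, ValueError):
        return False, "last_update is not parseable"
    age = (time.time() if now is None else now) - last_update
    if age > STALE_THRESHOLD:
        return False, f"stale: last update {age:.0f}s ago"
    return True, f"ok: last update {age:.0f}s ago"


if __name__ == "__main__":
    # Simulated fresh response, 10 seconds old.
    print(check(lambda: (200, {"last_update": time.time() - 10})))
```

Each early return corresponds to one failing test case, so a new failure mode would call for both a new branch and a new test.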