maturity: partial # PROVEN: 00-access, 20-base, 30-node-agent; WRITTEN: 40-register, 50-verify (live pending). Update after each step lands on a real node.
| `40-register` | `steps/40-register.sh` | WRITTEN | Dopisuje node do `inventory/topology.yaml` + tworzy `hosts/<node>/services.yaml`, commit na branchu (bez push) |
first_contact: pi@192.168.31.19 # LAN IP — NEVER .local (mDNS unreliable in automation)
tailscale:
hostname: lustro # mesh name; switch to this after tailscale up
ip: # fill after join
deploy_autonomy: true # false → print manual instructions and stop
git_control: false # false → push-based from SATURN (edge nodes)
hardware:
arch: arm64 # filled by 00-preflight
ram_mb: 4096 # filled by 00-preflight
swap:
kind: zram # zram | file | none
docker_present: true # filled by 00-preflight
mm_runtime: systemd:magicmirror.service # filled by 00-preflight; none if absent
services:
node-agent:
runtime:
engine: docker
mem_limit: 256m # mandatory on RAM-constrained hosts (≤4 GB)
```
preflight fills `arch`, `ram_mb`, `docker_present`, `mm_runtime` — do NOT guess these.
Full schema: `scripts/onboard/README.md`.
---
## Operational rules (PROVEN)
**PLAN-FIRST** — before any mutation, show exactly what will touch the remote host.
Always run `--dry-run` first; dry-run must print real commands (`run()` propagation).
**Idempotency** — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
**Isolation** — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
**Worktree discipline** — onboarding is a feature. Work in a task worktree (`agent.sh new`), never in the main checkout (`~/homelab-codex-ws` is deploy-only). See [[worktree-aware]].
---
## Gotchas (battle-tested)
| Problem | Fix |
|---------|-----|
| mDNS `.local` resolve fail | Always use LAN IP in `first_contact`; `.local` OK interactively, not in automation |
| uid=1000 collision on RPi OS | If `pi` already holds uid=1000 → USE that user, don't create `oskar`. node-agent `1000:1000` matches out-of-box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify `sudo -n true` exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to `10-bootstrap-runtime` |
| RAM ≤4 GB with heavy app | `mem_limit` on node-agent is mandatory — same OOM profile as VPS |
| Docker already installed | Check `docker_present` from preflight; skip install step if true |
| SSH known-hosts warning in parsed output | Pass `-o LogLevel=ERROR` to SSH for new mesh hosts |
| `yaml_get` drops value prefix after `:` | Non-greedy colon: `s/^[[:space:]]*[^:]*:[[:space:]]*//'` — handles `systemd:unit` correctly |
| `yaml_get` keeps inline YAML comments | Strip with `s/[[:space:]]\+#.*$//` after extraction (requires ≥1 space before `#`) |
| dry-run stops at orchestrator level | `run()` wrapper + `export DRY_RUN=1` propagated to all step scripts; probes execute for real |
| rsync push Permission denied to VPS events/ | ssh-user must be in the **group that owns `/opt/homelab/events/`** (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, 292k files backlog, panel stale. Fix: `usermod -aG 1000 <user>` on VPS + re-login |
| observer not seeing new node after topology.yaml edit | `_load_inventory()` runs once at `__init__`. After `git pull` on VPS (bind-mount is live), **`docker restart control-plane-observer`** is required — no redeploy needed |
| worktree on wrong branch | Always check `git branch --show-current` on entry. One task = one worktree (`agent.sh new`). Never manually `git checkout` between task branches in the same worktree |
`run()` contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, `command -v`, status queries) always execute so the plan is realistic.
---
## Definition of Done
A node is fully onboarded when:
1.`50-verify` exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
2.`hosts/<node>/node.yaml` committed with all preflight fields filled.
3.`hosts/<node>/capabilities.yaml` present and accurate.