---
name: node-onboarding
description: >
  Use when the user wants to add or onboard a new node to homelab-codex —
  repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration.
  Keywords: "nowy node", "dodaj node", "onboarding", "onboard node".
living_doc: true
maturity: partial  # PROVEN: 00-access, 20-base, 30-node-agent; WRITTEN: 40-register, 50-verify (live pending). Update after each step lands on a real node.
---

> **Living document** — sections marked **SCAFFOLD** are stubs waiting for battle-testing on a real node.
> Promote to **PROVEN** after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.

## Trigger

User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.

---

## Workflow — one step at a time

```
preflight (read-only)
  └─ 00-access        [PROVEN]
       └─ 20-base     [PROVEN]
            └─ 30-node-agent   [PROVEN]
                 └─ 40-register     [WRITTEN — live pending]
                      └─ 50-verify  [WRITTEN — live pending]
```

Never skip ahead. Each step must exit 0 before the next begins.
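
The ordering rule above can be sketched as a small loop. This is a hedged illustration, not the actual body of `onboard.sh`: `run_steps` is a hypothetical helper, and the `scripts/onboard/steps/` path in the usage comment is an assumption.

```bash
# Hypothetical sketch: run steps in order and stop at the first non-zero exit.
# run_steps RUNNER STEP...  (RUNNER is any command that accepts a step name.)
run_steps() {
  local runner="$1"; shift
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! "$runner" "$step"; then
      echo "step ${step} failed; not continuing" >&2
      return 1
    fi
  done
}

# Usage (assumed layout; adjust to the real script location):
#   run_one() { bash "scripts/onboard/steps/$1.sh"; }
#   run_steps run_one 00-access 20-base 30-node-agent 40-register 50-verify
```

Because the loop returns on the first failure, a later step never sees a half-prepared node.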

---

## Invocation

```bash
# Full onboarding (all steps in order)
scripts/onboard/onboard.sh --node <name>

# Single step
scripts/onboard/onboard.sh --node <name> --step 00-access

# Resume from a step
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime

# Dry-run — probes run for real; mutations are printed, not executed
scripts/onboard/onboard.sh --node <name> --dry-run
```

---

## Step status table

| Step | File | Status | What it does |
|------|------|--------|--------------|
| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: arch, RAM, docker, swap, MagicMirror (MM) runtime → YAML snippet for node.yaml |
| `00-access` | `steps/00-access.sh` | **PROVEN** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify over mesh |
| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | SCAFFOLD | Create `/opt/homelab/` layout, `chown <ssh_user>` |
| `20-base` | `steps/20-base.sh` | **PROVEN** | swap→zram, `/opt/homelab/` layout, event dir `/opt/homelab/events/<node>/` |
| `20-install-docker` | `steps/20-install-docker.sh` | SCAFFOLD | Install Docker Engine if `docker_present=false`; skip if already installed |
| `30-node-agent` | `steps/30-node-agent.sh` | **PROVEN** | rsync base compose + override, `docker compose up -d --build`, verify container + events |
| `40-register` | `steps/40-register.sh` | WRITTEN | Appends the node to `inventory/topology.yaml`, creates `hosts/<node>/services.yaml`, and commits on the branch (no push) |
| `50-verify` | `steps/50-verify.sh` | WRITTEN | SSH node: container+events; SSH VPS: restart observer + heartbeat poll + world/nodes.json |

---

## node.yaml — key fields

```yaml
name: LUSTRO                        # ALL CAPS
role: edge                          # edge | compute | infra
ssh_user: pi                        # existing user on the node
first_contact: pi@192.168.31.19     # LAN IP — NEVER .local (mDNS unreliable in automation)
tailscale:
  hostname: lustro                  # mesh name; switch to this after tailscale up
  ip:                               # fill after join
deploy_autonomy: true               # false → print manual instructions and stop
git_control: false                  # false → push-based from SATURN (edge nodes)
hardware:
  arch: arm64                       # filled by 00-preflight
  ram_mb: 4096                      # filled by 00-preflight
  swap:
    kind: zram                      # zram | file | none
  docker_present: true              # filled by 00-preflight
  mm_runtime: systemd:magicmirror.service   # filled by 00-preflight; none if absent
services:
  node-agent:
    runtime:
      engine: docker
      mem_limit: 256m               # mandatory on RAM-constrained hosts (≤4 GB)
```

Preflight fills `arch`, `ram_mb`, `docker_present`, and `mm_runtime`; do NOT guess these.

Full schema: `scripts/onboard/README.md`.
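
As a rough illustration of how read-only probes could produce the preflight fields, here is a minimal sketch. It is not the contents of `00-preflight.sh`: the function names are hypothetical, and the mapping from `uname -m` output to the `arch` values used in node.yaml is an assumption.

```bash
# Hypothetical read-only probes; nothing here modifies the host.

# Map `uname -m` output to an arch name (mapping is an assumption).
normalize_arch() {
  case "$1" in
    aarch64|arm64) echo arm64 ;;
    x86_64|amd64)  echo amd64 ;;
    *)             echo "$1" ;;
  esac
}

# MemTotal in /proc/meminfo is reported in kB; convert to whole MB.
ram_mb_from_meminfo() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Prints true if a docker binary is on PATH, false otherwise.
docker_present() {
  if command -v docker >/dev/null 2>&1; then echo true; else echo false; fi
}

# Print a YAML fragment to paste under `hardware:` in node.yaml.
emit_hardware_yaml() {
  printf 'hardware:\n'
  printf '  arch: %s\n' "$(normalize_arch "$(uname -m)")"
  printf '  ram_mb: %s\n' "$(ram_mb_from_meminfo)"
  printf '  docker_present: %s\n' "$(docker_present)"
}
```

Running `emit_hardware_yaml` on the node (for example over SSH) prints values measured on that host, which is the point of not guessing them.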

---

## Operational rules (PROVEN)

**PLAN-FIRST** — before any mutation, show exactly what will touch the remote host.
Always run `--dry-run` first; the dry run must print the actual commands, which requires `DRY_RUN` to propagate through `run()` into every step script.

**Idempotency** — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
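
One common way to get this property for config-file edits is a check-then-append helper. `lib/common.sh` lists an `ensure_line` helper, but its real signature is not shown here, so the argument order below is an assumption and the body is only a sketch.

```bash
# Hypothetical sketch of an idempotent append: ensure_line LINE FILE
# Adds LINE to FILE only if an identical whole line is not already present.
ensure_line() {
  local line="$1" file="$2"
  # -x: match the whole line, -F: treat LINE as a fixed string, not a regex.
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Calling it twice with the same arguments leaves a single copy of the line, so re-running a step does not duplicate entries.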

**Isolation** — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).

**Worktree discipline** — onboarding is a feature. Work in a task worktree (`agent.sh new`), never in the main checkout (`~/homelab-codex-ws` is deploy-only). See [[worktree-aware]].

---

## Gotchas (battle-tested)

| Problem | Fix |
|---------|-----|
| mDNS `.local` fails to resolve | Always use the LAN IP in `first_contact`; `.local` is fine interactively but not in automation |
| uid=1000 collision on RPi OS | If `pi` already holds uid=1000 → USE that user, don't create `oskar`. node-agent `1000:1000` matches out-of-box; creating a second uid=1000 breaks MM ownership |
| passwordless sudo not guaranteed | Verify `sudo -n true` exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to `10-bootstrap-runtime` |
| RAM ≤4 GB with heavy app | `mem_limit` on node-agent is mandatory — same OOM profile as VPS |
| Docker already installed | Check `docker_present` from preflight; skip install step if true |
| SSH known-hosts warning in parsed output | Pass `-o LogLevel=ERROR` to SSH for new mesh hosts |
| `yaml_get` drops value prefix after `:` | Match up to the first colon only: `s/^[[:space:]]*[^:]*:[[:space:]]*//`; this keeps values such as `systemd:unit` intact |
| `yaml_get` keeps inline YAML comments | Strip with `s/[[:space:]]\+#.*$//` after extraction (requires ≥1 space before `#`) |
| dry-run stops at orchestrator level | `run()` wrapper + `export DRY_RUN=1` propagated to all step scripts; probes execute for real |
| rsync push Permission denied to VPS events/ | ssh-user must be in the **group that owns `/opt/homelab/events/`** (aerbot/1000 on VPS). Symptom: silent WARNING in node-agent log, 292k files backlog, panel stale. Fix: `usermod -aG 1000 <user>` on VPS + re-login |
| node-agent SSH key mount target | Mount the push key under the **container's HOME**: `/home/homelab/.ssh` (uid 1000 `homelab`), **NOT `/root/.ssh`** — ssh in `_ship_events_to_vps()` has no `-i` and only looks in `$HOME/.ssh`; a `/root/.ssh` mount is blind → `Permission denied` (lustro 2026-06-11, fix `a5a1352`). The new node's pubkey must also land in `authorized_keys` of `oskar@VPS` |
| observer not seeing new node after topology.yaml edit | `_load_inventory()` runs once at `__init__`. After `git pull` on VPS (bind-mount is live), **`docker restart control-plane-observer`** is required — no redeploy needed |
| worktree on wrong branch | Always check `git branch --show-current` on entry. One task = one worktree (`agent.sh new`). Never manually `git checkout` between task branches in the same worktree |
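
The two `yaml_get` rows above can be combined into one small function. This is a sketch under assumptions, not the helper from `lib/common.sh`: the `KEY FILE` argument order and the first-match lookup are hypothetical, and the comment pattern uses `[[:space:]][[:space:]]*`, the portable spelling of `[[:space:]]\+`.

```bash
# Hypothetical flat lookup: yaml_get KEY FILE
# Prints the value of the first line whose key matches KEY, at any indentation.
yaml_get() {
  local key="$1" file="$2"
  grep -m1 -E "^[[:space:]]*${key}:" "$file" \
    | sed -e 's/^[[:space:]]*[^:]*:[[:space:]]*//' \
          -e 's/[[:space:]][[:space:]]*#.*$//'
}

# Usage:
#   yaml_get mm_runtime hosts/<node>/node.yaml
```

The first expression removes everything up to and including the first colon, so a value that itself contains a colon survives. One limitation of this sketch: for a key with an empty value followed by a comment (such as `ip:` before the join), the first expression also consumes the spaces before `#`, so the comment text is returned instead of an empty string.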

---

## lib/ reference

```
lib/common.sh  — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
lib/remote.sh  — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
```

`run()` contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, `command -v`, status queries) always execute so the plan is realistic.
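
A minimal sketch of a wrapper that satisfies this contract is shown below. It assumes the `DRY_RUN=1` convention mentioned in the gotchas table; the `[dry-run]` output prefix is a hypothetical choice, and the real `run()` in `lib/common.sh` may differ.

```bash
# Hypothetical sketch: mutations go through run(); probes are called directly.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command that would have run, then report success.
    printf '[dry-run] %s\n' "$*"
    return 0
  fi
  "$@"
}

# Usage inside a step script:
#   command -v docker >/dev/null    # probe: always executes
#   run mkdir -p /opt/homelab       # mutation: printed only when DRY_RUN=1
```

Exporting `DRY_RUN` in the orchestrator means child step scripts inherit it, which is what lets the dry run reach past the orchestrator level.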

---

## Definition of Done

A node is fully onboarded when:

1. `50-verify` exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
2. `hosts/<node>/node.yaml` committed with all preflight fields filled.
3. `hosts/<node>/capabilities.yaml` present and accurate.
4. Node appears in `inventory/topology.yaml`.
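
The repo-side items (2 to 4) can be spot-checked with a short script. This is a hedged sketch: `check_repo_done` is a hypothetical helper, it only tests that the files exist and that the node name occurs as plain text in `inventory/topology.yaml`, and it does not verify that the files are committed or that the preflight fields are filled.

```bash
# Hypothetical sketch: check_repo_done NODE [REPO_ROOT]
# Returns 0 only if both host files exist and NODE is mentioned in the topology.
check_repo_done() {
  local node="$1" root="${2:-.}" rc=0 f
  for f in "hosts/${node}/node.yaml" "hosts/${node}/capabilities.yaml"; do
    [ -f "${root}/${f}" ] || { echo "missing: ${f}" >&2; rc=1; }
  done
  # Plain-text match; the real topology format is not assumed here.
  grep -qF -- "${node}" "${root}/inventory/topology.yaml" 2>/dev/null \
    || { echo "not in inventory/topology.yaml: ${node}" >&2; rc=1; }
  return "$rc"
}

# Usage (from the worktree root):
#   check_repo_done <node> && echo "repo-side checks passed"
```

Item 1 still depends on `50-verify` exiting 0 against the live node.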