ECC-format skill for the node onboarding workflow. Covers full step sequence, operational rules, node.yaml key fields, gotchas from LUSTRO session, and Definition of Done. Marked as living doc — SCAFFOLD sections to be promoted to PROVEN as steps land on real nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
147 lines
6.2 KiB
Markdown
147 lines
6.2 KiB
Markdown
---
|
|
name: node-onboarding
|
|
description: >
|
|
Use when the user wants to add or onboard a new node to homelab-codex —
|
|
repo manifest, Tailscale mesh, node-agent, monitoring, and UI registration.
|
|
Keywords: "nowy node", "dodaj node", "onboarding", "onboard node".
|
|
living_doc: true
|
|
maturity: partial # PROVEN: 00-access; SCAFFOLD: base → verify. Update after each step lands on a real node.
|
|
---
|
|
|
|
> **Living document** — sections marked **SCAFFOLD** are stubs waiting for battle-testing on a real node.
|
|
> Promote to **PROVEN** after each step passes end-to-end. Do not treat SCAFFOLD sections as authoritative.
|
|
|
|
## Trigger
|
|
|
|
User asks to onboard / add a new node. Load this skill before touching any onboarding script or node.yaml.
|
|
|
|
---
|
|
|
|
## Workflow — one step at a time
|
|
|
|
```
|
|
preflight (read-only)
|
|
└─ 00-access [PROVEN]
|
|
└─ base [SCAFFOLD]
|
|
└─ node-agent [SCAFFOLD]
|
|
└─ register [SCAFFOLD]
|
|
└─ verify(50) [SCAFFOLD]
|
|
```
|
|
|
|
Never skip ahead. Each step must exit 0 before the next begins.
|
|
|
|
---
|
|
|
|
## Invocation
|
|
|
|
```bash
|
|
# Full onboarding (all steps in order)
|
|
scripts/onboard/onboard.sh --node <name>
|
|
|
|
# Single step
|
|
scripts/onboard/onboard.sh --node <name> --step 00-access
|
|
|
|
# Resume from a step
|
|
scripts/onboard/onboard.sh --node <name> --from 10-bootstrap-runtime
|
|
|
|
# Dry-run — probes run for real; mutations are printed, not executed
|
|
scripts/onboard/onboard.sh --node <name> --dry-run
|
|
```
|
|
|
|
---
|
|
|
|
## Step status table
|
|
|
|
| Step | File | Status | What it does |
|
|
|------|------|--------|--------------|
|
|
| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: arch, RAM, docker, swap, MM runtime → YAML snippet for node.yaml |
|
|
| `00-access` | `steps/00-access.sh` | **PROVEN** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify over mesh |
|
|
| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | SCAFFOLD | Create `/opt/homelab/` layout, `chown <ssh_user>` |
|
|
| `20-install-docker` | `steps/20-install-docker.sh` | SCAFFOLD | Install Docker Engine if `docker_present=false`; skip if already installed |
|
|
| `40-deploy-node-agent` | `steps/40-deploy-node-agent.sh` | SCAFFOLD | Deploy node-agent container; user 1000:1000; `mem_limit` from node.yaml |
|
|
| `50-verify` | `steps/50-verify.sh` | SCAFFOLD | End-to-end smoke: event reaches control plane, visible in UI, Telegram alert path |
|
|
|
|
---
|
|
|
|
## node.yaml — key fields
|
|
|
|
```yaml
|
|
name: LUSTRO # ALL CAPS
|
|
role: edge # edge | compute | infra
|
|
ssh_user: pi # existing user on the node
|
|
first_contact: pi@192.168.31.19 # LAN IP — NEVER .local (mDNS unreliable in automation)
|
|
tailscale:
|
|
hostname: lustro # mesh name; switch to this after tailscale up
|
|
ip: # fill after join
|
|
deploy_autonomy: true # false → print manual instructions and stop
|
|
git_control: false # false → push-based from SATURN (edge nodes)
|
|
hardware:
|
|
arch: arm64 # filled by 00-preflight
|
|
ram_mb: 4096 # filled by 00-preflight
|
|
swap:
|
|
kind: zram # zram | file | none
|
|
docker_present: true # filled by 00-preflight
|
|
mm_runtime: systemd:magicmirror.service # filled by 00-preflight; none if absent
|
|
services:
|
|
node-agent:
|
|
runtime:
|
|
engine: docker
|
|
mem_limit: 256m # mandatory on RAM-constrained hosts (≤4 GB)
|
|
```
|
|
|
|
preflight fills `arch`, `ram_mb`, `docker_present`, `mm_runtime` — do NOT guess these.
|
|
|
|
Full schema: `scripts/onboard/README.md`.
|
|
|
|
---
|
|
|
|
## Operational rules (PROVEN)
|
|
|
|
**PLAN-FIRST** — before any mutation, show exactly what will touch the remote host.
|
|
Always run `--dry-run` first; dry-run must print real commands (`run()` propagation).
|
|
|
|
**Idempotency** — every step is safe to re-run. Keys, Tailscale join, Docker install → skip if already done.
|
|
|
|
**Isolation** — do NOT touch existing services on the node (e.g. MagicMirror as systemd unit).
|
|
|
|
**Worktree discipline** — onboarding is a feature. Work in a task worktree (`agent.sh new`), never in the main checkout (`~/homelab-codex-ws` is deploy-only). See [[worktree-aware]].
|
|
|
|
---
|
|
|
|
## Gotchas (battle-tested)
|
|
|
|
| Problem | Fix |
|
|
|---------|-----|
|
|
| mDNS `.local` resolve fail | Always use LAN IP in `first_contact`; `.local` OK interactively, not in automation |
|
|
| uid=1000 collision on RPi OS | If `pi` already holds uid=1000 → USE that user, don't create `oskar`. node-agent `1000:1000` matches out-of-box; creating a second uid=1000 breaks MM ownership |
|
|
| passwordless sudo not guaranteed | Verify `sudo -n true` exits 0 before any sudo-over-SSH step. RPi OS default may require password; ssh without TTY will hang |
|
|
| swap file on SD card | Use zram, not a swap file (SD wear). Add migration to `10-bootstrap-runtime` |
|
|
| RAM ≤4 GB with heavy app | `mem_limit` on node-agent is mandatory — same OOM profile as VPS |
|
|
| Docker already installed | Check `docker_present` from preflight; skip install step if true |
|
|
| SSH known-hosts warning in parsed output | Pass `-o LogLevel=ERROR` to SSH for new mesh hosts |
|
|
| `yaml_get` drops value prefix after `:` | Non-greedy colon: `s/^[[:space:]]*[^:]*:[[:space:]]*//'` — handles `systemd:unit` correctly |
|
|
| `yaml_get` keeps inline YAML comments | Strip with `s/[[:space:]]\+#.*$//` after extraction (requires ≥1 space before `#`) |
|
|
| dry-run stops at orchestrator level | `run()` wrapper + `export DRY_RUN=1` propagated to all step scripts; probes execute for real |
|
|
|
|
---
|
|
|
|
## lib/ reference
|
|
|
|
```
|
|
lib/common.sh — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
|
|
lib/remote.sh — rrun/rcopy/rsync_dir/rcheck (SSH wrappers; uses ONBOARD_SSH_USER / ONBOARD_HOST)
|
|
```
|
|
|
|
`run()` contract: in dry-run mode prints intent without executing; probes (ssh BatchMode=yes, `command -v`, status queries) always execute so the plan is realistic.
|
|
|
|
---
|
|
|
|
## Definition of Done
|
|
|
|
A node is fully onboarded when:
|
|
|
|
1. `50-verify` exits 0 — event visible in control-plane UI and Telegram alert path confirmed.
|
|
2. `hosts/<node>/node.yaml` committed with all preflight fields filled.
|
|
3. `hosts/<node>/capabilities.yaml` present and accurate.
|
|
4. Node appears in `inventory/topology.yaml`.
|