From d2fb2b3d41b548c4a7affb9463f0b55ad5245d96 Mon Sep 17 00:00:00 2001
From: Oskar Kapala <oskar.kapala@gmail.com>
Date: Mon, 8 Jun 2026 22:31:12 +0200
Subject: [PATCH] docs: onboard README + CLAUDE.md worktree discipline reminder
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

scripts/onboard/README.md (new):
- Tool purpose and --node/--step/--from/--dry-run usage
- Full node.yaml field schema with annotations (ssh_user uid-1000
  gotcha, first_contact IP vs .local, deploy_autonomy/git_control gates)
- Step status table (00-access DONE, 00-preflight SCAFFOLD, 10-50 TODO)
- lib/ architecture: run() dry-run convention, yaml_get fallback caveats
- Gotchas/Learnings table from session

CLAUDE.md:
- Node Onboarding section: onboard.sh commands, pointer to README
- Multi-agent worktree mode: add explicit DISCIPLINE RULE — feature
  work must happen in agent.sh worktrees, not the main checkout;
  references the 2026-06-08 session that violated this

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 CLAUDE.md                 |  18 +++++
 scripts/onboard/README.md | 139 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)
 create mode 100644 scripts/onboard/README.md
diff --git a/CLAUDE.md b/CLAUDE.md
index 33551c9..edcaf36 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -30,10 +30,22 @@ scripts/deploy/deploy.sh --service mosquitto   # specific service only
 ./scripts/deploy/deploy-node.sh chelsty-infra  # CHELSTY nodes (individually)
 ./scripts/bootstrap/prepare-node.sh            # general node bootstrap
 ./scripts/bootstrap/chelsty-runtime.sh         # CHELSTY-specific bootstrap
+scripts/onboard/onboard.sh --node <name>       # onboard a new node (idempotent, bash)
+scripts/onboard/onboard.sh --node <name> --step 00-access   # single step
+scripts/onboard/onboard.sh --node <name> --dry-run          # simulate
 ```
 
 Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
 
+## Node Onboarding
+
+New nodes are onboarded via `scripts/onboard/` — an idempotent bash tool driven by
+`hosts/<node>/node.yaml` manifests (no Ansible). See `scripts/onboard/README.md` for
+the full schema, step status table, and gotchas.
+
+Key fields in `node.yaml`: `ssh_user`, `first_contact` (LAN IP — not `.local`),
+`tailscale.hostname`, `deploy_autonomy`, `git_control`, `hardware.*`.
+
 ## Service Structure
 
 Every service must follow this layout:
@@ -186,6 +198,12 @@ Before any new or changed service is considered ready:
 `~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
 Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
 
+**DISCIPLINE RULE — enforced after 2026-06-08 session violation:**
+All feature/implementation work MUST happen in a task worktree, never directly in the main
+checkout. The main checkout is for reading context and running deploys only. If you are
+about to create a new branch or make implementation commits while `pwd` is
+`~/homelab-codex-ws`, stop and ask the operator to run `agent.sh new <name>` first.
+
 If `.agent-task` exists in your current working directory, you are in a task worktree.
 **You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
 before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
diff --git a/scripts/onboard/README.md b/scripts/onboard/README.md
new file mode 100644
index 0000000..35530af
--- /dev/null
+++ b/scripts/onboard/README.md
@@ -0,0 +1,139 @@
+# scripts/onboard — Node Onboarding Tool
+
+Idempotent, declarative node onboarding in bash, without Ansible.
+Each node is described by a `hosts/<node>/node.yaml` manifest; the
+`onboard.sh` script reads the manifest and runs the numbered steps in order.
+
+## Usage
+
+```bash
+scripts/onboard/onboard.sh --node <name> [--step <name>] [--from <step>] [--dry-run]
+```
+
+| Flag | Description |
+|------|-------------|
+| `--node <name>` | Node name (required); matches `hosts/<name>/node.yaml` |
+| `--step <name>` | Run only this single step (e.g. `00-access`) |
+| `--from <step>` | Start at this step and continue to the end |
+| `--dry-run` | Sets `DRY_RUN=1`; mutations are simulated by `run()`, probes still execute for real |
+
+```bash
+# Full onboarding
+scripts/onboard/onboard.sh --node lustro
+
+# A single step only
+scripts/onboard/onboard.sh --node lustro --step 00-access
+
+# From a given step onward
+scripts/onboard/onboard.sh --node lustro --from 10-bootstrap-runtime
+
+# Preview without changes (state probes run for real, so the plan is realistic)
+scripts/onboard/onboard.sh --node lustro --dry-run
+```
+
+## hosts/\<node\>/node.yaml: schema
+
+```yaml
+name: LUSTRO                        # node name (ALL CAPS)
+role: edge                          # edge | compute | infra
+location: KEN                       # location identifier
+
+ssh_user: pi                        # SSH user; may differ from "oskar" on edge nodes
+                                    # (uid=1000 collision: use the existing user)
+first_contact: pi@192.168.31.19     # SSH target before Tailscale; MUST be an IP, not .local
+                                    # (mDNS .local is unreliable in automation)
+tailscale:
+  hostname: lustro                  # name in the mesh; SSH target after tailscale up
+  ip:                               # filled in after join (optional)
+
+deploy_autonomy: true               # true = onboard.sh may perform mutations autonomously
+                                    # false = print manual instructions and stop
+git_control: false                  # true = node pulls from Forgejo
+                                    # false = push-based from SATURN (edge nodes)
+
+hardware:
+  arch: arm64                       # aarch64 | x86_64 | armv7l; filled in from 00-preflight
+  ram_mb: 4096                      # RAM in MB; filled in from 00-preflight
+  swap:
+    kind: zram                      # zram | file | none; zram recommended (SD wear)
+  docker_present: true              # is docker already installed?; filled in from 00-preflight
+  mm_runtime: systemd:magicmirror.service
+                                    # MagicMirror runtime: systemd:<unit> | pm2 | process | none
+                                    # filled in from 00-preflight
+
+services:
+  node-agent:
+    runtime:
+      engine: docker                # docker | docker-compose
+      mem_limit: 256m               # required (RPi4 RAM profile is VPS-like: OOM risk)
+```
+
+### Field notes
+
+- **`ssh_user`**: on edge nodes with an existing uid=1000 user (e.g. `pi` on RPi OS), use
+  that user instead of creating `oskar`; docker group membership and node-agent's `mem_limit`
+  are designed around `1000:1000`.
+- **`first_contact`**: always an IP, never a `.local` hostname. mDNS proved unreliable
+  in automation (transient resolution failures). After `tailscale up`, use `tailscale.hostname`.
+- **`deploy_autonomy`**: when `false`, steps 10+ print manual instructions and exit
+  without mutating anything. Useful for nodes managed by someone else.
+- **`git_control`**: when `false`, steps with `git`/`repo`/`clone` in their name are skipped.
+
+## Step status
+
+| Step | File | Status | Description |
+|------|------|--------|-------------|
+| `00-access` | `steps/00-access.sh` | **DONE** | SSH key → `first_contact`, install Tailscale, `tailscale up` (interactive URL), verify `pi@<ts_hostname>` arch=aarch64 |
+| `00-preflight` | `steps/00-preflight.sh` | SCAFFOLD | Read-only: collects facts (arch, RAM, docker, swap, MM runtime), prints a report plus a YAML snippet to paste into node.yaml |
+| `10-bootstrap-runtime` | `steps/10-bootstrap-runtime.sh` | TODO | Creates the `/opt/homelab/` layout, `chown <ssh_user>` |
+| `20-install-docker` | `steps/20-install-docker.sh` | TODO | Installs Docker Engine if `docker_present=false`; skipped when already installed |
+| `30-install-tailscale` | `steps/30-install-tailscale.sh` | TODO | Superseded by `00-access` for new nodes; may still be used for re-join |
+| `40-deploy-node-agent` | `steps/40-deploy-node-agent.sh` | TODO | Deploys node-agent in docker; user 1000:1000; `mem_limit` taken from node.yaml |
+| `50-verify` | `steps/50-verify.sh` | TODO | End-to-end smoke test: event reached the control plane, is visible in the UI, Telegram alert path works |
+
+## lib/ architecture
+
+```
+lib/common.sh   — log/warn/die/step/dryrun, run(), yaml_get, ensure_line, git() wrapper
+lib/remote.sh   — rrun/rcopy/rsync_dir/rcheck (SSH wrappers, ONBOARD_SSH_USER/HOST)
+```
+
+### run() and dry-run
+
+When `--dry-run` is passed, the orchestrator exports `DRY_RUN=1` to every step script.
+
+```bash
+# Wrap mutations in run(): in dry-run it prints the intent instead of executing
+run ssh-copy-id -i ~/.ssh/id_ed25519.pub pi@192.168.31.19
+
+# State probes (ssh BatchMode test, command -v, status query) ALWAYS execute:
+# dry-run must show a realistic plan based on the current state
+if ssh -o BatchMode=yes pi@192.168.31.19 true 2>/dev/null; then
+    log "key already present — skip"
+fi
+```
+
+### yaml_get: fallback without yq
+
+When `yq` is not available, a `grep`+`sed` fallback is used. Pitfalls:
+
+- Inline YAML comments (`key: value   # comment`) are stripped by
+  `s/[[:space:]]\+#.*$//`, which requires at least one whitespace character before `#`, so
+  `url#fragment` is left intact.
+- The key is stripped with `s/^[[:space:]]*[^:]*:[[:space:]]*//`, which stops at the first `:`, so
+  values containing a colon (e.g. `systemd:magicmirror.service`) are read correctly.
+- Dot-paths (`tailscale.hostname`) work only with `yq`; the fallback matches on the last
+  segment (`hostname`). Field names in node.yaml must therefore be unique.
+
+## Gotchas / Learnings
+
+| Problem | Solution |
+|---------|----------|
+| mDNS `.local` is unreliable | Use an IP in `first_contact`; `.local` is fine interactively, not in automation |
+| Existing uid=1000 on an edge node | Use that user; do not create `oskar` (uid collision, breaks MM file ownership) |
+| Swap file on SD card | Migrate to zram to reduce wear; add a step to `10-bootstrap-runtime` |
+| dry-run stops at the orchestrator | `run()` wrapper + `export DRY_RUN=1`; probes must also run in dry-run |
+| SSH known-hosts warning in parsed output | `-o LogLevel=ERROR` on SSH to a new host in the mesh |
+| `yaml_get` drops the part of a value before `:` | Match only up to the first colon: `^[[:space:]]*[^:]*:` instead of `.*:` |
+| `yaml_get` does not strip inline comments | Apply `s/[[:space:]]\+#.*$//` after extracting the value |
+| RPi4 with 4 GB RAM: OOM risk | `mem_limit` in the node-agent override is required (VPS-like profile) |