Compare commits
16 commits: feat/vps-s ... master

| Author | SHA1 | Date | |
|---|---|---|---|
| | 58ac6edd7d | | |
| | 19fd8799d9 | | |
| | 7f17b65278 | | |
| | e6a2443412 | | |
| | f9b145585f | | |
| | 3b620ef7e3 | | |
| | 745e52723c | | |
| | 1abe925f65 | | |
| | 1c69a5bc29 | | |
| | 02e7c28823 | | |
| | db592fbc28 | | |
| | 00fc36df3a | | |
| | f5dcefc752 | | |
| | 98437d46b2 | | |
| | 5e97b4e448 | | |
| | ffb0608b9a | | |
43
.claude/skills/deploy/SKILL.md
Normal file
@@ -0,0 +1,43 @@
+---
+name: deploy
+description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
+---
+
+Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
+Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
+
+## Targets
+
+| Target | What it deploys |
+|---|---|
+| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
+| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
+| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
+| `solaria` | SOLARIA compute services |
+| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
+
+## Invocation
+
+```bash
+scripts/deploy/deploy.sh <target> # full pipeline
+scripts/deploy/deploy.sh <target> --dry-run # preflight + gate only
+scripts/deploy/deploy.sh <target> --no-gate # emergency: bypass tests
+```
+
+## Exit Code Handling
+
+| Code | Meaning | Required action |
+|---|---|---|
+| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
+| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
+| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
+| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
+| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
+| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
+
+## Rules
+
+- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
+- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
+- Canonical branch is `master` — preflight enforces this.
+- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
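
The exit-code table in the deploy skill amounts to a dispatch on the status returned by `scripts/deploy/deploy.sh`. The following is a minimal sketch of that mapping, assuming a hypothetical helper named `describe_deploy_exit` that does not exist in the repository; the messages only paraphrase the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): map a deploy.sh exit code
# to the action required by the skill's "Exit Code Handling" table.
describe_deploy_exit() {
    case "$1" in
        0) echo "success: report target, commit hash, gate status, verify status, elapsed time" ;;
        1) echo "preflight failed: fix the upstream issue, never bypass" ;;
        2) echo "gate failed: show the failing test/build, do not deploy" ;;
        3) echo "execute failed: show full deploy output, ask about investigate vs rollback" ;;
        4) echo "verify failed: show docker ps output, discuss rollback" ;;
        5) echo "sudo handoff: print the manual command from stderr verbatim and stop" ;;
        *) echo "unknown exit code: $1" ;;
    esac
}
```

A caller could run `scripts/deploy/deploy.sh vps --dry-run` and then pass `"$?"` to the helper; keeping the mapping in one `case` makes it easy to compare against the table line by line.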
65
.claude/skills/save-session/SKILL.md
Normal file
@@ -0,0 +1,65 @@
+---
+name: save-session
+description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
+---
+
+**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
+Never invoke proactively. Never invoke mid-task.
+
+## 1. Determine Session Boundary
+
+1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
+2. Fallback if no previous entry exists: 24 hours ago.
+
+## 2. Collect Facts (deterministic only — no invention)
+
+Run exactly:
+```bash
+# All commits since boundary
+git --no-pager log --oneline <boundary>..HEAD
+
+# Changed file summary
+git --no-pager diff --stat <boundary>..HEAD
+```
+
+From the visible conversation transcript: deploys run and their outcomes, test results seen.
+
+## 3. Write the Session Entry
+
+**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
+Never overwrite existing content.
+
+```markdown
+## Session HH:MM
+
+### Commits
+<output of git log --oneline>
+
+### Files changed
+<output of git diff --stat>
+
+### Deploys
+<list from transcript, or "None recorded">
+
+### Narrative
+> _user-provided summary_
+```
+
+The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
+
+## 4. What NOT to Touch
+
+- `backlog.md` — only on explicit "update backlog" instruction
+- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
+- Any other file not listed above
+
+## 5. Commit
+
+Stage and commit **only** the session file:
+
+```bash
+git add docs/sessions/YYYY-MM-DD.md
+git commit -m "docs: session YYYY-MM-DD HH:MM"
+```
+
+No other files. No `git add -A`.
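
Two steps of the save-session skill lend themselves to a small illustration: finding the last `## Session HH:MM` heading, and appending a new entry without overwriting earlier ones. The sketch below assumes two hypothetical helper names, `last_session_time` and `append_session`, neither of which is in the repository, and it writes only a reduced skeleton of the entry.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustrative names, not from the repo).

# Print the HH:MM of the last "## Session HH:MM" heading in a session file.
last_session_time() {
    grep -E '^## Session [0-9]{2}:[0-9]{2}$' "$1" | tail -n 1 | awk '{print $3}'
}

# Append a new session skeleton. ">>" creates the file if needed and
# never truncates content that is already there.
append_session() {
    local file=$1 time=$2
    mkdir -p "$(dirname "$file")"
    {
        printf '\n## Session %s\n\n' "$time"
        printf '### Narrative\n> _user-provided summary_\n'
    } >> "$file"
}
```

Note that a revision range such as `<boundary>..HEAD` expects a commit rather than a clock time, so a timestamp boundary would probably be passed through `git log --since=<time>` or first resolved to a commit; which of the two the skill intends is not stated in the source.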
81
.claude/skills/worktree-aware/SKILL.md
Normal file
@@ -0,0 +1,81 @@
+---
+name: worktree-aware
+description: >
+  Use when working in a git worktree checkout for a parallel agent task.
+  The presence of an .agent-task file in the current working directory indicates
+  a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
+  to the assigned task branch, NEVER push origin master, NEVER touch the main
+  checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
+  completion, report the branch name verbatim and stop — the human merges via
+  scripts/dev/agent.sh.
+---
+
+## When this applies
+
+- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
+- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
+  In the main checkout these rules do not apply.
+
+## Reading the marker
+
+`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
+
+```yaml
+task: my-feature
+branch: task/my-feature
+parent_commit: abc1234
+created_utc: 2026-06-03T10:00:00Z
+worktree_path: /home/oskar/homelab-codex-ws-my-feature
+```
+
+Always read this file first before taking any action.
+
+## Rules
+
+1. **Commit only to your branch.**
+   Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
+   If it does not, stop immediately and report the discrepancy.
+
+2. **Push only to your branch.**
+   The only permitted push is `git push origin task/<name>`.
+   NEVER `git push origin master` or any other branch.
+
+3. **Do not touch the main checkout.**
+   `~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
+   Do not read from, write to, or execute commands inside it.
+
+4. **Stay scoped.**
+   Only change files directly related to your assigned task.
+   If you notice other problems, report them in your final summary as separate follow-up proposals.
+   Do not fix them in this worktree.
+
+5. **Never `git add -A`.**
+   Always stage specific files by name: `git add path/to/file`.
+
+6. **Do not manage worktrees.**
+   Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
+   Worktree lifecycle is the human's responsibility.
+
+7. **Final report before stopping.**
+   When the task is done, provide a structured report containing:
+   - Files changed (path and one-line summary of change)
+   - Tests run and results
+   - All commit hashes on the task branch
+   - **Branch name verbatim** (copy-paste ready)
+   - Follow-up items as bulleted proposals for separate tasks
+
+## Definition of Done
+
+- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
+- Test suite passes
+- Branch pushed: `git push origin task/<name>`
+- Full report delivered in conversation
+
+## What you do NOT do
+
+- Merge branches
+- Create or push tags
+- Run deploys or healthchecks against production nodes
+- Delete branches or worktrees
+- Modify files in other worktrees
+- Push to `origin master` under any circumstances
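
Rule 1 of the worktree-aware skill (confirm the current branch matches the `branch:` key in `.agent-task` before committing) can be sketched as a small guard. Both function names below, `agent_task_branch` and `on_assigned_branch`, are hypothetical and not part of the repository, and the parser assumes the flat `key: value` layout shown in the skill's example rather than general YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the top-level "branch:" key
# from an .agent-task file (flat "key: value" lines only).
agent_task_branch() {
    sed -n 's/^branch:[[:space:]]*//p' "$1" | head -n 1
}

# Hypothetical guard: succeed only when the given branch name matches
# the branch assigned in the marker file.
#   $1 = path to .agent-task, $2 = current branch name
on_assigned_branch() {
    local assigned
    assigned=$(agent_task_branch "$1")
    [[ -n "$assigned" && "$assigned" == "$2" ]]
}
```

In a worktree, the second argument would typically come from `git rev-parse --abbrev-ref HEAD`; a failing guard corresponds to the "stop immediately and report the discrepancy" branch of the rule.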
12
CLAUDE.md
@@ -180,3 +180,15 @@ Before any new or changed service is considered ready:
 - Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
 - Container names must match service names
 - Always `restart: unless-stopped` unless `service.yaml` says otherwise
+
+## Multi-agent worktree mode
+
+`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
+Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
+
+If `.agent-task` exists in your current working directory, you are in a task worktree.
+**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
+before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
+
+Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
+Agents never invoke these — only the human does.
@@ -1,33 +0,0 @@
-# AI cluster memory limits — HARD caps, containers are OOM-killed and auto-restarted
-# by Docker rather than consuming host memory. ai-cluster is the primary OOM suspect
-# (unbounded Python workers, no limits since deployment).
-#
-# Architectural note: compute workloads here should migrate to SOLARIA (GPU node).
-# Until migration: contain the blast radius with per-container limits.
-#
-# Pre-cutover: service-ops-worker still mounts compose/env from old paths.
-# After cutover and git pull, these overrides are removed and base compose paths are used.
-
-services:
-  codex-worker:
-    mem_limit: 64m
-
-  openclaw:
-    mem_limit: 128m
-
-  planner-worker:
-    mem_limit: 64m
-
-  service-ops-worker:
-    mem_limit: 64m
-    # Pre-cutover: override bind mounts to keep pointing at old dockeruser paths
-    volumes:
-      - /home/dockeruser/docker/ai-cluster/docker-compose.yml:/app/docker-compose.yml:ro
-      - /home/dockeruser/docker/ai-cluster/.env:/app/.env:ro
-      - /var/run/docker.sock:/var/run/docker.sock:rw
-
-  redis:
-    mem_limit: 32m
-
-  mosquitto:
-    mem_limit: 32m

@@ -1,6 +0,0 @@
-services:
-  app:
-    mem_limit: 224m
-
-  db:
-    mem_limit: 128m

@@ -1,3 +0,0 @@
-services:
-  node_exporter:
-    mem_limit: 32m

@@ -1,6 +0,0 @@
-services:
-  npm:
-    mem_limit: 160m
-    # Public ingress — elevated OOM protection so TLS termination + proxy host
-    # config survive memory pressure. Host OOM-killer will not target this container.
-    oom_score_adj: -800

@@ -1,9 +0,0 @@
-services:
-  outline:
-    mem_limit: 512m
-
-  postgres:
-    mem_limit: 96m
-
-  redis:
-    mem_limit: 32m
@@ -41,81 +41,3 @@ services:
     depends_on:
       local: []
       external: []
-
-  npm:
-    role: reverse-proxy-ingress
-    deployment_model: docker-compose
-    exposure: public
-    offline_required: false
-    depends_on:
-      local: []
-      external: []
-    ports:
-      - name: http
-        container_port: 80
-        protocol: tcp
-      - name: https
-        container_port: 443
-        protocol: tcp
-      - name: admin
-        container_port: 81
-        protocol: tcp
-    runtime:
-      data_path: /home/dockeruser/docker/npm/data
-      config_path: /opt/homelab/config/npm
-
-  outline:
-    role: team-wiki
-    deployment_model: docker-compose
-    exposure: public
-    offline_required: false
-    depends_on:
-      local:
-        - npm
-      external: []
-    ports:
-      - name: http
-        container_port: 3000
-        protocol: tcp
-    runtime:
-      config_path: /opt/homelab/config/outline
-
-  joplin:
-    role: note-sync-server
-    deployment_model: docker-compose
-    exposure: tailscale-internal
-    offline_required: false
-    depends_on:
-      local:
-        - npm
-      external: []
-    ports:
-      - name: http
-        container_port: 22300
-        bind: 127.0.0.1
-        protocol: tcp
-    runtime:
-      config_path: /opt/homelab/config/joplin
-
-  ai-cluster:
-    role: ai-worker-cluster
-    deployment_model: docker-compose
-    exposure: tailscale-internal
-    offline_required: false
-    depends_on:
-      local: []
-      external:
-        - piha:gateway
-    ports:
-      - name: openclaw-api
-        container_port: 8000
-        protocol: tcp
-      - name: mqtt
-        container_port: 1883
-        protocol: tcp
-        bind: tailscale
-    runtime:
-      config_path: /opt/homelab/config/ai-cluster
-    notes:
-      - "Local images must be built on VPS — not pulled from registry"
-      - "Compute workloads belong on SOLARIA; migrate when possible"
@@ -1,270 +1,321 @@
 #!/usr/bin/env bash
-# deploy.sh - Staged deployment framework for homelab nodes.
+# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
+# Usage: deploy.sh <target> [--dry-run] [--no-gate]
+# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
+# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)

-set -o pipefail
+set -uo pipefail

-# --- Configuration ---
-export RUNTIME_PATH="/opt/homelab"
-export STATE_DIR="${RUNTIME_PATH}/state/deploy"
-export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
-export REPO_PATH="${HOME}/homelab-codex-ws"
-export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
-export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+SSH_USER="${SSH_USER:-oskar}"
+START_TIME=$(date +%s)
+TARGET=""
+DRY_RUN=false
+NO_GATE=false

-# --- Initialization ---
-mkdir -p "$STATE_DIR" "$LOG_DIR"
+usage() {
+    cat >&2 <<'EOF'
+Usage: deploy.sh <target> [--dry-run] [--no-gate]

-# Redirection for logging
-exec > >(tee -a "$LOG_FILE") 2>&1
+Targets:
+  control-plane observer/supervisor/executor/operator-ui on VPS
+  vps all VPS GitOps services
+  piha PIHA services
+  solaria SOLARIA compute services
+  chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)

-# --- Load Libraries ---
-LIB_PATH="${REPO_PATH}/scripts/lib"
-source "${LIB_PATH}/log.sh"
-source "${LIB_PATH}/state.sh"
-source "${LIB_PATH}/inventory.sh"
-source "${LIB_PATH}/compose.sh"
-source "${LIB_PATH}/diagnostics.sh"
+Flags:
+  --dry-run run preflight + gate only; stop before deploy
+  --no-gate skip pytest + docker build (emergency only; logged as WARNING)

-# --- CLI Parsing ---
-TARGET_HOST=$(hostname)
-TARGET_SERVICE=""
-RESUME=false
-REQUESTED_STAGE=""
+Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
+EOF
+    exit 1
+}

 while [[ $# -gt 0 ]]; do
     case $1 in
-        --host)
-            TARGET_HOST="$2"
-            shift 2
-            ;;
-        --service)
-            TARGET_SERVICE="$2"
-            shift 2
-            ;;
-        --resume)
-            RESUME=true
-            shift
-            ;;
-        --stage)
-            REQUESTED_STAGE="$2"
-            shift 2
-            ;;
+        control-plane|vps|piha|solaria|chelsty-infra)
+            TARGET="$1"; shift ;;
+        --dry-run)
+            DRY_RUN=true; shift ;;
+        --no-gate)
+            NO_GATE=true; shift ;;
+        -h|--help)
+            usage ;;
         *)
-            if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
-                REQUESTED_STAGE="$1"
-            fi
-            shift
-            ;;
+            echo "Unknown argument: $1" >&2
+            usage ;;
     esac
 done

-# --- Stages ---
+[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }

-stage_prepare() {
-    local host=$1
-    if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping PREPARE (already complete)"
-        return 0
-    fi
+case "$TARGET" in
+    control-plane) SSH_HOST="vps" ;;
+    *) SSH_HOST="$TARGET" ;;
+esac

-    log "INFO" "Stage: PREPARE ($host)"
-    set_stage "prepare"
+case "$TARGET" in
+    chelsty-*) SSH_TIMEOUT=30 ;;
+    *) SSH_TIMEOUT=5 ;;
+esac

-    emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
+# ── PREFLIGHT ────────────────────────────────────────────────────────────────

-    cd "$REPO_PATH" || exit 1
-    log "INFO" "Pulling latest changes..."
-    if ! git pull; then
-        log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
-    fi
+preflight() {
+    echo "=== PREFLIGHT ==="

-    # Ensure runtime directories exist
-    mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
-
-    struct_log "prepare" "$host" "all" "success" "repo_updated"
-    mark_stage_complete "prepare"
-}
-
-stage_validate() {
-    local host=$1
-    if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping VALIDATE (already complete)"
-        return 0
-    fi
-
-    log "INFO" "Stage: VALIDATE ($host)"
-    set_stage "validate"
-
-    for service in "${SERVICES[@]}"; do
-        log "INFO" "Validating $service..."
-        if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
-            log "ERROR" "Service definition not found: $service"
-            struct_log "validate" "$host" "$service" "fail" "not_found"
-            return 1
-        fi
-    done
-
-    struct_log "validate" "$host" "all" "success" "validated"
-    mark_stage_complete "validate"
-}
-
-stage_deploy() {
-    local host=$1
-    if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping DEPLOY (already complete)"
-        return 0
-    fi
-
-    log "INFO" "Stage: DEPLOY ($host)"
-    set_stage "deploy"
-
-    local last_s=$(get_last_service)
-    local skip=false
-    if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
-        skip=true
-    fi
-
-    for service in "${SERVICES[@]}"; do
-        if [[ "$skip" == "true" ]]; then
-            if [[ "$service" == "$last_s" ]]; then
-                skip=false
-                log "INFO" "Resuming from $service..."
-            else
-                log "INFO" "Skipping $service (already processed)"
-                continue
-            fi
-        fi
-
-        log "INFO" "Deploying $service..."
-        set_last_service "$service"
-
-        if ! run_compose_up "$service"; then
-            struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
-            collect_diagnostics "$host" "$service"
-            return 1
-        fi
-
-        struct_log "deploy" "$host" "$service" "success" "deployed"
-    done
-
-    set_last_service ""
-    mark_stage_complete "deploy"
-}
-
-stage_verify() {
-    local host=$1
-    if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping VERIFY (already complete)"
-        return 0
-    fi
-
-    log "INFO" "Stage: VERIFY ($host)"
-    set_stage "verify"
-
-    for service in "${SERVICES[@]}"; do
-        log "INFO" "Verifying $service..."
-        local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
-        if [[ -f "$health_script" ]]; then
-            if ! bash "$health_script"; then
-                log "ERROR" "Healthcheck failed for $service"
-                struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
-                collect_diagnostics "$host" "$service"
-                return 1
-            fi
-        else
-            # Generic check if container is running
-            if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
-                log "ERROR" "Container $service is not running"
-                struct_log "verify" "$host" "$service" "fail" "container_not_running"
-                collect_diagnostics "$host" "$service"
-                return 1
-            fi
-        fi
-        struct_log "verify" "$host" "$service" "success" "verified"
-    done
-    mark_stage_complete "verify"
-}
-
-stage_complete() {
-    local host=$1
-    log "INFO" "Stage: COMPLETE ($host)"
-    set_stage "complete"
-    struct_log "complete" "$host" "all" "success" "deployment_finished"
-    clear_deployment_state
-}
-
-# --- Execution Logic ---
-
-run_deployment() {
-    local start_stage=$1
-
-    # Sequential execution from start_stage
-    case "$start_stage" in
-        prepare)
-            stage_prepare "$TARGET_HOST" || return 1
-            ;&
-        validate)
-            stage_validate "$TARGET_HOST" || return 1
-            ;&
-        deploy)
-            stage_deploy "$TARGET_HOST" || return 1
-            ;&
-        verify)
-            stage_verify "$TARGET_HOST" || return 1
-            ;&
-        complete)
-            stage_complete "$TARGET_HOST" || return 1
-            ;;
-        *)
-            log "ERROR" "Invalid stage: $start_stage"
-            return 1
-            ;;
-    esac
-}
-
-# --- Main ---
-
-log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
-
-if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
-    log "ERROR" "Failed to load inventory"
+    local branch
+    branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
+    if [[ "$branch" != "master" ]]; then
+        echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
         exit 1
-fi
-
-EXIT_STATUS=0
-if [[ "$RESUME" == "true" ]]; then
-    CURRENT=$(get_stage)
-    log "INFO" "Resuming from state: $CURRENT"
-    case "$CURRENT" in
-        prepare|validate|deploy|verify)
-            run_deployment "$CURRENT" || EXIT_STATUS=1
-            ;;
-        complete|none)
-            log "INFO" "No interrupted deployment found. Starting from scratch..."
-            run_deployment "prepare" || EXIT_STATUS=1
-            ;;
-        *)
-            log "INFO" "Unknown state. Starting from prepare..."
-            run_deployment "prepare" || EXIT_STATUS=1
-            ;;
-    esac
-elif [[ -n "$REQUESTED_STAGE" ]]; then
-    if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
-        collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
-    else
-        run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
     fi
||||||
else
|
echo "[ok] branch: master"
|
||||||
# New deployment - clear previous state
|
|
||||||
clear_deployment_state
|
if ! git -C "$REPO_ROOT" diff --quiet; then
|
||||||
run_deployment "prepare" || EXIT_STATUS=1
|
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
|
||||||
|
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "[ok] working tree clean"
|
||||||
|
|
||||||
|
git -C "$REPO_ROOT" fetch origin master --quiet
|
||||||
|
local unpushed
|
||||||
|
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
|
||||||
|
if [[ -n "$unpushed" ]]; then
|
||||||
|
echo "ERROR: Unpushed commits on master:" >&2
|
||||||
|
echo "$unpushed" >&2
|
||||||
|
echo "Push first: git push origin master" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "[ok] no unpushed commits"
|
||||||
|
|
||||||
|
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
|
||||||
|
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||||
|
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
|
||||||
|
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "[ok] ${SSH_HOST} reachable"
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── GATE ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
gate() {
|
||||||
|
if [[ "$NO_GATE" == "true" ]]; then
|
||||||
|
echo "=== GATE: SKIPPED ==="
|
||||||
|
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "=== GATE ==="
|
||||||
|
|
||||||
|
local services=()
|
||||||
|
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
services=("control-plane")
|
||||||
|
else
|
||||||
|
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
|
||||||
|
if [[ ! -f "$svc_yaml" ]]; then
|
||||||
|
echo "ERROR: ${svc_yaml} not found." >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
local svc_list
|
||||||
|
svc_list=$(python3 -c "
|
||||||
|
import yaml
|
||||||
|
with open('${svc_yaml}') as f:
|
||||||
|
data = yaml.safe_load(f)
|
||||||
|
svcs = data.get('services', {})
|
||||||
|
if isinstance(svcs, dict):
|
||||||
|
print('\n'.join(svcs.keys()))
|
||||||
|
elif isinstance(svcs, list):
|
||||||
|
print('\n'.join(svcs))
|
||||||
|
")
|
||||||
|
while IFS= read -r svc; do
|
||||||
|
[[ -z "$svc" ]] && continue
|
||||||
|
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
|
||||||
|
services+=("$svc")
|
||||||
|
fi
|
||||||
|
done <<< "$svc_list"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ ${#services[@]} -eq 0 ]]; then
|
||||||
|
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Services under gate: ${services[*]}"
|
||||||
|
local gate_failed=false
|
||||||
|
|
||||||
|
for svc in "${services[@]}"; do
|
||||||
|
local svc_dir="${REPO_ROOT}/services/${svc}"
|
||||||
|
|
||||||
|
if [[ -d "${svc_dir}/tests" ]]; then
|
||||||
|
echo "--- pytest: ${svc} ---"
|
||||||
|
if ! python3 -m pytest "${svc_dir}/tests" -q; then
|
||||||
|
echo "GATE FAIL: pytest failed for ${svc}" >&2
|
||||||
|
gate_failed=true
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "--- docker build: ${svc} ---"
|
||||||
|
if ! docker build --quiet "${svc_dir}" >/dev/null; then
|
||||||
|
echo "GATE FAIL: docker build failed for ${svc}" >&2
|
||||||
|
gate_failed=true
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$gate_failed" == "true" ]]; then
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
echo "[ok] gate passed"
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── EXECUTE ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
execute() {
|
||||||
|
echo "=== EXECUTE ==="
|
||||||
|
|
||||||
|
local cmd_output
|
||||||
|
local cmd_exit=0
|
||||||
|
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
echo "Running deploy-control-plane.sh --ssh..."
|
||||||
|
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|
||||||
|
|| cmd_exit=$?
|
||||||
|
else
|
||||||
|
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
|
||||||
|
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||||
|
"${SSH_USER}@${SSH_HOST}" \
|
||||||
|
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|
||||||
|
|| cmd_exit=$?
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "$cmd_output"
|
||||||
|
|
||||||
|
if echo "$cmd_output" | grep -qF "[sudo] password"; then
|
||||||
|
echo "" >&2
|
||||||
|
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
|
||||||
|
echo "Run manually:" >&2
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
|
||||||
|
else
|
||||||
|
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
|
||||||
|
fi
|
||||||
|
exit 5
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $cmd_exit -ne 0 ]]; then
|
||||||
|
echo "ERROR: Deploy command exited ${cmd_exit}." >&2
|
||||||
|
exit 3
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "[ok] execute completed"
|
||||||
|
}
|
||||||
|
|
+# ── VERIFY ───────────────────────────────────────────────────────────────────
+
+verify() {
+  echo "=== VERIFY ==="
+
+  local ps_output
+  local ps_exit=0
+  ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+    "${SSH_USER}@${SSH_HOST}" \
+    'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
+    || ps_exit=$?
+
+  if [[ $ps_exit -ne 0 ]]; then
+    echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
+    echo "$ps_output" >&2
+    exit 4
+  fi
+
+  echo "$ps_output"
+
+  local failed=false
+
+  local not_up
+  not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
+  if [[ -n "$not_up" ]]; then
+    echo "ERROR: Containers not in Up state:" >&2
+    echo "$not_up" >&2
+    failed=true
+  fi
+
+  local unhealthy
+  unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
+  if [[ -n "$unhealthy" ]]; then
+    echo "ERROR: Unhealthy containers:" >&2
+    echo "$unhealthy" >&2
+    failed=true
+  fi
+
+  if [[ "$TARGET" == "control-plane" ]]; then
+    for cp_svc in supervisor observer executor operator-ui; do
+      if ! echo "$ps_output" | grep -q "$cp_svc"; then
+        echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
+        failed=true
+      fi
+    done
+  fi
+
+  if [[ "$failed" == "true" ]]; then
+    echo "" >&2
+    echo "Full docker ps output above." >&2
+    exit 4
+  fi
+
+  echo "[ok] all containers healthy"
+}
+
+# ── REPORT ───────────────────────────────────────────────────────────────────
+
+report() {
+  local mode="${1:-deploy}"
+  local end_time
+  end_time=$(date +%s)
+  local elapsed
+  elapsed=$(( end_time - START_TIME ))
+  local commit_hash
+  commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
+  local gate_s verify_s
+
+  if [[ "$NO_GATE" == "true" ]]; then
+    gate_s="skip"
+  else
+    gate_s="ok"
+  fi
+
+  if [[ "$mode" == "dry-run" ]]; then
+    verify_s="skip(dry-run)"
+  else
+    verify_s="green"
+  fi
+
+  echo ""
+  if [[ "$mode" == "dry-run" ]]; then
+    echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+  else
+    echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+  fi
+}
+
+# ── MAIN ─────────────────────────────────────────────────────────────────────
+
+preflight
+gate
+
+if [[ "$DRY_RUN" == "true" ]]; then
+  report dry-run
+  exit 0
 fi

-if [[ $EXIT_STATUS -eq 0 ]]; then
-  print_summary "$TARGET_HOST" "SUCCESS"
-  log "INFO" "--- Homelab Deployment Finished Successfully ---"
-else
-  print_summary "$TARGET_HOST" "FAILED"
-  log "ERROR" "--- Homelab Deployment Failed ---"
-  exit 1
-fi
+execute
+verify
+report
361
scripts/dev/agent.sh
Executable file
@ -0,0 +1,361 @@
+#!/usr/bin/env bash
+# Multi-agent worktree manager.
+# EXIT: 0 ok, 1 preflight, 2 operation failed.
+set -euo pipefail
+
+trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
+
+RESERVED_NAMES=(master main HEAD list merge clean new)
+MAX_WORKTREES=4
+
+die() { echo "ERROR: $1" >&2; exit "${2:-2}"; }
+prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
+
+# ── helpers ──────────────────────────────────────────────────────────────────
+
+is_main_checkout() {
+  local git_dir common_dir
+  git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
+  common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
+  [ "$git_dir" = "$common_dir" ]
+}
+
+require_main_checkout() {
+  is_main_checkout || prefail "must run from the main checkout, not a worktree"
+}
+
+require_master_branch() {
+  local branch
+  branch=$(git rev-parse --abbrev-ref HEAD)
+  [ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
+}
+
+require_clean_tree() {
+  local dirty
+  dirty=$(git status --porcelain)
+  [ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
+}
+
+worktree_paths() {
+  # list worktree paths (excluding main); || true prevents grep exit-1 when empty
+  local main_path
+  main_path=$(git rev-parse --show-toplevel)
+  git worktree list --porcelain \
+    | awk '/^worktree /{p=$2} /^$/{print p}' \
+    | grep -vxF "${main_path}" \
+    || true
+}
+
+worktree_count() {
+  worktree_paths | wc -l
+}
+
+branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
+branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
+
+utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
+
+age_str() {
+  local created_utc="$1"
+  local now_ts created_ts diff_s
+  now_ts=$(date -u +%s)
+  # replace T with a space for `date -d`; the trailing Z is left in place
+  created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
+  diff_s=$(( now_ts - created_ts ))
+  if (( diff_s < 60 )); then echo "${diff_s}s"
+  elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
+  elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
+  else echo "$(( diff_s/86400 ))d"
+  fi
+}
+
+validate_name() {
+  local name="$1"
+  if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
+    prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
+  fi
+  for r in "${RESERVED_NAMES[@]}"; do
+    if [ "$name" = "$r" ]; then
+      prefail "'$name' is a reserved word"
+    fi
+  done
+}
+
+# ── subcommands ───────────────────────────────────────────────────────────────
+
+cmd_new() {
+  local name="${1:-}"
+  [ -n "$name" ] || { usage; exit 1; }
+
+  validate_name "$name"
+  require_main_checkout
+  require_master_branch
+  require_clean_tree
+
+  # worktree limit
+  local count
+  count=$(worktree_count)
+  if (( count >= MAX_WORKTREES )); then
+    echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
+    cmd_list
+    exit 1
+  fi
+
+  # branch collision
+  if branch_exists_local "task/$name"; then
+    prefail "branch task/$name already exists locally"
+  fi
+  git fetch origin master --quiet
+  if branch_exists_remote "refs/heads/task/$name"; then
+    prefail "branch task/$name already exists on origin"
+  fi
+
+  # directory collision
+  local main_path wt_path
+  main_path=$(git rev-parse --show-toplevel)
+  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
+  [ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
+
+  # create worktree
+  git worktree add -b "task/$name" "$wt_path" origin/master \
+    || die "git worktree add failed"
+
+  # write marker
+  local parent_commit
+  parent_commit=$(git rev-parse origin/master)
+  cat > "$wt_path/.agent-task" <<EOF
+task: $name
+branch: task/$name
+parent_commit: $parent_commit
+created_utc: $(utc_now)
+worktree_path: $wt_path
+EOF
+
+  echo ""
+  echo "Worktree created: $wt_path"
+  echo "Branch: task/$name"
+  echo ""
+  echo "── Start Claude Code in this worktree ──────────────────────────────────────"
+  echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
+  echo "─────────────────────────────────────────────────────────────────────────────"
+}
+
+cmd_list() {
+  local main_path
+  main_path=$(git rev-parse --show-toplevel)
+
+  # fetch to get up-to-date ahead/behind
+  git fetch origin master --quiet 2>/dev/null || true
+
+  local paths
+  paths=$(worktree_paths)
+
+  if [ -z "$paths" ]; then
+    echo "(no active task worktrees)"
+    return
+  fi
+
+  printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
+    "NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
+
+  while IFS= read -r wt_path; do
+    [ -z "$wt_path" ] && continue
+
+    local marker="$wt_path/.agent-task"
+    local task_name branch parent_commit created_utc
+    if [ -f "$marker" ]; then
+      task_name=$( grep '^task:' "$marker" | awk '{print $2}')
+      branch=$( grep '^branch:' "$marker" | awk '{print $2}')
+      parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
+      created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
+    else
+      task_name="(no marker)"
+      branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
+      parent_commit="?"
+      created_utc=""
+    fi
+
+    local status="clean"
+    local dirty
+    dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
+    [ -n "$dirty" ] && status="dirty"
+
+    local ahead behind ab
+    ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
+    behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
+    ab="+${ahead}/-${behind}"
+
+    local age=""
+    [ -n "$created_utc" ] && age=$(age_str "$created_utc")
+
+    local short_parent="${parent_commit:0:7}"
+    local short_created="${created_utc:0:10}"
+
+    printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
+      "$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
+  done <<< "$paths"
+}
+
+cmd_merge() {
+  local name="${1:-}"
+  [ -n "$name" ] || { usage; exit 1; }
+
+  require_main_checkout
+  require_master_branch
+  require_clean_tree
+
+  git fetch origin --quiet
+
+  branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
+
+  local main_path wt_path
+  main_path=$(git rev-parse --show-toplevel)
+  wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
+
+  # attempt ff-only merge
+  local merge_failed=0
+  git merge --ff-only "task/$name" || merge_failed=1
+
+  if (( merge_failed )); then
+    # abort any partial merge state
+    git merge --abort 2>/dev/null || true
+    echo ""
+    echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
+    echo " The branch has likely diverged from master." >&2
+    echo "" >&2
+    echo "Diagnose with:" >&2
+    echo " git log master..task/$name # commits only on task branch" >&2
+    echo " git log task/$name..master # commits master has that task doesn't" >&2
+    echo "" >&2
+    echo "Then decide: rebase task/$name onto master, or merge manually." >&2
+    echo "Worktree and branch are preserved — no changes made." >&2
+    exit 2
+  fi
+
+  echo "Merged task/$name into master (fast-forward)."
+
+  git push origin master || die "git push origin master failed"
+  echo "Pushed master to origin."
+
+  if [ -d "$wt_path" ]; then
+    git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
+    echo "Removed worktree: $wt_path"
+  else
+    echo "(worktree directory $wt_path not found — skipping worktree remove)"
+  fi
+
+  git branch -d "task/$name" || die "git branch -d task/$name failed"
+  echo "Deleted local branch task/$name."
+
+  git push origin --delete "task/$name" 2>/dev/null \
+    && echo "Deleted remote branch task/$name." \
+    || echo "(remote branch task/$name not found — nothing to delete)"
+
+  echo ""
+  echo "Done. task/$name merged and cleaned up."
+}
+
+cmd_clean() {
+  local main_path
+  main_path=$(git rev-parse --show-toplevel)
+  git fetch origin --quiet 2>/dev/null || true
+
+  local to_remove=()
+
+  # orphaned registered worktrees: branch deleted or fully merged into master
+  local paths
+  paths=$(worktree_paths)
+  while IFS= read -r wt_path; do
+    [ -z "$wt_path" ] && continue
+    local branch
+    branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
+    [ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
+
+    # branch gone locally?
+    if ! branch_exists_local "$branch"; then
+      to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
+      continue
+    fi
+
+    # branch fully merged into master?
+    local ahead
+    ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
+    if [ "$ahead" = "0" ]; then
+      to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
+    fi
+  done <<< "$paths"
+
+  # dangling directories: ../homelab-codex-ws-* not registered
+  local registered_paths
+  registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
+  local parent_dir
+  parent_dir=$(dirname "$main_path")
+  while IFS= read -r candidate; do
+    [ -d "$candidate" ] || continue
+    # -x: match the whole line, so a path that is only a prefix of a registered one still counts as dangling
+    if ! echo "$registered_paths" | grep -qxF "$candidate"; then
+      to_remove+=("dangling:$candidate")
+    fi
+  done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
+
+  if [ ${#to_remove[@]} -eq 0 ]; then
+    echo "Nothing to clean."
+    return 0
+  fi
+
+  echo "Found ${#to_remove[@]} item(s) to clean:"
+  for entry in "${to_remove[@]}"; do
+    echo " $entry"
+  done
+  echo ""
+
+  local overall_rc=0
+  for entry in "${to_remove[@]}"; do
+    local kind="${entry%%:*}"
+    local path="${entry#*:}"
+    # strip trailing annotation in parens
+    local raw_path
+    raw_path="${path%% (*}"
+
+    local confirm
+    read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
+    if [[ "$confirm" =~ ^[Yy]$ ]]; then
+      if [ "$kind" = "worktree" ]; then
+        git worktree remove --force "$raw_path" 2>/dev/null \
+          || { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
+      else
+        rm -rf "$raw_path"
+      fi
+      echo " Removed."
+    else
+      echo " Skipped."
+    fi
+  done
+
+  return $overall_rc
+}
+
+usage() {
+  cat <<'EOF'
+Usage: agent.sh <subcommand> [args]
+
+  agent.sh new <name>     Create a new task worktree (branch task/<name>)
+  agent.sh list           List active task worktrees with status
+  agent.sh merge <name>   Fast-forward merge task/<name> into master and clean up
+  agent.sh clean          Remove orphaned or dangling worktrees (interactive)
+
+EXIT: 0 ok, 1 preflight, 2 operation failed.
+EOF
+}
+
+# ── dispatch ──────────────────────────────────────────────────────────────────
+
+SUBCOMMAND="${1:-}"
+shift || true
+
+case "$SUBCOMMAND" in
+  new) cmd_new "$@" ;;
+  list) cmd_list "$@" ;;
+  merge) cmd_merge "$@" ;;
+  clean) cmd_clean "$@" ;;
+  *) usage; exit 1 ;;
+esac
@ -7,6 +7,34 @@ import yaml
 from datetime import datetime, timezone
 from pathlib import Path
+
+
+def _atomic_write_json(path: Path, data) -> None:
+    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
+    tmp = path.with_suffix(".tmp")
+    with open(tmp, "w") as f:
+        json.dump(data, f, indent=2)
+        f.flush()
+        os.fsync(f.fileno())
+    os.replace(tmp, path)
+
+
+def _parse_ts(ts) -> float:
+    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
+
+    Events from node-agent use int(time.time()); events from stability-agent / events.py
+    use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
+    last_occurrence and resolved_at, so any arithmetic on them must go through here.
+    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
+    """
+    if ts is None:
+        return 0.0
+    if isinstance(ts, (int, float)):
+        return float(ts)
+    try:
+        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
+    except Exception:
+        return 0.0

 # Constants and Paths
 RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
 EVENTS_DIR = Path(RUNTIME_PATH) / "events"
@ -124,8 +152,7 @@ class Observer:
     def _save_checkpoint(self):
         try:
-            with open(OBSERVER_STATE_FILE, "w") as f:
-                json.dump({"node_checkpoints": self.node_checkpoints}, f, indent=2)
+            _atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
         except Exception as e:
             logger.error(f"Failed to save checkpoint: {e}")
@ -173,12 +200,68 @@ class Observer:
             logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
             del self.world_state["services"][k]

-        # Remove resolved incidents older than 7 days.
         now = time.time()
+
+        try:
+            # Collect incident_ids currently referenced by any service entry.
+            linked_ids: set = {
+                svc.get("incident_id")
+                for svc in self.world_state["services"].values()
+                if svc.get("incident_id")
+            }
+
+            # Case 1 — service is healthy but still points at an active incident.
+            # process_event already calls _resolve_incident on service_healthy events,
+            # but if the observer restarted with on-disk state where the link was
+            # intact (inconsistency from a pre-atomic-write crash), it may not get
+            # resolved until the next service_healthy event is processed. Resolve
+            # immediately — a healthy service cannot have an ongoing incident.
+            for svc_key, svc in self.world_state["services"].items():
+                if svc.get("status") != "healthy":
+                    continue
+                inc_id = svc.get("incident_id")
+                if not inc_id:
+                    continue
+                inc = self.world_state["incidents"].get(inc_id, {})
+                if inc.get("status") == "active":
+                    logger.info(
+                        f"Auto-resolving incident {inc_id} for {svc_key}: "
+                        f"service is healthy"
+                    )
+                    inc["status"] = "resolved"
+                    inc["resolved_at"] = now
+                    svc["incident_id"] = None
+                    linked_ids.discard(inc_id)
+
+            # Case 2 — orphaned active incident: no service entry links to it and
+            # last_occurrence is older than 5 minutes (guard against creation races).
+            # These are the stale records left behind when on-disk state was
+            # inconsistent: the service entry had incident_id cleared but incidents.json
+            # still had the record as "active".
+            for inc_id, inc in self.world_state["incidents"].items():
+                if inc.get("status") != "active":
+                    continue
+                if inc_id in linked_ids:
+                    continue
+                age = now - _parse_ts(inc.get("last_occurrence"))
+                if age > 300:  # 5-minute guard
+                    logger.info(
+                        f"Auto-resolving orphaned incident {inc_id} "
+                        f"(service={inc.get('service')}, node={inc.get('node')}): "
+                        f"no service references it, age={int(age)}s"
+                    )
+                    inc["status"] = "resolved"
+                    inc["resolved_at"] = now
+
+        except Exception as exc:
+            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
+
+        # Remove resolved incidents older than 7 days.
+        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
         stale_incidents = [
             k for k, v in self.world_state["incidents"].items()
             if v.get("status") == "resolved"
-            and (now - (v.get("resolved_at") or now)) > 7 * 86400
+            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
         ]
         for k in stale_incidents:
             del self.world_state["incidents"][k]
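To make the two auto-resolve cases and the 7-day prune in this hunk easier to follow, here is a rough standalone sketch that applies the same rules to a toy state dict. The function name `reconcile_incidents`, the service keys, and the incident ids are all hypothetical; the sketch assumes numeric timestamps only and treats a missing timestamp as 0.0, as `_parse_ts` does.

```python
ORPHAN_GUARD_SECONDS = 300       # 5-minute guard, as in the diff
RETENTION_SECONDS = 7 * 86400    # resolved incidents are kept for 7 days


def reconcile_incidents(world_state: dict, now: float) -> None:
    """Hypothetical standalone version of the auto-resolve and prune steps."""
    services = world_state["services"]
    incidents = world_state["incidents"]
    linked_ids = {s.get("incident_id") for s in services.values() if s.get("incident_id")}

    # Case 1: a healthy service still linked to an active incident.
    for svc in services.values():
        inc_id = svc.get("incident_id")
        if svc.get("status") != "healthy" or not inc_id:
            continue
        inc = incidents.get(inc_id, {})
        if inc.get("status") == "active":
            inc["status"] = "resolved"
            inc["resolved_at"] = now
            svc["incident_id"] = None
            linked_ids.discard(inc_id)

    # Case 2: an active incident that no service links to, older than the guard.
    for inc_id, inc in incidents.items():
        if inc.get("status") != "active" or inc_id in linked_ids:
            continue
        if now - float(inc.get("last_occurrence") or 0.0) > ORPHAN_GUARD_SECONDS:
            inc["status"] = "resolved"
            inc["resolved_at"] = now

    # Prune: resolved incidents past the retention window.
    stale = [
        k for k, v in incidents.items()
        if v.get("status") == "resolved"
        and now - float(v.get("resolved_at") or 0.0) > RETENTION_SECONDS
    ]
    for k in stale:
        del incidents[k]


now = 1_000_000_000.0  # fixed "current time" so the example is deterministic
world = {
    "services": {
        "vps/npm": {"status": "healthy", "incident_id": "inc-1"},
        "piha/redis": {"status": "unhealthy", "incident_id": "inc-2"},
    },
    "incidents": {
        "inc-1": {"status": "active", "last_occurrence": now - 10},        # case 1
        "inc-2": {"status": "active", "last_occurrence": now - 10},        # still linked
        "inc-3": {"status": "active", "last_occurrence": now - 600},       # orphan, past guard
        "inc-4": {"status": "active", "last_occurrence": now - 60},        # orphan, inside guard
        "inc-5": {"status": "resolved", "resolved_at": now - 8 * 86400},   # past retention
        "inc-6": {"status": "resolved"},                                   # no resolved_at
    },
}
reconcile_incidents(world, now)
remaining = sorted(world["incidents"])
```

One behavior worth confirming with the author: under the new expression a resolved incident without `resolved_at` (like `inc-6` here) is pruned on the first pass, whereas the removed `or now` fallback gave such records an age of zero and kept them.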
@ -202,13 +285,12 @@ class Observer:
             "services.json": self.world_state["services"],
             "deployments.json": self.world_state["deployments"],
             "incidents.json": self.world_state["incidents"],
-            "recommendations.json": [], # Placeholder to satisfy requirements
+            "recommendations.json": [],
             "runtime-summary.json": self.world_state["summary"]
         }
         for filename, data in files.items():
             try:
-                with open(WORLD_DIR / filename, "w") as f:
-                    json.dump(data, f, indent=2)
+                _atomic_write_json(WORLD_DIR / filename, data)
             except Exception as e:
                 logger.error(f"Failed to save {filename}: {e}")
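As a quick illustration of the write-then-replace pattern these hunks switch to, the sketch below exercises a copy of the helper against a throwaway directory. The function is renamed `atomic_write_json` and the file name and payloads are made up for the example; it is not the observer's actual call site.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write to a sibling temp file, fsync it, then swap it into place with os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as d:  # stands in for the world-state directory
    target = Path(d) / "incidents.json"
    atomic_write_json(target, {"inc-1": {"status": "active"}})
    atomic_write_json(target, {"inc-1": {"status": "resolved"}})  # overwrite in place
    reloaded = json.loads(target.read_text())
    leftovers = sorted(p.name for p in Path(d).iterdir())
```

A reader of `incidents.json` sees either the old or the new complete file, never a half-written one, since `os.replace` is an atomic rename on POSIX when source and destination are on the same filesystem. Note that `with_suffix(".tmp")` maps `incidents.json` to `incidents.tmp`, so two targets sharing a stem in one directory would also share a temp name.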
@ -1,110 +0,0 @@
-services:
-  codex-worker:
-    image: ai-cluster-codex-worker
-    restart: unless-stopped
-    environment:
-      - AGENT_ID=vps-dev-1
-      - ROLE=dev
-      - MQTT_HOST=mosquitto
-      - MQTT_PORT=1883
-      - MQTT_USERNAME=${MQTT_USERNAME:-codex}
-      - MQTT_PASSWORD=${MQTT_PASSWORD}
-      - GATEWAY_BASE_URL=${GATEWAY_BASE_URL:-http://piha:8080}
-      - REQUEST_TIMEOUT_SECONDS=30
-    command: ["python", "worker.py"]
-    networks:
-      - internal
-
-  openclaw:
-    image: ai-cluster-openclaw
-    restart: unless-stopped
-    environment:
-      - MQTT_HOST=mosquitto
-      - MQTT_PORT=1883
-      - MQTT_USERNAME=${MQTT_USERNAME:-codex}
-      - MQTT_PASSWORD=${MQTT_PASSWORD}
-    command: ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
-    ports:
-      - "8000:8000"
-    networks:
-      - internal
-      - npm_default
-    healthcheck:
-      test: ["CMD", "wget", "-qO-", "http://localhost:8000/health"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 15s
-
-  planner-worker:
-    image: ai-cluster-planner-worker
-    restart: unless-stopped
-    environment:
-      - AGENT_ID=vps-planner-1
-      - ROLE=planner
-      - MQTT_HOST=mosquitto
-      - MQTT_PORT=1883
-      - MQTT_USERNAME=${MQTT_USERNAME:-codex}
-      - MQTT_PASSWORD=${MQTT_PASSWORD}
-    command: ["python", "planner_worker.py"]
-    networks:
-      - internal
-
-  service-ops-worker:
-    image: ai-cluster-service-ops-worker
-    restart: unless-stopped
-    environment:
-      - AGENT_ID=vps-service-ops-1
-      - ROLE=service-ops
-      - MQTT_HOST=mosquitto
-      - MQTT_PORT=1883
-      - MQTT_USERNAME=${MQTT_USERNAME:-codex}
-      - MQTT_PASSWORD=${MQTT_PASSWORD}
-      - COMPOSE_PROJECT_NAME=ai-cluster
-    command: ["python", "service_ops_worker.py"]
-    volumes:
-      # Post-migration: compose definition and env are in the repo/runtime paths.
-      # Pre-cutover these are overridden to old paths via docker-compose.override.yml.
-      - /home/oskar/homelab-codex-ws/services/ai-cluster/docker-compose.yml:/app/docker-compose.yml:ro
-      - /opt/homelab/config/ai-cluster/.env:/app/.env:ro
-      - /var/run/docker.sock:/var/run/docker.sock:rw
-    networks:
-      - internal
-
-  redis:
-    image: redis:7-alpine
-    restart: unless-stopped
-    command: ["redis-server"]
-    volumes:
-      - redis_data:/data
-    networks:
-      - internal
-
-  mosquitto:
-    image: eclipse-mosquitto:2
-    container_name: mosquitto
-    restart: unless-stopped
-    command: ["/usr/sbin/mosquitto", "-c", "/mosquitto/config/mosquitto.conf"]
-    ports:
-      # Tailscale IP binding — matches running container
-      - "100.95.58.48:1883:1883"
-    volumes:
-      # Config: kept at old path until mosquitto config migration is complete
-      - /home/dockeruser/docker/ai-cluster/mosquitto:/mosquitto/config:ro
-      - mosquitto_data:/mosquitto/data
-      - mosquitto_log:/mosquitto/log
-    networks:
-      - internal
-
-volumes:
-  redis_data:
-  mosquitto_data:
-  mosquitto_log:
-
-networks:
-  internal:
-    driver: bridge
-    name: ai-cluster_ai-cluster
-  npm_default:
-    external: true
-    name: npm_default
@ -1,14 +0,0 @@
-# AI Cluster — /opt/homelab/config/ai-cluster/.env
-# Read by all worker containers and mounted into service-ops-worker as /app/.env
-
-# MQTT broker credentials
-MQTT_HOST=mosquitto
-MQTT_PORT=1883
-MQTT_USERNAME=codex
-MQTT_PASSWORD=
-
-# API gateway (piha)
-GATEWAY_BASE_URL=http://piha:8080
-
-# Compose project name (required for service-ops-worker docker-compose operations)
-COMPOSE_PROJECT_NAME=ai-cluster
@ -1,15 +0,0 @@
-#!/bin/bash
-# Healthcheck for AI cluster (checks openclaw API gateway is responding)
-
-if ! docker ps --filter "name=ai-cluster-openclaw-1" --filter "status=running" | grep -q "openclaw"; then
-  echo "[FAIL] openclaw container is not running"
-  exit 1
-fi
-
-if ! curl -sf http://localhost:8000/health > /dev/null; then
-  echo "[FAIL] openclaw HTTP health endpoint not responding"
-  exit 1
-fi
-
-echo "[OK] ai-cluster is healthy"
-exit 0
@ -1,37 +0,0 @@
-service:
-  name: ai-cluster
-  owner_node: vps
-  exposure: tailscale-internal
-  dependencies:
-    - mosquitto
-    - redis
-  ports:
-    - container: 8000
-      host: 8000
-      protocol: tcp
-      service: openclaw
-    - container: 1883
-      host: 1883
-      protocol: tcp
-      bind: 100.95.58.48 # Tailscale only
-      service: mosquitto
-  healthcheck:
-    type: http
-    endpoint: http://localhost:8000/health
-    interval: 30s
-    timeout: 10s
-    retries: 3
-  restart_policy: unless-stopped
-  persistence:
-    paths:
-      - volume:mosquitto_config_bind # /home/dockeruser/docker/ai-cluster/mosquitto (bind, not volume)
-  runtime:
-    env_file: /opt/homelab/config/ai-cluster/.env
-    env_vars:
-      - MQTT_PASSWORD
-      - MQTT_USERNAME
-      - GATEWAY_BASE_URL
-  notes:
-    - "Local images (ai-cluster-*) must be built on VPS before deployment"
-    - "service-ops-worker mounts docker.sock and the compose file — needs post-migration path update"
-    - "Recommendation: move ai-cluster compute workloads to SOLARIA (GPU/compute node)"
@ -20,4 +20,5 @@ ENV RUNTIME_PATH=/opt/homelab
 ENV PYTHONUNBUFFERED=1

 # Default command (will be overridden in docker-compose)
+USER homelab
 CMD ["python", "src/operator_ui.py"]
@ -39,10 +39,24 @@ for dir in "${DIRS[@]}"; do
|
||||||
fi
|
fi
|
||||||
done
|
done
|
||||||
|
|
||||||
# 3. chown/chmod for UID 1000
|
# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
|
||||||
echo "Setting permissions for UID 1000 on /opt/homelab..."
|
echo "Checking /opt/homelab ownership..."
|
||||||
sudo chown -R 1000:1000 /opt/homelab
|
_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
|
||||||
sudo chmod -R 775 /opt/homelab 2>/dev/null || true
|
if [[ -n "$_chown_needed" ]]; then
|
||||||
|
echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
|
||||||
|
sudo chown -R 1000:1000 /opt/homelab
|
||||||
|
else
|
||||||
|
echo "Ownership already correct, skipping chown"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Checking /opt/homelab directory permissions..."
|
||||||
|
_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
|
||||||
|
if [[ -n "$_chmod_needed" ]]; then
|
||||||
|
echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
|
||||||
|
sudo chmod -R 775 /opt/homelab 2>/dev/null || true
|
||||||
|
else
|
||||||
|
echo "Permissions already correct, skipping chmod"
|
||||||
|
fi
|
||||||
|
|
||||||
# 4. Run docker compose up -d --build --force-recreate
|
# 4. Run docker compose up -d --build --force-recreate
|
||||||
echo "--- Starting Control Plane Services ---"
|
echo "--- Starting Control Plane Services ---"
|
||||||
|
|
|
||||||
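The hunk above only calls `sudo chown` when `find` reports at least one path not owned by 1000:1000. The same check can be sketched in Python for reference. This is an illustration under stated assumptions, not part of the commit: `iter_paths`, `first_mismatch`, and `demo` are hypothetical names, and the sketch uses `os.lstat` so that symlinks are inspected without being followed, as `find` does by default.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator, Optional


def iter_paths(root: Path) -> Iterator[Path]:
    """Yield root and every file or directory below it (symlinks are not followed)."""
    yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield Path(dirpath) / name


def first_mismatch(root: Path, uid: int, gid: int) -> Optional[Path]:
    """Return the first path not owned by uid:gid, or None if ownership is already correct."""
    for path in iter_paths(root):
        st = os.lstat(path)
        if st.st_uid != uid or st.st_gid != gid:
            return path  # stop at the first hit, like `find ... -print -quit`
    return None


def demo():
    """Build a small tree and check it against its own owner and a deliberately wrong one."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "sub").mkdir()
        (root / "sub" / "f.txt").write_text("x")
        st = os.lstat(root)
        ok = first_mismatch(root, st.st_uid, st.st_gid)
        bad = first_mismatch(root, st.st_uid + 1, st.st_gid)
        return ok, bad == root
```

A caller would run the recursive `chown` only when `first_mismatch` returns a path, which is the "skip when already correct" behaviour the script's log messages describe.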
|
|
@ -56,6 +56,9 @@ services:
|
||||||
executor:
|
executor:
|
||||||
build: .
|
build: .
|
||||||
container_name: control-plane-executor
|
container_name: control-plane-executor
|
||||||
|
user: "1000:1000"
|
||||||
|
group_add:
|
||||||
|
- "999"
|
||||||
command: python src/executor.py
|
command: python src/executor.py
|
||||||
volumes:
|
volumes:
|
||||||
- /opt/homelab:/opt/homelab
|
- /opt/homelab:/opt/homelab
|
||||||
|
|
|
||||||
|
|
@ -5,6 +5,16 @@ import logging
|
||||||
import subprocess
|
import subprocess
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
def _atomic_write_json(path: Path, data) -> None:
|
||||||
|
"""Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
|
||||||
|
tmp = path.with_suffix(".tmp")
|
||||||
|
with open(tmp, "w") as f:
|
||||||
|
json.dump(data, f, indent=2)
|
||||||
|
f.flush()
|
||||||
|
os.fsync(f.fileno())
|
||||||
|
os.replace(tmp, path)
|
||||||
|
|
||||||
# Constants and Paths
|
# Constants and Paths
|
||||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||||
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
||||||
|
|
@ -57,8 +67,7 @@ class Executor:
|
||||||
data = json.load(f)
|
data = json.load(f)
|
||||||
data["status"] = "running"
|
data["status"] = "running"
|
||||||
data["started_at"] = time.time()
|
data["started_at"] = time.time()
|
||||||
with open(running_path, "w") as f:
|
_atomic_write_json(running_path, data)
|
||||||
json.dump(data, f, indent=2)
|
|
||||||
action_file.unlink()
|
action_file.unlink()
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Failed to move {action_id} to running: {e}")
|
logger.error(f"Failed to move {action_id} to running: {e}")
|
||||||
|
|
@ -121,8 +130,7 @@ class Executor:
|
||||||
data["finished_at"] = time.time()
|
data["finished_at"] = time.time()
|
||||||
if not success:
|
if not success:
|
||||||
data["error"] = error_msg
|
data["error"] = error_msg
|
||||||
with open(target_path, "w") as f:
|
_atomic_write_json(target_path, data)
|
||||||
json.dump(data, f, indent=2)
|
|
||||||
running_path.unlink()
|
running_path.unlink()
|
||||||
logger.info(f"Action {action_id} {target_status}")
|
logger.info(f"Action {action_id} {target_status}")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
|
|
|
||||||
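The hunks above replace a direct `open(path, "w")` plus `json.dump` with `_atomic_write_json`. A minimal, self-contained sketch of that pattern follows, assuming a POSIX filesystem where the temporary file and the target sit in the same directory. The names `atomic_write_json`, `demo`, and `action.json` are illustrative only.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write JSON to a sibling .tmp file, fsync it, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk before the rename
    os.replace(tmp, path)  # readers see either the old file or the new one


def demo() -> dict:
    """Overwrite the same file twice and report what a reader would see afterwards."""
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "action.json"
        atomic_write_json(target, {"status": "pending"})
        atomic_write_json(target, {"status": "running"})
        return {
            "data": json.loads(target.read_text()),
            "tmp_left": target.with_suffix(".tmp").exists(),
        }
```

One property worth keeping in mind: `with_suffix(".tmp")` replaces the final suffix, so `a.json` and `a.yaml` in the same directory would both map to `a.tmp`. That is harmless when every file in a directory has a unique stem, as the action files keyed by `action_id` appear to.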
|
|
@ -147,12 +147,18 @@ def current_deployments():
|
||||||
|
|
||||||
|
|
||||||
def current_incidents():
|
def current_incidents():
|
||||||
"""Return incidents as a list sorted most-recent-first."""
|
"""Return active incidents as a list sorted most-recent-first.
|
||||||
|
|
||||||
|
Only incidents with status='active' are returned; resolved and cancelled
|
||||||
|
records are excluded so the dashboard reflects the current operational state.
|
||||||
|
"""
|
||||||
raw = read_json_file(WORLD_DIR / "incidents.json", default={})
|
raw = read_json_file(WORLD_DIR / "incidents.json", default={})
|
||||||
if isinstance(raw, list):
|
if isinstance(raw, list):
|
||||||
return raw
|
return [i for i in raw if i.get("status") == "active"]
|
||||||
result = []
|
result = []
|
||||||
for inc in raw.values():
|
for inc in raw.values():
|
||||||
|
if inc.get("status") != "active":
|
||||||
|
continue
|
||||||
# Synthesise a human-readable message if not stored (observer doesn't set one).
|
# Synthesise a human-readable message if not stored (observer doesn't set one).
|
||||||
if "message" not in inc:
|
if "message" not in inc:
|
||||||
inc = dict(inc)
|
inc = dict(inc)
|
||||||
|
|
|
||||||
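The change to `current_incidents()` filters on `status == "active"` for both shapes that `incidents.json` can take (a list or a dict keyed by incident id). A compact sketch of just that filter is shown below. `active_incidents` is a hypothetical helper, and the sorting and message synthesis done by the real function are left out.

```python
def active_incidents(raw):
    """Return only the incidents whose status is 'active', for list or dict input."""
    records = raw if isinstance(raw, list) else list(raw.values())
    return [inc for inc in records if inc.get("status") == "active"]


# Example data in the dict shape, keyed by incident id.
sample = {
    "inc-1": {"id": "inc-1", "status": "active"},
    "inc-2": {"id": "inc-2", "status": "resolved"},
    "inc-3": {"id": "inc-3", "status": "cancelled"},
}
```

Records without a `status` key are dropped as well, because `inc.get("status")` returns `None` for them.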
|
|
@ -5,6 +5,16 @@ import logging
|
||||||
import yaml
|
import yaml
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
def _atomic_write_json(path: Path, data) -> None:
|
||||||
|
"""Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
|
||||||
|
tmp = path.with_suffix(".tmp")
|
||||||
|
with open(tmp, "w") as f:
|
||||||
|
json.dump(data, f, indent=2)
|
||||||
|
f.flush()
|
||||||
|
os.fsync(f.fileno())
|
||||||
|
os.replace(tmp, path)
|
||||||
|
|
||||||
# Constants and Paths
|
# Constants and Paths
|
||||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||||
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
||||||
|
|
@ -175,7 +185,11 @@ class Supervisor:
|
||||||
logger.error(f"Failed to load {svc_file}: {e}")
|
logger.error(f"Failed to load {svc_file}: {e}")
|
||||||
self.desired_state["services"] = services
|
self.desired_state["services"] = services
|
||||||
|
|
||||||
def _load_actual_state(self):
|
def _load_actual_state(self) -> bool:
|
||||||
|
"""Load world state from disk. Returns False if any file is unreadable
|
||||||
|
(empty / mid-write truncation), in which case actual_state is NOT updated
|
||||||
|
so the caller can skip this reconcile cycle rather than treating missing
|
||||||
|
data as a real drift signal."""
|
||||||
files = {
|
files = {
|
||||||
"services": WORLD_DIR / "services.json",
|
"services": WORLD_DIR / "services.json",
|
||||||
"nodes": WORLD_DIR / "nodes.json",
|
"nodes": WORLD_DIR / "nodes.json",
|
||||||
|
|
@ -188,8 +202,11 @@ class Supervisor:
|
||||||
with open(path, "r") as f:
|
with open(path, "r") as f:
|
||||||
raw[key] = json.load(f)
|
raw[key] = json.load(f)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Failed to load {key} actual state: {e}")
|
logger.warning(
|
||||||
raw[key] = {}
|
f"World state {path.name} unreadable (truncated write?): {e} "
|
||||||
|
f"— skipping reconcile cycle, keeping last known state"
|
||||||
|
)
|
||||||
|
return False
|
||||||
else:
|
else:
|
||||||
raw[key] = {}
|
raw[key] = {}
|
||||||
|
|
||||||
|
|
@ -219,6 +236,7 @@ class Supervisor:
|
||||||
self.actual_state["services"] = normalized_services
|
self.actual_state["services"] = normalized_services
|
||||||
self.actual_state["nodes"] = raw.get("nodes", {})
|
self.actual_state["nodes"] = raw.get("nodes", {})
|
||||||
self.actual_state["incidents"] = normalized_incidents
|
self.actual_state["incidents"] = normalized_incidents
|
||||||
|
return True
|
||||||
|
|
||||||
# ------------------------------------------------------------------
|
# ------------------------------------------------------------------
|
||||||
# Incident helpers
|
# Incident helpers
|
||||||
|
|
@ -252,7 +270,8 @@ class Supervisor:
|
||||||
logger.error(f"Failed to touch heartbeat file: {e}")
|
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||||
|
|
||||||
self._load_desired_state()
|
self._load_desired_state()
|
||||||
self._load_actual_state()
|
if not self._load_actual_state():
|
||||||
|
return # world state unreadable this cycle — skip to avoid false drift
|
||||||
|
|
||||||
drifts = []
|
drifts = []
|
||||||
|
|
||||||
|
|
@ -375,8 +394,7 @@ class Supervisor:
|
||||||
|
|
||||||
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
||||||
try:
|
try:
|
||||||
with open(action_path, "w") as f:
|
_atomic_write_json(action_path, action)
|
||||||
json.dump(action, f, indent=2)
|
|
||||||
logger.info(
|
logger.info(
|
||||||
f"Generated recommendation: {action_id} "
|
f"Generated recommendation: {action_id} "
|
||||||
f"(type={action['type']}, risk={action['risk_level']})"
|
f"(type={action['type']}, risk={action['risk_level']})"
|
||||||
|
|
@ -428,8 +446,7 @@ class Supervisor:
|
||||||
|
|
||||||
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
||||||
try:
|
try:
|
||||||
with open(action_path, "w") as f:
|
_atomic_write_json(action_path, action)
|
||||||
json.dump(action, f, indent=2)
|
|
||||||
logger.info(
|
logger.info(
|
||||||
f"Generated disk cleanup recommendation: {action_id} "
|
f"Generated disk cleanup recommendation: {action_id} "
|
||||||
f"(node={node}, risk=guarded)"
|
f"(node={node}, risk=guarded)"
|
||||||
|
|
@ -494,8 +511,7 @@ class Supervisor:
|
||||||
action["status"] = "cancelled"
|
action["status"] = "cancelled"
|
||||||
action["cancelled_reason"] = cancel_reason
|
action["cancelled_reason"] = cancel_reason
|
||||||
action["cancelled_at"] = time.time()
|
action["cancelled_at"] = time.time()
|
||||||
with open(dest, "w") as f:
|
_atomic_write_json(dest, action)
|
||||||
json.dump(action, f, indent=2)
|
|
||||||
action_file.unlink()
|
action_file.unlink()
|
||||||
logger.info(
|
logger.info(
|
||||||
f"Auto-cancelled {action_file.name}: "
|
f"Auto-cancelled {action_file.name}: "
|
||||||
|
|
@ -725,8 +741,7 @@ class Supervisor:
|
||||||
action["status"] = "cancelled"
|
action["status"] = "cancelled"
|
||||||
action["cancelled_reason"] = "ha_websocket_recovered"
|
action["cancelled_reason"] = "ha_websocket_recovered"
|
||||||
action["cancelled_at"] = time.time()
|
action["cancelled_at"] = time.time()
|
||||||
with open(dest, "w") as f:
|
_atomic_write_json(dest, action)
|
||||||
json.dump(action, f, indent=2)
|
|
||||||
pending_path.unlink()
|
pending_path.unlink()
|
||||||
logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
|
logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
|
|
@ -736,8 +751,7 @@ class Supervisor:
|
||||||
action_id = action["action_id"]
|
action_id = action["action_id"]
|
||||||
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
|
||||||
try:
|
try:
|
||||||
with open(action_path, "w") as f:
|
_atomic_write_json(action_path, action)
|
||||||
json.dump(action, f, indent=2)
|
|
||||||
logger.info(
|
logger.info(
|
||||||
f"Generated HA action: {action_id} "
|
f"Generated HA action: {action_id} "
|
||||||
f"(type={action['type']}, risk={action['risk_level']})"
|
f"(type={action['type']}, risk={action['risk_level']})"
|
||||||
|
|
|
||||||
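The `_load_actual_state` change makes a failed parse return `False` and leave the previous state in place, so the caller can skip the reconcile cycle. The sketch below isolates that idea. `StateLoader` and `demo` are hypothetical names, and treating a missing file as an empty dict is an assumption read from the `else: raw[key] = {}` context lines.

```python
import json
import tempfile
from pathlib import Path


class StateLoader:
    """Holds the last successfully parsed state; a failed load leaves it untouched."""

    def __init__(self, world_dir: Path, names=("services", "nodes", "incidents")):
        self.world_dir = world_dir
        self.names = names
        self.state = {name: {} for name in names}

    def load(self) -> bool:
        raw = {}
        for name in self.names:
            path = self.world_dir / f"{name}.json"
            if not path.exists():
                raw[name] = {}  # assumption: a missing file counts as empty
                continue
            try:
                raw[name] = json.loads(path.read_text())
            except (OSError, json.JSONDecodeError):
                return False  # empty or truncated file: keep the previous state
        self.state = raw  # committed only after every file has parsed
        return True


def demo():
    """Load a good file, then simulate a truncated write and load again."""
    with tempfile.TemporaryDirectory() as d:
        world = Path(d)
        services = world / "services.json"
        services.write_text('{"vps/outline": {"status": "healthy"}}')
        loader = StateLoader(world)
        first = loader.load()
        services.write_text("")  # what a reader sees mid-write without atomic rename
        second = loader.load()
        return first, second, loader.state["services"]
```

Building the new state in a local `raw` dict and assigning it only at the end is what preserves the last known good values when any single file fails.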
333
services/control-plane/tests/test_incident_lifecycle.py
Normal file
|
|
@ -0,0 +1,333 @@
|
||||||
|
"""Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
# Observer lives outside the control-plane package; add scripts/ to path.
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
|
||||||
|
from observer.observer import Observer, _parse_ts, _atomic_write_json
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Helpers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _make_observer(tmp_path: Path) -> Observer:
|
||||||
|
"""Return an Observer with all runtime paths redirected to tmp_path."""
|
||||||
|
import observer.observer as obs_mod
|
||||||
|
|
||||||
|
world = tmp_path / "world"
|
||||||
|
state = tmp_path / "state"
|
||||||
|
events = tmp_path / "events"
|
||||||
|
logs = tmp_path / "logs"
|
||||||
|
repo = tmp_path / "repo"
|
||||||
|
|
||||||
|
for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
|
||||||
|
d.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# Minimal topology so inventory isn't empty (avoids prune-guard early-return)
|
||||||
|
(repo / "inventory" / "topology.yaml").write_text(
|
||||||
|
"nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
|
||||||
|
)
|
||||||
|
|
||||||
|
original_world = obs_mod.WORLD_DIR
|
||||||
|
original_state = obs_mod.STATE_DIR
|
||||||
|
original_events = obs_mod.EVENTS_DIR
|
||||||
|
original_logs = obs_mod.LOGS_DIR
|
||||||
|
original_inventory = obs_mod.INVENTORY_TOPOLOGY
|
||||||
|
original_repo = obs_mod.REPO_ROOT
|
||||||
|
|
||||||
|
obs_mod.WORLD_DIR = world
|
||||||
|
obs_mod.STATE_DIR = state
|
||||||
|
obs_mod.EVENTS_DIR = events
|
||||||
|
obs_mod.LOGS_DIR = logs
|
||||||
|
obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
|
||||||
|
obs_mod.REPO_ROOT = repo
|
||||||
|
|
||||||
|
obs = Observer()
|
||||||
|
|
||||||
|
# Restore module-level constants (monkeypatching at module level is sufficient
|
||||||
|
# for the Observer instance which captures paths at construction time via globals)
|
||||||
|
obs_mod.WORLD_DIR = original_world
|
||||||
|
obs_mod.STATE_DIR = original_state
|
||||||
|
obs_mod.EVENTS_DIR = original_events
|
||||||
|
obs_mod.LOGS_DIR = original_logs
|
||||||
|
obs_mod.INVENTORY_TOPOLOGY = original_inventory
|
||||||
|
obs_mod.REPO_ROOT = original_repo
|
||||||
|
|
||||||
|
return obs
|
||||||
|
|
||||||
|
|
||||||
|
def _make_observer_simple(tmp_path: Path):
|
||||||
|
"""Return an Observer instance and patch its world_state in-place."""
|
||||||
|
import observer.observer as obs_mod
|
||||||
|
|
||||||
|
world = tmp_path / "world"
|
||||||
|
state = tmp_path / "state"
|
||||||
|
events = tmp_path / "events"
|
||||||
|
logs = tmp_path / "logs"
|
||||||
|
repo = tmp_path / "repo"
|
||||||
|
|
||||||
|
for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
|
||||||
|
d.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
(repo / "inventory" / "topology.yaml").write_text(
|
||||||
|
"nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Patch before construction
|
||||||
|
obs_mod.WORLD_DIR = world
|
||||||
|
obs_mod.STATE_DIR = state
|
||||||
|
obs_mod.EVENTS_DIR = events
|
||||||
|
obs_mod.LOGS_DIR = logs
|
||||||
|
obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
|
||||||
|
obs_mod.REPO_ROOT = repo
|
||||||
|
|
||||||
|
obs = Observer()
|
||||||
|
return obs
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 1. _parse_ts — timestamp normalisation
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def test_parse_ts_int():
|
||||||
|
ts = int(time.time()) - 3600
|
||||||
|
assert abs(_parse_ts(ts) - ts) < 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_parse_ts_float():
|
||||||
|
ts = time.time() - 100.5
|
||||||
|
assert abs(_parse_ts(ts) - ts) < 0.01
|
||||||
|
|
||||||
|
|
||||||
|
def test_parse_ts_iso_string():
|
||||||
|
# ISO format as emitted by events.py / stability-agent
|
||||||
|
from datetime import datetime, timezone
|
||||||
|
iso = "2026-06-01T00:03:22Z"
|
||||||
|
expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
|
||||||
|
result = _parse_ts(iso)
|
||||||
|
assert result > 0
|
||||||
|
assert isinstance(result, float)
|
||||||
|
assert abs(result - expected) < 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_parse_ts_none_returns_zero():
|
||||||
|
assert _parse_ts(None) == 0.0
|
||||||
|
|
||||||
|
|
||||||
|
def test_parse_ts_garbage_returns_zero():
|
||||||
|
assert _parse_ts("not-a-date") == 0.0
|
||||||
|
|
||||||
|
|
||||||
|
def test_parse_ts_zero_int():
|
||||||
|
assert _parse_ts(0) == 0.0
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 2. Lifecycle: service_healthy event resolves linked incident
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def test_service_healthy_resolves_active_incident(tmp_path):
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-111-vps-outline"
|
||||||
|
obs.world_state["services"]["vps/outline"] = {
|
||||||
|
"node": "vps", "service": "outline",
|
||||||
|
"status": "unhealthy", "last_check": None,
|
||||||
|
"incident_id": inc_id,
|
||||||
|
}
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "node": "vps", "service": "outline",
|
||||||
|
"status": "active", "trigger_type": "service_unhealthy",
|
||||||
|
"started_at": int(time.time()) - 600,
|
||||||
|
"last_occurrence": int(time.time()) - 600,
|
||||||
|
"occurrence_count": 1, "events": [],
|
||||||
|
}
|
||||||
|
|
||||||
|
obs.process_event({
|
||||||
|
"type": "service_healthy",
|
||||||
|
"node": "vps",
|
||||||
|
"service": "outline",
|
||||||
|
"severity": "info",
|
||||||
|
"timestamp": int(time.time()),
|
||||||
|
"payload": {},
|
||||||
|
})
|
||||||
|
|
||||||
|
assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
|
||||||
|
assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
|
||||||
|
|
||||||
|
|
||||||
|
def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
|
||||||
|
"""service_healthy for service A must not touch incident for service B."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_b = "inc-222-vps-supervisor"
|
||||||
|
obs.world_state["services"]["vps/supervisor"] = {
|
||||||
|
"node": "vps", "service": "supervisor",
|
||||||
|
"status": "unhealthy", "last_check": None,
|
||||||
|
"incident_id": inc_b,
|
||||||
|
}
|
||||||
|
obs.world_state["incidents"][inc_b] = {
|
||||||
|
"id": inc_b, "status": "active",
|
||||||
|
"last_occurrence": int(time.time()) - 300,
|
||||||
|
}
|
||||||
|
|
||||||
|
obs.process_event({
|
||||||
|
"type": "service_healthy",
|
||||||
|
"node": "vps",
|
||||||
|
"service": "outline", # different service
|
||||||
|
"severity": "info",
|
||||||
|
"timestamp": int(time.time()),
|
||||||
|
"payload": {},
|
||||||
|
})
|
||||||
|
|
||||||
|
assert obs.world_state["incidents"][inc_b]["status"] == "active"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def test_prune_resolves_healthy_linked_incident(tmp_path):
|
||||||
|
"""If a service is healthy but still points at an active incident, resolve it."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-333-vps-outline"
|
||||||
|
obs.world_state["services"]["vps/outline"] = {
|
||||||
|
"node": "vps", "service": "outline",
|
||||||
|
"status": "healthy", # <-- healthy but incident_id still set
|
||||||
|
"last_check": None,
|
||||||
|
"incident_id": inc_id,
|
||||||
|
}
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "active",
|
||||||
|
"started_at": int(time.time()) - 7200,
|
||||||
|
"last_occurrence": int(time.time()) - 7200,
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world()
|
||||||
|
|
||||||
|
assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
|
||||||
|
|
||||||
|
|
||||||
|
def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
|
||||||
|
"""Healthy-linked incident with ISO-string last_occurrence must still resolve."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-444-vps-outline"
|
||||||
|
obs.world_state["services"]["vps/outline"] = {
|
||||||
|
"node": "vps", "service": "outline",
|
||||||
|
"status": "healthy", "last_check": None, "incident_id": inc_id,
|
||||||
|
}
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "active",
|
||||||
|
"last_occurrence": "2026-06-01T00:03:22Z", # ISO string from events.py
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world() # must not raise TypeError
|
||||||
|
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
|
||||||
|
"""Orphaned active incident older than 5 min must be auto-resolved."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-555-vps-supervisor"
|
||||||
|
# No service entry links to this incident
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
|
||||||
|
"last_occurrence": int(time.time()) - 400, # 6.7 min ago
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world()
|
||||||
|
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
|
||||||
|
|
||||||
|
|
||||||
|
def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
|
||||||
|
"""Orphaned incident younger than 5 min must stay active (guard against race)."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-666-vps-supervisor"
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "active",
|
||||||
|
"last_occurrence": int(time.time()) - 60, # 1 min ago — within guard
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world()
|
||||||
|
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "active"
|
||||||
|
|
||||||
|
|
||||||
|
def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
|
||||||
|
"""Orphaned incident with ISO-string last_occurrence must resolve correctly."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-777-vps-outline"
|
||||||
|
# ISO timestamp well in the past (2026-06-01)
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "active",
|
||||||
|
"last_occurrence": "2026-06-01T00:03:22Z",
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world() # must not raise TypeError
|
||||||
|
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
|
||||||
|
|
||||||
|
|
||||||
|
def test_prune_does_not_touch_linked_incident(tmp_path):
|
||||||
|
"""An active incident still linked from a non-healthy service must stay active."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-888-vps-outline"
|
||||||
|
obs.world_state["services"]["vps/outline"] = {
|
||||||
|
"node": "vps", "service": "outline",
|
||||||
|
"status": "unhealthy", # <-- still unhealthy
|
||||||
|
"last_check": None,
|
||||||
|
"incident_id": inc_id,
|
||||||
|
}
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "active",
|
||||||
|
"last_occurrence": int(time.time()) - 3600,
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world()
|
||||||
|
|
||||||
|
assert obs.world_state["incidents"][inc_id]["status"] == "active"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 5. 7-day stale incident prune with ISO resolved_at
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
|
||||||
|
"""Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-old-resolved"
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "resolved",
|
||||||
|
"resolved_at": "2026-05-01T00:00:00Z", # >7 days before 2026-06-03
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world()
|
||||||
|
|
||||||
|
assert inc_id not in obs.world_state["incidents"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_prune_keeps_recently_resolved_incident(tmp_path):
|
||||||
|
"""Resolved incidents within 7 days must be kept."""
|
||||||
|
obs = _make_observer_simple(tmp_path)
|
||||||
|
inc_id = "inc-recent-resolved"
|
||||||
|
obs.world_state["incidents"][inc_id] = {
|
||||||
|
"id": inc_id, "status": "resolved",
|
||||||
|
"resolved_at": time.time() - 86400, # 1 day ago
|
||||||
|
}
|
||||||
|
|
||||||
|
obs._prune_stale_world()
|
||||||
|
|
||||||
|
assert inc_id in obs.world_state["incidents"]
|
||||||
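The `_parse_ts` tests above pin down a contract: numbers pass through as floats, ISO-8601 strings with a trailing `Z` become epoch seconds, and `None` or garbage yields `0.0`. The observer's actual implementation is not shown in this diff, so the following is only one possible function that satisfies those tests. `parse_ts` is a hypothetical name, and treating timezone-less strings as UTC is an assumption.

```python
from datetime import datetime, timezone


def parse_ts(value) -> float:
    """Normalise an epoch number or ISO-8601 string to epoch seconds; 0.0 if unparseable."""
    if value is None or isinstance(value, bool):
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value
        if text.endswith("Z"):
            # Older Python versions reject a trailing "Z" in fromisoformat,
            # so spell it as an explicit UTC offset.
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive strings are UTC
        return dt.timestamp()
    return 0.0
```

Returning `0.0` instead of raising means an unparseable timestamp compares as very old, which matches the tests that expect ISO-stamped incidents to resolve without a `TypeError`.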
199
services/control-plane/tests/test_state_reliability.py
Normal file
|
|
@ -0,0 +1,199 @@
|
||||||
|
"""Tests for atomic writes and resilient world-state loading in the supervisor."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
|
||||||
|
import supervisor as supervisor_module
|
||||||
|
from supervisor import Supervisor, _atomic_write_json
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Helpers (reused from test_supervisor_ha)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
|
||||||
|
actions = tmp_path / "actions"
|
||||||
|
events = tmp_path / "events"
|
||||||
|
world = tmp_path / "world"
|
||||||
|
repo = tmp_path / "repo"
|
||||||
|
|
||||||
|
for d in (actions, events, world, repo / "hosts"):
|
||||||
|
d.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
|
||||||
|
monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
|
||||||
|
monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
|
||||||
|
monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)
|
||||||
|
|
||||||
|
sup = Supervisor()
|
||||||
|
sup.desired_state = {"services": {}}
|
||||||
|
sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
|
||||||
|
return sup
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 1. atomic_write_json correctness
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def test_atomic_write_json_produces_valid_json(tmp_path):
|
||||||
|
path = tmp_path / "out.json"
|
||||||
|
data = {"services": {"vps/outline": {"status": "healthy"}}, "count": 42}
|
||||||
|
_atomic_write_json(path, data)
|
||||||
|
|
||||||
|
assert path.exists(), "output file must exist after atomic write"
|
||||||
|
loaded = json.loads(path.read_text())
|
||||||
|
assert loaded == data
|
||||||
|
|
||||||
|
|
||||||
|
def test_atomic_write_json_no_tmp_left_behind(tmp_path):
|
||||||
|
path = tmp_path / "world.json"
|
||||||
|
_atomic_write_json(path, {"ok": True})
|
||||||
|
|
||||||
|
tmp = path.with_suffix(".tmp")
|
||||||
|
assert not tmp.exists(), ".tmp must be cleaned up by os.replace"
|
||||||
|
|
||||||
|
|
||||||
|
def test_atomic_write_json_overwrites_existing(tmp_path):
|
||||||
|
path = tmp_path / "state.json"
|
||||||
|
path.write_text('{"old": true}')
|
||||||
|
_atomic_write_json(path, {"new": True})
|
||||||
|
assert json.loads(path.read_text()) == {"new": True}
|
||||||
|
|
||||||
|
|
||||||
|
def test_atomic_write_json_nested_structure(tmp_path):
|
||||||
|
path = tmp_path / "complex.json"
|
||||||
|
data = {
|
||||||
|
"nodes": {"vps": {"status": "online", "disk_usage_pct": 42}},
|
||||||
|
"incidents": {},
|
||||||
|
"list": [1, 2, 3],
|
||||||
|
}
|
||||||
|
_atomic_write_json(path, data)
|
||||||
|
assert json.loads(path.read_text()) == data
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# 2. Resilient loader: empty / truncated file → skip cycle, no drift
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _populate_desired(sup: Supervisor, svc_key: str = "vps/outline"):
|
||||||
|
node, service = svc_key.split("/", 1)
|
||||||
|
sup.desired_state["services"][svc_key] = {
|
||||||
|
"node": node,
|
||||||
|
"service": service,
|
||||||
|
"desired": "running",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def test_empty_services_json_skips_reconcile(tmp_path, monkeypatch):
|
||||||
|
"""Empty services.json (truncated write) must not generate any redeploy action."""
|
||||||
|
sup = _setup_supervisor(tmp_path, monkeypatch)
|
||||||
|
_populate_desired(sup)
|
||||||
|
|
||||||
|
# Write empty services.json — simulates a mid-write truncation
|
||||||
|
(tmp_path / "world" / "services.json").write_text("")
|
||||||
|
(tmp_path / "world" / "nodes.json").write_text("{}")
|
||||||
|
(tmp_path / "world" / "incidents.json").write_text("{}")
|
||||||
|
|
||||||
|
sup.reconcile()
|
||||||
|
|
||||||
|
pending = list((tmp_path / "actions" / "pending").glob("*.json"))
|
||||||
|
assert pending == [], f"No actions should be generated on empty state file, got: {[p.name for p in pending]}"
|
||||||
|
|
||||||
|
|
||||||
|
def test_truncated_services_json_skips_reconcile(tmp_path, monkeypatch):
|
||||||
|
"""Partially-written (truncated mid-write) JSON must not generate any action."""
|
||||||
|
sup = _setup_supervisor(tmp_path, monkeypatch)
|
||||||
|
_populate_desired(sup)
|
||||||
|
|
||||||
|
(tmp_path / "world" / "services.json").write_text('{"vps/outline": {"status": "hea')
|
||||||
|
(tmp_path / "world" / "nodes.json").write_text("{}")
|
||||||
|
(tmp_path / "world" / "incidents.json").write_text("{}")
|
||||||
|
|
||||||
|
sup.reconcile()
|
||||||
|
|
||||||
|
pending = list((tmp_path / "actions" / "pending").glob("*.json"))
|
||||||
|
assert pending == [], f"No actions expected on truncated state, got: {[p.name for p in pending]}"
|
||||||
|
|
||||||
|
|
||||||
|
def test_empty_incidents_json_skips_reconcile(tmp_path, monkeypatch):
|
||||||
|
"""Empty incidents.json (any world-state file failing) skips full cycle."""
|
||||||
|
sup = _setup_supervisor(tmp_path, monkeypatch)
|
||||||
|
_populate_desired(sup)
|
||||||
|
|
||||||
|
(tmp_path / "world" / "services.json").write_text("{}")
|
||||||
|
(tmp_path / "world" / "nodes.json").write_text("{}")
|
||||||
|
(tmp_path / "world" / "incidents.json").write_text("")
|
||||||
|
|
||||||
|
sup.reconcile()
|
||||||
|
|
||||||
|
pending = list((tmp_path / "actions" / "pending").glob("*.json"))
|
||||||
|
assert pending == [], f"No actions expected when any state file is unreadable, got: {[p.name for p in pending]}"
|
||||||
|
|
||||||
|
|
||||||
|
def test_load_actual_state_returns_false_on_empty_file(tmp_path, monkeypatch):
|
||||||
|
"""_load_actual_state must return False (not raise) when a file is empty."""
|
||||||
|
sup = _setup_supervisor(tmp_path, monkeypatch)
|
||||||
|
|
||||||
|
(tmp_path / "world" / "services.json").write_text("")
|
||||||
|
(tmp_path / "world" / "nodes.json").write_text("{}")
|
||||||
|
(tmp_path / "world" / "incidents.json").write_text("{}")
|
||||||
|
|
||||||
|
result = sup._load_actual_state()
|
||||||
|
assert result is False
|
||||||
|
|
||||||
|
|
||||||
|
+def test_load_actual_state_returns_true_on_valid_files(tmp_path, monkeypatch):
+    """_load_actual_state returns True and populates actual_state on valid files."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
+    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
+    (tmp_path / "world" / "nodes.json").write_text('{"vps": {"status": "online"}}')
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    result = sup._load_actual_state()
+    assert result is True
+    assert "vps/outline" in sup.actual_state["services"]
+
+
+def test_parse_failure_preserves_last_known_good_state(tmp_path, monkeypatch):
+    """When a file becomes unreadable, actual_state retains the previous good values."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+
+    # First successful load
+    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
+    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+    assert sup._load_actual_state() is True
+    assert "vps/outline" in sup.actual_state["services"]
+
+    # File becomes empty (race condition)
+    (tmp_path / "world" / "services.json").write_text("")
+    assert sup._load_actual_state() is False
+
+    # State must be unchanged from the previous good load
+    assert "vps/outline" in sup.actual_state["services"], \
+        "Last-known-good state must be preserved on parse failure"
+
+
+def test_healthy_service_does_not_generate_action(tmp_path, monkeypatch):
+    """A desired service that appears healthy in world state generates no action."""
+    sup = _setup_supervisor(tmp_path, monkeypatch)
+    _populate_desired(sup)
+
+    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
+    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
+    (tmp_path / "world" / "nodes.json").write_text("{}")
+    (tmp_path / "world" / "incidents.json").write_text("{}")
+
+    sup.reconcile()
+
+    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
+    assert pending == [], "Healthy service must not generate any action"
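Review note: the tests above pin down a contract for `_load_actual_state`: it returns `False` instead of raising when any world-state file is empty or unparseable, and it leaves the previously loaded state untouched in that case. The supervisor's real implementation is not part of this diff, so the following is only a minimal sketch of one way to satisfy that contract. The class name `WorldStateLoader`, the method name `load_actual_state`, and the all-or-nothing swap are assumptions made for illustration, not the project's code.

```python
import json
from pathlib import Path


class WorldStateLoader:
    """Hypothetical stand-in for the supervisor's state loading (illustration only)."""

    # File stems taken from the tests: services.json, nodes.json, incidents.json
    FILES = ("services", "nodes", "incidents")

    def __init__(self, world_dir):
        self.world_dir = Path(world_dir)
        # Start with empty state so lookups are safe before the first good load.
        self.actual_state = {name: {} for name in self.FILES}

    def load_actual_state(self):
        """Return True and swap in fresh state only if every file parses."""
        fresh = {}
        for name in self.FILES:
            path = self.world_dir / f"{name}.json"
            try:
                data = json.loads(path.read_text())
            except (OSError, json.JSONDecodeError):
                # Empty, truncated, or missing file: keep the last-known-good state.
                return False
            if not isinstance(data, dict):
                return False
            fresh[name] = data
        # Assign only after all files parsed, so a partial read never leaks in.
        self.actual_state = fresh
        return True
```

Building the new state in a local dictionary and assigning it at the end is what keeps a mid-write read of one file from partially overwriting good data from an earlier cycle.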
@ -1,44 +0,0 @@
-services:
-  app:
-    image: joplin/server:latest
-    container_name: joplin-server
-    restart: unless-stopped
-    env_file:
-      - /opt/homelab/config/joplin/.env
-    ports:
-      - "127.0.0.1:22300:22300"
-    depends_on:
-      db:
-        condition: service_healthy
-    networks:
-      - joplin_net
-      - npm_default
-
-  db:
-    image: postgres:18
-    container_name: joplin-db
-    restart: unless-stopped
-    env_file:
-      - /opt/homelab/config/joplin/.env
-    volumes:
-      - postgres_data:/var/lib/postgresql
-    networks:
-      - joplin_net
-    healthcheck:
-      test: ["CMD-SHELL", "pg_isready -U joplin -d joplin"]
-      interval: 10s
-      timeout: 5s
-      retries: 5
-
-volumes:
-  postgres_data:
-    external: true
-    name: joplin_postgres_data
-
-networks:
-  joplin_net:
-    driver: bridge
-    name: joplin-net
-  npm_default:
-    external: true
-    name: npm_default
@ -1,20 +0,0 @@
-# Joplin Server — /opt/homelab/config/joplin/.env
-# Both the `app` (joplin-server) and `db` (postgres) containers read this file.
-
-# Application
-APP_BASE_URL=https://joplin.example.com
-APP_PORT=22300
-TRUST_PROXY=1
-RUNNING_IN_DOCKER=1
-
-# Database connection (joplin-server reads these)
-DB_CLIENT=pg
-POSTGRES_HOST=db
-POSTGRES_PORT=5432
-POSTGRES_USER=joplin
-POSTGRES_DB=joplin
-POSTGRES_DATABASE=joplin
-POSTGRES_PASSWORD=
-
-# Runtime
-PM2_HOME=/opt/pm2
@ -1,15 +0,0 @@
-#!/bin/bash
-# Healthcheck for Joplin Server
-
-if ! docker ps --filter "name=joplin-server" --filter "status=running" | grep -q "joplin-server"; then
-    echo "[FAIL] joplin-server container is not running"
-    exit 1
-fi
-
-if ! curl -sf http://localhost:22300/api/ping > /dev/null; then
-    echo "[FAIL] Joplin Server HTTP endpoint not responding"
-    exit 1
-fi
-
-echo "[OK] Joplin Server is healthy"
-exit 0
@ -1,31 +0,0 @@
-service:
-  name: joplin
-  owner_node: vps
-  exposure: tailscale-internal
-  dependencies:
-    - db
-  ports:
-    - container: 22300
-      host: 22300
-      protocol: tcp
-      bind: 127.0.0.1
-  healthcheck:
-    type: http
-    endpoint: http://localhost:22300/api/ping
-    interval: 30s
-    timeout: 10s
-    retries: 3
-  restart_policy: unless-stopped
-  persistence:
-    paths:
-      - volume:joplin_postgres_data # Joplin notes DB
-  runtime:
-    env_file: /opt/homelab/config/joplin/.env
-    env_vars:
-      - APP_BASE_URL
-      - APP_PORT
-      - DB_CLIENT
-      - POSTGRES_HOST
-      - POSTGRES_USER
-      - POSTGRES_PASSWORD
-      - POSTGRES_DB
@ -14,8 +14,11 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 # pyyaml : may be needed for reading host config snippets
 RUN pip install --no-cache-dir "docker>=6.0" psutil pyyaml
 
+RUN useradd -m -u 1000 homelab
+
 COPY src/ /app/src/
 
 ENV PYTHONUNBUFFERED=1
 
+USER homelab
 CMD ["python", "src/node_agent.py"]
@ -2,6 +2,9 @@ services:
   node-agent:
     build: .
     container_name: node-agent
+    user: "1000:1000"
+    group_add:
+      - "999"
     restart: unless-stopped
 
     environment:
@ -8,7 +8,5 @@ services:
       - '81:81'
       - '443:443'
     volumes:
-      # Data lives at dockeruser's path — do NOT move these without a migration plan.
-      # Proxy hosts, SSL certs, and DB are stored here.
-      - /home/dockeruser/docker/npm/data:/data
-      - /home/dockeruser/docker/npm/letsencrypt:/etc/letsencrypt
+      - /opt/homelab/data/npm/data:/data
+      - /opt/homelab/data/npm/letsencrypt:/etc/letsencrypt
@ -22,6 +22,10 @@ service:
   restart_policy: unless-stopped
   persistence:
     paths:
-      - /home/dockeruser/docker/npm/data
-      - /home/dockeruser/docker/npm/letsencrypt
+      - /opt/homelab/data/npm/data
+      - /opt/homelab/data/npm/letsencrypt
+  runtime:
+    directories:
+      - /opt/homelab/data/npm/data
+      - /opt/homelab/data/npm/letsencrypt
     env_vars: []
@ -1,68 +0,0 @@
-services:
-  outline:
-    image: outlinewiki/outline:1.6.1
-    container_name: outline-outline-1
-    restart: unless-stopped
-    env_file:
-      - /opt/homelab/config/outline/.env
-    ports:
-      - "3000:3000"
-    volumes:
-      - outline_storage:/var/lib/outline/data
-    depends_on:
-      - postgres
-      - redis
-    networks:
-      - outline_internal
-    healthcheck:
-      test: ["CMD", "wget", "-qO-", "http://localhost:3000/_health"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 30s
-
-  postgres:
-    image: postgres:16-alpine
-    container_name: outline-postgres-1
-    restart: unless-stopped
-    env_file:
-      - /opt/homelab/config/outline/.env
-    volumes:
-      - postgres_data:/var/lib/postgresql/data
-    networks:
-      - outline_internal
-    healthcheck:
-      test: ["CMD-SHELL", "pg_isready -U outline -d outline"]
-      interval: 10s
-      timeout: 5s
-      retries: 5
-
-  redis:
-    image: redis:7-alpine
-    container_name: outline-redis-1
-    restart: unless-stopped
-    volumes:
-      - redis_data:/data
-    networks:
-      - outline_internal
-    healthcheck:
-      test: ["CMD", "redis-cli", "ping"]
-      interval: 10s
-      timeout: 5s
-      retries: 3
-
-volumes:
-  outline_storage:
-    external: true
-    name: outline_outline_storage
-  postgres_data:
-    external: true
-    name: outline_postgres_data
-  redis_data:
-    external: true
-    name: outline_redis_data
-
-networks:
-  outline_internal:
-    driver: bridge
-    name: outline_outline_internal
@ -1,40 +0,0 @@
-# Outline Wiki — /opt/homelab/config/outline/.env
-# Both the `outline` and `postgres` containers read this file.
-
-# Application
-URL=https://outline.example.com
-NODE_ENV=production
-PORT=3000
-FILE_STORAGE=local
-FILE_STORAGE_LOCAL_ROOT_DIR=/var/lib/outline/data
-FORCE_HTTPS=true
-
-# Secrets — generate with: openssl rand -hex 32
-SECRET_KEY=
-UTILS_SECRET=
-
-# Database
-DATABASE_URL=postgres://outline:<password>@postgres:5432/outline
-PGSSLMODE=disable
-
-# Redis
-REDIS_URL=redis://redis:6379
-
-# Postgres sidecar vars (read by the postgres container)
-POSTGRES_USER=outline
-POSTGRES_DB=outline
-POSTGRES_PASSWORD=
-
-# Google OAuth (optional)
-GOOGLE_CLIENT_ID=
-GOOGLE_CLIENT_SECRET=
-
-# SMTP
-SMTP_HOST=
-SMTP_PORT=587
-SMTP_USERNAME=
-SMTP_PASSWORD=
-SMTP_FROM_EMAIL=outline@example.com
-SMTP_REPLY_EMAIL=outline@example.com
-SMTP_SECURE=false
-ALLOWED_DOMAINS=
@ -1,15 +0,0 @@
-#!/bin/bash
-# Healthcheck for Outline Wiki stack
-
-if ! docker ps --filter "name=outline-outline-1" --filter "status=running" | grep -q "outline-outline-1"; then
-    echo "[FAIL] outline container is not running"
-    exit 1
-fi
-
-if ! curl -sf http://localhost:3000/_health > /dev/null; then
-    echo "[FAIL] Outline HTTP health endpoint not responding"
-    exit 1
-fi
-
-echo "[OK] Outline is healthy"
-exit 0
@ -1,36 +0,0 @@
-service:
-  name: outline
-  owner_node: vps
-  exposure: public
-  dependencies:
-    - postgres
-    - redis
-  ports:
-    - container: 3000
-      host: 3000
-      protocol: tcp
-  healthcheck:
-    type: http
-    endpoint: http://localhost:3000/_health
-    interval: 30s
-    timeout: 10s
-    retries: 3
-  restart_policy: unless-stopped
-  persistence:
-    paths:
-      # Docker named volumes — data stays at Docker volume paths
-      - volume:outline_outline_storage # /var/lib/outline/data inside container
-      - volume:outline_postgres_data # Postgres data directory
-      - volume:outline_redis_data # Redis persistence
-  runtime:
-    env_file: /opt/homelab/config/outline/.env
-    env_vars:
-      - URL
-      - DATABASE_URL
-      - REDIS_URL
-      - SECRET_KEY
-      - UTILS_SECRET
-      - FILE_STORAGE
-      - POSTGRES_USER
-      - POSTGRES_PASSWORD
-      - POSTGRES_DB
@ -5,6 +5,8 @@ WORKDIR /app
 # No extra dependencies needed beyond standard library for the current script
 # But we might need them if we decide to use libraries later.
 
+RUN useradd -m -u 1000 homelab
+
 COPY src/stability_agent.py .
 COPY healthcheck.sh .
 RUN chmod +x healthcheck.sh
@ -12,5 +14,5 @@ RUN chmod +x healthcheck.sh
 # Create the expected directories
 RUN mkdir -p /opt/homelab/state /opt/homelab/events
 
-# Run the agent
+USER homelab
 CMD ["python", "stability_agent.py"]
@ -2,6 +2,9 @@ services:
   stability-agent:
     build: .
     container_name: stability-agent
+    user: "1000:1000"
+    group_add:
+      - "999"
     restart: unless-stopped
     volumes:
       - /opt/homelab:/opt/homelab