Compare commits

No commits in common. "master" and "main" have entirely different histories.
master ... main

16 changed files with 247 additions and 1280 deletions

@@ -1,43 +0,0 @@
---
name: deploy
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
---
Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
## Targets
| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
## Invocation
```bash
scripts/deploy/deploy.sh <target> # full pipeline
scripts/deploy/deploy.sh <target> --dry-run # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate # emergency: bypass tests
```
## Exit Code Handling
| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or roll back. |
| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
## Rules
- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
- Canonical branch is `master` — preflight enforces this.
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
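
The exit-code table above lends itself to a small lookup. The following is a minimal Python sketch rather than anything from the repository: `build_command`, `describe_exit`, and the fallback for unlisted codes are hypothetical, and the action strings only paraphrase the table.

```python
from typing import List, Tuple

# Targets and exit codes as documented above for scripts/deploy/deploy.sh.
VALID_TARGETS = {"control-plane", "vps", "piha", "solaria", "chelsty-infra"}

EXIT_ACTIONS = {
    0: ("success", "report target, commit hash, gate status, verify status, elapsed time"),
    1: ("preflight failed", "fix the upstream issue; never bypass"),
    2: ("gate failed", "show the failing test or build; do not deploy"),
    3: ("execute failed", "show full deploy output; ask whether to investigate or roll back"),
    4: ("verify failed", "show docker ps output; discuss rollback"),
    5: ("sudo handoff", "print the manual command from stderr verbatim and stop"),
}


def build_command(target: str, dry_run: bool = False, no_gate: bool = False) -> List[str]:
    """Build the argv for the single permitted entry point."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown target: {target}")
    cmd = ["scripts/deploy/deploy.sh", target]
    if dry_run:
        cmd.append("--dry-run")
    if no_gate:
        # Only when the user explicitly asked for emergency/bypass mode.
        cmd.append("--no-gate")
    return cmd


def describe_exit(code: int) -> Tuple[str, str]:
    """Map an exit code to (meaning, required action).

    The fallback for unlisted codes is an assumption of this sketch.
    """
    return EXIT_ACTIONS.get(code, ("unknown", "show the output and stop"))
```

A caller would pass the result of `build_command` to `subprocess.run` and feed the return code to `describe_exit`.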

@@ -1,65 +0,0 @@
---
name: save-session
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
---
**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
Never invoke proactively. Never invoke mid-task.
## 1. Determine Session Boundary
1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
2. Fallback if no previous entry exists: 24 hours ago.
## 2. Collect Facts (deterministic only — no invention)
Run exactly:
```bash
# All commits since boundary
git --no-pager log --oneline <boundary>..HEAD
# Changed file summary
git --no-pager diff --stat <boundary>..HEAD
```
From the visible conversation transcript: deploys run and their outcomes, test results seen.
## 3. Write the Session Entry
**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
Never overwrite existing content.
```markdown
## Session HH:MM
### Commits
<output of git log --oneline>
### Files changed
<output of git diff --stat>
### Deploys
<list from transcript, or "None recorded">
### Narrative
> _user-provided summary_
```
The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
## 4. What NOT to Touch
- `backlog.md` — only on explicit "update backlog" instruction
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
- Any other file not listed above
## 5. Commit
Stage and commit **only** the session file:
```bash
git add docs/sessions/YYYY-MM-DD.md
git commit -m "docs: session YYYY-MM-DD HH:MM"
```
No other files. No `git add -A`.
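
The entry format and the append-only rule from sections 3 and 5 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: `render_session_entry` and `append_entry` are hypothetical helpers, and the inputs are assumed to be the raw outputs of the two `git` commands in section 2.

```python
from pathlib import Path
from typing import List

# The narrative placeholder is mandatory and is never filled in by the tool.
PLACEHOLDER = "> _user-provided summary_"


def render_session_entry(hhmm: str, commits: str, diff_stat: str, deploys: List[str]) -> str:
    """Render one session entry in the layout shown in section 3."""
    deploy_text = "\n".join(f"- {d}" for d in deploys) if deploys else "None recorded"
    lines = [
        f"## Session {hhmm}",
        "### Commits",
        commits.strip(),
        "### Files changed",
        diff_stat.strip(),
        "### Deploys",
        deploy_text,
        "### Narrative",
        PLACEHOLDER,
        "",
    ]
    return "\n".join(lines)


def append_entry(path: Path, entry: str) -> None:
    """Append to the day's file, creating it if needed; never overwrite."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
```

Opening the file in append mode keeps earlier sessions from the same day intact, which is the property the skill insists on.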

@@ -1,81 +0,0 @@
---
name: worktree-aware
description: >
  Use when working in a git worktree checkout for a parallel agent task.
  The presence of an .agent-task file in the current working directory indicates
  a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
  to the assigned task branch, NEVER push origin master, NEVER touch the main
  checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
  completion, report the branch name verbatim and stop; the human merges via
  scripts/dev/agent.sh.
---
## When this applies
- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
In the main checkout these rules do not apply.
## Reading the marker
`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```
Always read this file first before taking any action.
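
Reading the marker and checking the branch before a commit could look like the sketch below. It is a minimal illustration, not code from the repository: `parse_agent_task` and `on_assigned_branch` are hypothetical names, and the parser assumes the flat `key: value` layout shown above rather than full YAML.

```python
from typing import Dict


def parse_agent_task(text: str) -> Dict[str, str]:
    """Parse the flat `key: value` lines of an .agent-task marker.

    Splits on the first colon only, so values such as the
    created_utc timestamp keep their own colons.
    """
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        fields[key.strip()] = value.strip()
    return fields


def on_assigned_branch(current_branch: str, marker: Dict[str, str]) -> bool:
    """True only when the checked-out branch is the one named in the marker."""
    assigned = marker.get("branch", "")
    return bool(assigned) and current_branch == assigned
```

If `on_assigned_branch` returns `False`, the rules below say to stop and report the discrepancy rather than commit.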
## Rules
1. **Commit only to your branch.**
   Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
   If it does not, stop immediately and report the discrepancy.
2. **Push only to your branch.**
   The only permitted push is `git push origin task/<name>`.
   NEVER `git push origin master` or any other branch.
3. **Do not touch the main checkout.**
   `~/homelab-codex-ws/` is the main checkout (deploy-only, owned by the human).
   Do not read from, write to, or execute commands inside it.
4. **Stay scoped.**
   Only change files directly related to your assigned task.
   If you notice other problems, report them in your final summary as separate follow-up proposals.
   Do not fix them in this worktree.
5. **Never `git add -A`.**
   Always stage specific files by name: `git add path/to/file`.
6. **Do not manage worktrees.**
   Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
   Worktree lifecycle is the human's responsibility.
7. **Final report before stopping.**
   When the task is done, provide a structured report containing:
   - Files changed (path and one-line summary of change)
   - Tests run and results
   - All commit hashes on the task branch
   - **Branch name verbatim** (copy-paste ready)
   - Follow-up items as bulleted proposals for separate tasks
## Definition of Done
- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
- Test suite passes
- Branch pushed: `git push origin task/<name>`
- Full report delivered in conversation
## What you do NOT do
- Merge branches
- Create or push tags
- Run deploys or healthchecks against production nodes
- Delete branches or worktrees
- Modify files in other worktrees
- Push to `origin master` under any circumstances

@@ -180,15 +180,3 @@ Before any new or changed service is considered ready:
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
## Multi-agent worktree mode
`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
If `.agent-task` exists in your current working directory, you are in a task worktree.
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
Agents never invoke these — only the human does.

@@ -1,321 +1,270 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher # deploy.sh - Staged deployment framework for homelab nodes.
# Usage: deploy.sh <target> [--dry-run] [--no-gate]
# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
set -uo pipefail set -o pipefail
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)" # --- Configuration ---
SSH_USER="${SSH_USER:-oskar}" export RUNTIME_PATH="/opt/homelab"
START_TIME=$(date +%s) export STATE_DIR="${RUNTIME_PATH}/state/deploy"
TARGET="" export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
DRY_RUN=false export REPO_PATH="${HOME}/homelab-codex-ws"
NO_GATE=false export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
usage() { # --- Initialization ---
cat >&2 <<'EOF' mkdir -p "$STATE_DIR" "$LOG_DIR"
Usage: deploy.sh <target> [--dry-run] [--no-gate]
Targets: # Redirection for logging
control-plane observer/supervisor/executor/operator-ui on VPS exec > >(tee -a "$LOG_FILE") 2>&1
vps all VPS GitOps services
piha PIHA services
solaria SOLARIA compute services
chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)
Flags: # --- Load Libraries ---
--dry-run run preflight + gate only; stop before deploy LIB_PATH="${REPO_PATH}/scripts/lib"
--no-gate skip pytest + docker build (emergency only; logged as WARNING) source "${LIB_PATH}/log.sh"
source "${LIB_PATH}/state.sh"
source "${LIB_PATH}/inventory.sh"
source "${LIB_PATH}/compose.sh"
source "${LIB_PATH}/diagnostics.sh"
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo) # --- CLI Parsing ---
EOF TARGET_HOST=$(hostname)
exit 1 TARGET_SERVICE=""
} RESUME=false
REQUESTED_STAGE=""
while [[ $# -gt 0 ]]; do while [[ $# -gt 0 ]]; do
case $1 in case $1 in
control-plane|vps|piha|solaria|chelsty-infra) --host)
TARGET="$1"; shift ;; TARGET_HOST="$2"
--dry-run) shift 2
DRY_RUN=true; shift ;; ;;
--no-gate) --service)
NO_GATE=true; shift ;; TARGET_SERVICE="$2"
-h|--help) shift 2
usage ;; ;;
--resume)
RESUME=true
shift
;;
--stage)
REQUESTED_STAGE="$2"
shift 2
;;
*) *)
echo "Unknown argument: $1" >&2 if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
usage ;; REQUESTED_STAGE="$1"
fi
shift
;;
esac esac
done done
[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; } # --- Stages ---
case "$TARGET" in stage_prepare() {
control-plane) SSH_HOST="vps" ;; local host=$1
*) SSH_HOST="$TARGET" ;; if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
esac log "INFO" "Skipping PREPARE (already complete)"
return 0
case "$TARGET" in
chelsty-*) SSH_TIMEOUT=30 ;;
*) SSH_TIMEOUT=5 ;;
esac
# ── PREFLIGHT ────────────────────────────────────────────────────────────────
preflight() {
echo "=== PREFLIGHT ==="
local branch
branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
if [[ "$branch" != "master" ]]; then
echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
exit 1
fi fi
echo "[ok] branch: master"
if ! git -C "$REPO_ROOT" diff --quiet; then log "INFO" "Stage: PREPARE ($host)"
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2 set_stage "prepare"
exit 1
fi emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
exit 1
fi
echo "[ok] working tree clean"
git -C "$REPO_ROOT" fetch origin master --quiet cd "$REPO_PATH" || exit 1
local unpushed log "INFO" "Pulling latest changes..."
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline) if ! git pull; then
if [[ -n "$unpushed" ]]; then log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
echo "ERROR: Unpushed commits on master:" >&2
echo "$unpushed" >&2
echo "Push first: git push origin master" >&2
exit 1
fi fi
echo "[ok] no unpushed commits"
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..." # Ensure runtime directories exist
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \ mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2 struct_log "prepare" "$host" "all" "success" "repo_updated"
exit 1 mark_stage_complete "prepare"
fi
echo "[ok] ${SSH_HOST} reachable"
} }
# ── GATE ───────────────────────────────────────────────────────────────────── stage_validate() {
local host=$1
gate() { if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
if [[ "$NO_GATE" == "true" ]]; then log "INFO" "Skipping VALIDATE (already complete)"
echo "=== GATE: SKIPPED ==="
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
return 0 return 0
fi fi
echo "=== GATE ===" log "INFO" "Stage: VALIDATE ($host)"
set_stage "validate"
local services=() for service in "${SERVICES[@]}"; do
log "INFO" "Validating $service..."
if [[ "$TARGET" == "control-plane" ]]; then if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
services=("control-plane") log "ERROR" "Service definition not found: $service"
else struct_log "validate" "$host" "$service" "fail" "not_found"
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml" return 1
if [[ ! -f "$svc_yaml" ]]; then
echo "ERROR: ${svc_yaml} not found." >&2
exit 2
fi
local svc_list
svc_list=$(python3 -c "
import yaml
with open('${svc_yaml}') as f:
data = yaml.safe_load(f)
svcs = data.get('services', {})
if isinstance(svcs, dict):
print('\n'.join(svcs.keys()))
elif isinstance(svcs, list):
print('\n'.join(svcs))
")
while IFS= read -r svc; do
[[ -z "$svc" ]] && continue
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
services+=("$svc")
fi
done <<< "$svc_list"
fi
if [[ ${#services[@]} -eq 0 ]]; then
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
return 0
fi
echo "Services under gate: ${services[*]}"
local gate_failed=false
for svc in "${services[@]}"; do
local svc_dir="${REPO_ROOT}/services/${svc}"
if [[ -d "${svc_dir}/tests" ]]; then
echo "--- pytest: ${svc} ---"
if ! python3 -m pytest "${svc_dir}/tests" -q; then
echo "GATE FAIL: pytest failed for ${svc}" >&2
gate_failed=true
fi
fi
echo "--- docker build: ${svc} ---"
if ! docker build --quiet "${svc_dir}" >/dev/null; then
echo "GATE FAIL: docker build failed for ${svc}" >&2
gate_failed=true
fi fi
done done
if [[ "$gate_failed" == "true" ]]; then struct_log "validate" "$host" "all" "success" "validated"
exit 2 mark_stage_complete "validate"
fi
echo "[ok] gate passed"
} }
# ── EXECUTE ────────────────────────────────────────────────────────────────── stage_deploy() {
local host=$1
execute() { if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
echo "=== EXECUTE ===" log "INFO" "Skipping DEPLOY (already complete)"
return 0
local cmd_output
local cmd_exit=0
if [[ "$TARGET" == "control-plane" ]]; then
echo "Running deploy-control-plane.sh --ssh..."
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|| cmd_exit=$?
else
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
"${SSH_USER}@${SSH_HOST}" \
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|| cmd_exit=$?
fi fi
echo "$cmd_output" log "INFO" "Stage: DEPLOY ($host)"
set_stage "deploy"
if echo "$cmd_output" | grep -qF "[sudo] password"; then local last_s=$(get_last_service)
echo "" >&2 local skip=false
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2 if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
echo "Run manually:" >&2 skip=true
if [[ "$TARGET" == "control-plane" ]]; then
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
else
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
fi
exit 5
fi fi
if [[ $cmd_exit -ne 0 ]]; then for service in "${SERVICES[@]}"; do
echo "ERROR: Deploy command exited ${cmd_exit}." >&2 if [[ "$skip" == "true" ]]; then
exit 3 if [[ "$service" == "$last_s" ]]; then
fi skip=false
log "INFO" "Resuming from $service..."
echo "[ok] execute completed" else
} log "INFO" "Skipping $service (already processed)"
continue
# ── VERIFY ───────────────────────────────────────────────────────────────────
verify() {
echo "=== VERIFY ==="
local ps_output
local ps_exit=0
ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
"${SSH_USER}@${SSH_HOST}" \
'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
|| ps_exit=$?
if [[ $ps_exit -ne 0 ]]; then
echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
echo "$ps_output" >&2
exit 4
fi
echo "$ps_output"
local failed=false
local not_up
not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
if [[ -n "$not_up" ]]; then
echo "ERROR: Containers not in Up state:" >&2
echo "$not_up" >&2
failed=true
fi
local unhealthy
unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
if [[ -n "$unhealthy" ]]; then
echo "ERROR: Unhealthy containers:" >&2
echo "$unhealthy" >&2
failed=true
fi
if [[ "$TARGET" == "control-plane" ]]; then
for cp_svc in supervisor observer executor operator-ui; do
if ! echo "$ps_output" | grep -q "$cp_svc"; then
echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
failed=true
fi fi
done fi
fi
if [[ "$failed" == "true" ]]; then log "INFO" "Deploying $service..."
echo "" >&2 set_last_service "$service"
echo "Full docker ps output above." >&2
exit 4
fi
echo "[ok] all containers healthy" if ! run_compose_up "$service"; then
struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
collect_diagnostics "$host" "$service"
return 1
fi
struct_log "deploy" "$host" "$service" "success" "deployed"
done
set_last_service ""
mark_stage_complete "deploy"
} }
# ── REPORT ─────────────────────────────────────────────────────────────────── stage_verify() {
local host=$1
report() { if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
local mode="${1:-deploy}" log "INFO" "Skipping VERIFY (already complete)"
local end_time return 0
end_time=$(date +%s)
local elapsed
elapsed=$(( end_time - START_TIME ))
local commit_hash
commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
local gate_s verify_s
if [[ "$NO_GATE" == "true" ]]; then
gate_s="skip"
else
gate_s="ok"
fi fi
if [[ "$mode" == "dry-run" ]]; then log "INFO" "Stage: VERIFY ($host)"
verify_s="skip(dry-run)" set_stage "verify"
else
verify_s="green"
fi
echo "" for service in "${SERVICES[@]}"; do
if [[ "$mode" == "dry-run" ]]; then log "INFO" "Verifying $service..."
echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s" local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
else if [[ -f "$health_script" ]]; then
echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s" if ! bash "$health_script"; then
fi log "ERROR" "Healthcheck failed for $service"
struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
collect_diagnostics "$host" "$service"
return 1
fi
else
# Generic check if container is running
if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
log "ERROR" "Container $service is not running"
struct_log "verify" "$host" "$service" "fail" "container_not_running"
collect_diagnostics "$host" "$service"
return 1
fi
fi
struct_log "verify" "$host" "$service" "success" "verified"
done
mark_stage_complete "verify"
} }
# ── MAIN ───────────────────────────────────────────────────────────────────── stage_complete() {
local host=$1
log "INFO" "Stage: COMPLETE ($host)"
set_stage "complete"
struct_log "complete" "$host" "all" "success" "deployment_finished"
clear_deployment_state
}
preflight # --- Execution Logic ---
gate
if [[ "$DRY_RUN" == "true" ]]; then run_deployment() {
report dry-run local start_stage=$1
exit 0
# Sequential execution from start_stage
case "$start_stage" in
prepare)
stage_prepare "$TARGET_HOST" || return 1
;&
validate)
stage_validate "$TARGET_HOST" || return 1
;&
deploy)
stage_deploy "$TARGET_HOST" || return 1
;&
verify)
stage_verify "$TARGET_HOST" || return 1
;&
complete)
stage_complete "$TARGET_HOST" || return 1
;;
*)
log "ERROR" "Invalid stage: $start_stage"
return 1
;;
esac
}
# --- Main ---
log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
log "ERROR" "Failed to load inventory"
exit 1
fi fi
execute EXIT_STATUS=0
verify if [[ "$RESUME" == "true" ]]; then
report CURRENT=$(get_stage)
log "INFO" "Resuming from state: $CURRENT"
case "$CURRENT" in
prepare|validate|deploy|verify)
run_deployment "$CURRENT" || EXIT_STATUS=1
;;
complete|none)
log "INFO" "No interrupted deployment found. Starting from scratch..."
run_deployment "prepare" || EXIT_STATUS=1
;;
*)
log "INFO" "Unknown state. Starting from prepare..."
run_deployment "prepare" || EXIT_STATUS=1
;;
esac
elif [[ -n "$REQUESTED_STAGE" ]]; then
if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
else
run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
fi
else
# New deployment - clear previous state
clear_deployment_state
run_deployment "prepare" || EXIT_STATUS=1
fi
if [[ $EXIT_STATUS -eq 0 ]]; then
print_summary "$TARGET_HOST" "SUCCESS"
log "INFO" "--- Homelab Deployment Finished Successfully ---"
else
print_summary "$TARGET_HOST" "FAILED"
log "ERROR" "--- Homelab Deployment Failed ---"
exit 1
fi
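
The new `run_deployment` relies on bash's `;&` fall-through: entering the `case` at one stage runs that stage and every later one, stopping at the first failure. The Python sketch below illustrates that control flow only; the `handlers` mapping is a hypothetical stand-in for the `stage_*` functions, and the `diagnose` stage is left out because the script handles it separately.

```python
from typing import Callable, Dict, List

# Stage order as it appears in the case statement of run_deployment.
STAGES = ["prepare", "validate", "deploy", "verify", "complete"]


def stages_from(start: str) -> List[str]:
    """Return the start stage and every stage after it."""
    if start not in STAGES:
        raise ValueError(f"Invalid stage: {start}")
    return STAGES[STAGES.index(start):]


def run_deployment(start: str, handlers: Dict[str, Callable[[], bool]]) -> bool:
    """Run stages in order from `start`; stop at the first one that fails."""
    for stage in stages_from(start):
        if not handlers[stage]():
            return False
    return True
```

With `--resume`, the script passes the stage recorded in the state file as `start`, so completed stages are not repeated.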

@@ -1,361 +0,0 @@
#!/usr/bin/env bash
# Multi-agent worktree manager.
# EXIT: 0 ok, 1 preflight, 2 operation failed.
set -euo pipefail
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
RESERVED_NAMES=(master main HEAD list merge clean new)
MAX_WORKTREES=4
die() { echo "ERROR: $1" >&2; exit "${2:-2}"; }  # $1 = message, $2 = optional exit code (kept out of the message)
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
# ── helpers ──────────────────────────────────────────────────────────────────
is_main_checkout() {
local git_dir common_dir
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
[ "$git_dir" = "$common_dir" ]
}
require_main_checkout() {
is_main_checkout || prefail "must run from the main checkout, not a worktree"
}
require_master_branch() {
local branch
branch=$(git rev-parse --abbrev-ref HEAD)
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
}
require_clean_tree() {
local dirty
dirty=$(git status --porcelain)
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
}
worktree_paths() {
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
local main_path
main_path=$(git rev-parse --show-toplevel)
git worktree list --porcelain \
| awk '/^worktree /{p=$2} /^$/{print p}' \
| grep -v "^${main_path}$" \
|| true
}
worktree_count() {
worktree_paths | wc -l
}
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
age_str() {
local created_utc="$1"
local now_ts created_ts diff_s
now_ts=$(date -u +%s)
# replace T with a space for `date -d`; the trailing Z is left in place
created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
diff_s=$(( now_ts - created_ts ))
if (( diff_s < 60 )); then echo "${diff_s}s"
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
else echo "$(( diff_s/86400 ))d"
fi
}
validate_name() {
local name="$1"
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
fi
for r in "${RESERVED_NAMES[@]}"; do
if [ "$name" = "$r" ]; then
prefail "'$name' is a reserved word"
fi
done
}
# ── subcommands ───────────────────────────────────────────────────────────────
cmd_new() {
local name="${1:-}"
[ -n "$name" ] || { usage; exit 1; }
validate_name "$name"
require_main_checkout
require_master_branch
require_clean_tree
# worktree limit
local count
count=$(worktree_count)
if (( count >= MAX_WORKTREES )); then
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
cmd_list
exit 1
fi
# branch collision
if branch_exists_local "task/$name"; then
prefail "branch task/$name already exists locally"
fi
git fetch origin master --quiet
if branch_exists_remote "refs/heads/task/$name"; then
prefail "branch task/$name already exists on origin"
fi
# directory collision
local main_path wt_path
main_path=$(git rev-parse --show-toplevel)
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
# create worktree
git worktree add -b "task/$name" "$wt_path" origin/master \
|| die "git worktree add failed"
# write marker
local parent_commit
parent_commit=$(git rev-parse origin/master)
cat > "$wt_path/.agent-task" <<EOF
task: $name
branch: task/$name
parent_commit: $parent_commit
created_utc: $(utc_now)
worktree_path: $wt_path
EOF
echo ""
echo "Worktree created: $wt_path"
echo "Branch: task/$name"
echo ""
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push origin master.\""
echo "─────────────────────────────────────────────────────────────────────────────"
}
cmd_list() {
local main_path
main_path=$(git rev-parse --show-toplevel)
# fetch to get up-to-date ahead/behind
git fetch origin master --quiet 2>/dev/null || true
local paths
paths=$(worktree_paths)
if [ -z "$paths" ]; then
echo "(no active task worktrees)"
return
fi
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
while IFS= read -r wt_path; do
[ -z "$wt_path" ] && continue
local marker="$wt_path/.agent-task"
local task_name branch parent_commit created_utc
if [ -f "$marker" ]; then
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
else
task_name="(no marker)"
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
parent_commit="?"
created_utc=""
fi
local status="clean"
local dirty
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
[ -n "$dirty" ] && status="dirty"
local ahead behind ab
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
ab="+${ahead}/-${behind}"
local age=""
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
local short_parent="${parent_commit:0:7}"
local short_created="${created_utc:0:10}"
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
done <<< "$paths"
}
cmd_merge() {
local name="${1:-}"
[ -n "$name" ] || { usage; exit 1; }
require_main_checkout
require_master_branch
require_clean_tree
git fetch origin --quiet
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
local main_path wt_path
main_path=$(git rev-parse --show-toplevel)
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
# attempt ff-only merge
local merge_failed=0
git merge --ff-only "task/$name" || merge_failed=1
if (( merge_failed )); then
# abort any partial merge state
git merge --abort 2>/dev/null || true
echo ""
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
echo " The branch has likely diverged from master." >&2
echo "" >&2
echo "Diagnose with:" >&2
echo " git log master..task/$name # commits only on task branch" >&2
echo " git log task/$name..master # commits master has that task doesn't" >&2
echo "" >&2
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
echo "Worktree and branch are preserved — no changes made." >&2
exit 2
fi
echo "Merged task/$name into master (fast-forward)."
git push origin master || die "git push origin master failed"
echo "Pushed master to origin."
if [ -d "$wt_path" ]; then
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
echo "Removed worktree: $wt_path"
else
echo "(worktree directory $wt_path not found — skipping worktree remove)"
fi
git branch -d "task/$name" || die "git branch -d task/$name failed"
echo "Deleted local branch task/$name."
git push origin --delete "task/$name" 2>/dev/null \
&& echo "Deleted remote branch task/$name." \
|| echo "(remote branch task/$name not found — nothing to delete)"
echo ""
echo "Done. task/$name merged and cleaned up."
}
cmd_clean() {
local main_path
main_path=$(git rev-parse --show-toplevel)
git fetch origin --quiet 2>/dev/null || true
local to_remove=()
# orphaned registered worktrees: branch deleted or fully merged into master
local paths
paths=$(worktree_paths)
while IFS= read -r wt_path; do
[ -z "$wt_path" ] && continue
local branch
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
# branch gone locally?
if ! branch_exists_local "$branch"; then
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
continue
fi
# branch fully merged into master?
local ahead
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
if [ "$ahead" = "0" ]; then
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
fi
done <<< "$paths"
# dangling directories: ../homelab-codex-ws-* not registered
local registered_paths
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
local parent_dir
parent_dir=$(dirname "$main_path")
while IFS= read -r candidate; do
[ -d "$candidate" ] || continue
if ! echo "$registered_paths" | grep -qF "$candidate"; then
to_remove+=("dangling:$candidate")
fi
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
if [ ${#to_remove[@]} -eq 0 ]; then
echo "Nothing to clean."
return 0
fi
echo "Found ${#to_remove[@]} item(s) to clean:"
for entry in "${to_remove[@]}"; do
echo " $entry"
done
echo ""
local overall_rc=0
for entry in "${to_remove[@]}"; do
local kind="${entry%%:*}"
local path="${entry#*:}"
# strip trailing annotation in parens
local raw_path
raw_path="${path%% (*}"
local confirm
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
if [[ "$confirm" =~ ^[Yy]$ ]]; then
if [ "$kind" = "worktree" ]; then
git worktree remove --force "$raw_path" 2>/dev/null \
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
else
rm -rf "$raw_path"
fi
echo " Removed."
else
echo " Skipped."
fi
done
return $overall_rc
}
usage() {
cat <<'EOF'
Usage: agent.sh <subcommand> [args]
agent.sh new <name> Create a new task worktree (branch task/<name>)
agent.sh list List active task worktrees with status
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
agent.sh clean Remove orphaned or dangling worktrees (interactive)
EXIT: 0 ok, 1 preflight, 2 operation failed.
EOF
}
# ── dispatch ──────────────────────────────────────────────────────────────────
SUBCOMMAND="${1:-}"
shift || true
case "$SUBCOMMAND" in
new) cmd_new "$@" ;;
list) cmd_list "$@" ;;
merge) cmd_merge "$@" ;;
clean) cmd_clean "$@" ;;
*) usage; exit 1 ;;
esac
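
The name check in `validate_name` and the bucketing in `age_str` are small enough to restate compactly. The Python sketch below mirrors the regular expression, the reserved-word list, and the thresholds from the script above; the function names and the use of a plain seconds argument are choices of this sketch, not of the script.

```python
import re

# Same reserved words and pattern as RESERVED_NAMES / validate_name in agent.sh.
RESERVED_NAMES = {"master", "main", "HEAD", "list", "merge", "clean", "new"}
NAME_RE = re.compile(r"[a-z][a-z0-9-]*")


def is_valid_task_name(name: str) -> bool:
    """Lowercase letter first, then lowercase letters, digits, or hyphens; not reserved."""
    return NAME_RE.fullmatch(name) is not None and name not in RESERVED_NAMES


def age_str(diff_s: int) -> str:
    """Format an age in seconds using the largest whole unit, as agent.sh does."""
    if diff_s < 60:
        return f"{diff_s}s"
    if diff_s < 3600:
        return f"{diff_s // 60}m"
    if diff_s < 86400:
        return f"{diff_s // 3600}h"
    return f"{diff_s // 86400}d"
```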

@ -17,24 +17,6 @@ def _atomic_write_json(path: Path, data) -> None:
         os.fsync(f.fileno())
     os.replace(tmp, path)
-def _parse_ts(ts) -> float:
-    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.
-    Events from node-agent use int(time.time()); events from stability-agent / events.py
-    use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
-    last_occurrence and resolved_at, so any arithmetic on them must go through here.
-    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
-    """
-    if ts is None:
-        return 0.0
-    if isinstance(ts, (int, float)):
-        return float(ts)
-    try:
-        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
-    except Exception:
-        return 0.0
 # Constants and Paths
 RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
 EVENTS_DIR = Path(RUNTIME_PATH) / "events"
@ -202,66 +184,32 @@ class Observer:
         now = time.time()
-        try:
-            # Collect incident_ids currently referenced by any service entry.
-            linked_ids: set = {
-                svc.get("incident_id")
-                for svc in self.world_state["services"].values()
-                if svc.get("incident_id")
-            }
-            # Case 1 — service is healthy but still points at an active incident.
-            # process_event already calls _resolve_incident on service_healthy events,
-            # but if the observer restarted with on-disk state where the link was
-            # intact (inconsistency from a pre-atomic-write crash), it may not get
-            # resolved until the next service_healthy event is processed. Resolve
-            # immediately — a healthy service cannot have an ongoing incident.
-            for svc_key, svc in self.world_state["services"].items():
-                if svc.get("status") != "healthy":
-                    continue
+        # Auto-resolve active incidents for services that are currently healthy
+        # and whose last_occurrence is older than 30 minutes. These are phantom
+        # incidents created by race-condition reads of truncated state files; they
+        # never receive a service_recovered event because the service was healthy
+        # all along.
+        for svc_key, svc in self.world_state["services"].items():
+            if svc.get("status") == "healthy":
                 inc_id = svc.get("incident_id")
-                if not inc_id:
-                    continue
-                inc = self.world_state["incidents"].get(inc_id, {})
-                if inc.get("status") == "active":
-                    logger.info(
-                        f"Auto-resolving incident {inc_id} for {svc_key}: "
-                        f"service is healthy"
-                    )
-                    inc["status"] = "resolved"
-                    inc["resolved_at"] = now
-                    svc["incident_id"] = None
-                    linked_ids.discard(inc_id)
-            # Case 2 — orphaned active incident: no service entry links to it and
-            # last_occurrence is older than 5 minutes (guard against creation races).
-            # These are the stale records left behind when on-disk state was
-            # inconsistent: the service entry had incident_id cleared but incidents.json
-            # still had the record as "active".
-            for inc_id, inc in self.world_state["incidents"].items():
-                if inc.get("status") != "active":
-                    continue
-                if inc_id in linked_ids:
-                    continue
-                age = now - _parse_ts(inc.get("last_occurrence"))
-                if age > 300:  # 5-minute guard
-                    logger.info(
-                        f"Auto-resolving orphaned incident {inc_id} "
-                        f"(service={inc.get('service')}, node={inc.get('node')}): "
-                        f"no service references it, age={int(age)}s"
-                    )
-                    inc["status"] = "resolved"
-                    inc["resolved_at"] = now
-        except Exception as exc:
-            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
+                if inc_id and inc_id in self.world_state["incidents"]:
+                    inc = self.world_state["incidents"][inc_id]
+                    last_occ = inc.get("last_occurrence") or 0
+                    if (inc.get("status") == "active"
+                            and (now - last_occ) > 1800):
+                        logger.info(
+                            f"Auto-resolving stale incident {inc_id} for {svc_key}: "
+                            f"service healthy, last_occurrence >{int((now - last_occ) / 60)}min ago"
+                        )
+                        inc["status"] = "resolved"
+                        inc["resolved_at"] = now
+                        svc["incident_id"] = None
         # Remove resolved incidents older than 7 days.
-        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
         stale_incidents = [
             k for k, v in self.world_state["incidents"].items()
             if v.get("status") == "resolved"
-            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
+            and (now - (v.get("resolved_at") or now)) > 7 * 86400
         ]
         for k in stale_incidents:
             del self.world_state["incidents"][k]
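
The master-side code in the hunk above normalises timestamps before subtracting them, because last_occurrence may hold an epoch number or an ISO-8601 string, and subtracting a string from a float raises TypeError. A self-contained sketch of that normalisation together with the 5-minute orphan guard follows. The names parse_ts and resolve_orphans, the fixed value of NOW, and the sample records are illustrative assumptions, not the repository's code.

```python
from datetime import datetime


def parse_ts(ts) -> float:
    """Normalise an epoch number or ISO-8601 string to epoch seconds; 0.0 if unusable."""
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        # A trailing "Z" is rewritten to an explicit UTC offset before parsing.
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except ValueError:
        return 0.0


def resolve_orphans(services: dict, incidents: dict, now: float,
                    guard_s: float = 300.0) -> list:
    """Resolve active incidents that no service links to and that are older than guard_s."""
    linked = {
        svc.get("incident_id")
        for svc in services.values()
        if svc.get("incident_id")
    }
    resolved = []
    for inc_id, inc in incidents.items():
        if inc.get("status") != "active" or inc_id in linked:
            continue
        if now - parse_ts(inc.get("last_occurrence")) > guard_s:
            inc["status"] = "resolved"
            inc["resolved_at"] = now
            resolved.append(inc_id)
    return resolved


# Hypothetical sample state; NOW is a fixed epoch value chosen for reproducibility.
NOW = 1_800_000_000.0
SERVICES = {"vps/outline": {"status": "unhealthy", "incident_id": "inc-a"}}
INCIDENTS = {
    "inc-a": {"status": "active", "last_occurrence": NOW - 3600},  # still linked
    "inc-b": {"status": "active", "last_occurrence": "2026-06-01T00:03:22Z"},  # orphan, ISO string
    "inc-c": {"status": "active", "last_occurrence": NOW - 60},  # orphan, inside the guard
}

if __name__ == "__main__":
    print(resolve_orphans(SERVICES, INCIDENTS, NOW))
```

Returning 0.0 for missing or unparseable values makes such records look very old, so they fall outside the guard window and are resolved instead of lingering as active.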


@ -20,5 +20,4 @@ ENV RUNTIME_PATH=/opt/homelab
 ENV PYTHONUNBUFFERED=1
 # Default command (will be overridden in docker-compose)
-USER homelab
 CMD ["python", "src/operator_ui.py"]


@ -39,24 +39,10 @@ for dir in "${DIRS[@]}"; do
     fi
 done
-# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
-echo "Checking /opt/homelab ownership..."
-_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
-if [[ -n "$_chown_needed" ]]; then
-    echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
-    sudo chown -R 1000:1000 /opt/homelab
-else
-    echo "Ownership already correct, skipping chown"
-fi
-echo "Checking /opt/homelab directory permissions..."
-_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
-if [[ -n "$_chmod_needed" ]]; then
-    echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
-    sudo chmod -R 775 /opt/homelab 2>/dev/null || true
-else
-    echo "Permissions already correct, skipping chmod"
-fi
+# 3. chown/chmod for UID 1000
+echo "Setting permissions for UID 1000 on /opt/homelab..."
+sudo chown -R 1000:1000 /opt/homelab
+sudo chmod -R 775 /opt/homelab 2>/dev/null || true
 # 4. Run docker compose up -d --build --force-recreate
 echo "--- Starting Control Plane Services ---"


@ -56,9 +56,6 @@ services:
   executor:
     build: .
     container_name: control-plane-executor
-    user: "1000:1000"
-    group_add:
-      - "999"
     command: python src/executor.py
     volumes:
       - /opt/homelab:/opt/homelab


@ -147,18 +147,12 @@ def current_deployments():
 def current_incidents():
-    """Return active incidents as a list sorted most-recent-first.
-    Only incidents with status='active' are returned; resolved and cancelled
-    records are excluded so the dashboard reflects the current operational state.
-    """
+    """Return incidents as a list sorted most-recent-first."""
     raw = read_json_file(WORLD_DIR / "incidents.json", default={})
     if isinstance(raw, list):
-        return [i for i in raw if i.get("status") == "active"]
+        return raw
     result = []
     for inc in raw.values():
-        if inc.get("status") != "active":
-            continue
         # Synthesise a human-readable message if not stored (observer doesn't set one).
         if "message" not in inc:
             inc = dict(inc)
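
The master-side docstring above says that only incidents with status='active' are returned, whether incidents.json holds a list or a dict keyed by incident id. A minimal sketch of that filter follows; the name active_incidents and the sample records are hypothetical, and sorting and message synthesis are left out.

```python
def active_incidents(raw) -> list:
    """Return only records whose status is 'active', from a list or an id-to-record dict."""
    records = raw if isinstance(raw, list) else list(raw.values())
    return [inc for inc in records if inc.get("status") == "active"]


# Hypothetical sample payloads in both accepted shapes.
AS_LIST = [
    {"id": "inc-1", "status": "active"},
    {"id": "inc-2", "status": "resolved"},
]
AS_DICT = {
    "inc-1": {"id": "inc-1", "status": "active"},
    "inc-3": {"id": "inc-3", "status": "cancelled"},
}

if __name__ == "__main__":
    print(active_incidents(AS_LIST))
    print(active_incidents(AS_DICT))
```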


@ -1,333 +0,0 @@
"""Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
from __future__ import annotations

import json
import sys
import time
from pathlib import Path

import pytest

# Observer lives outside the control-plane package; add scripts/ to path.
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
from observer.observer import Observer, _parse_ts, _atomic_write_json

# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _make_observer(tmp_path: Path) -> Observer:
    """Return an Observer with all runtime paths redirected to tmp_path."""
    import observer.observer as obs_mod
    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"
    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)
    # Minimal topology so inventory isn't empty (avoids prune-guard early-return)
    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
    )
    original_world = obs_mod.WORLD_DIR
    original_state = obs_mod.STATE_DIR
    original_events = obs_mod.EVENTS_DIR
    original_logs = obs_mod.LOGS_DIR
    original_inventory = obs_mod.INVENTORY_TOPOLOGY
    original_repo = obs_mod.REPO_ROOT
    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo
    obs = Observer()
    # Restore module-level constants (monkeypatching at module level is sufficient
    # for the Observer instance which captures paths at construction time via globals)
    obs_mod.WORLD_DIR = original_world
    obs_mod.STATE_DIR = original_state
    obs_mod.EVENTS_DIR = original_events
    obs_mod.LOGS_DIR = original_logs
    obs_mod.INVENTORY_TOPOLOGY = original_inventory
    obs_mod.REPO_ROOT = original_repo
    return obs


def _make_observer_simple(tmp_path: Path):
    """Return an Observer instance and patch its world_state in-place."""
    import observer.observer as obs_mod
    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"
    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)
    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
    )
    # Patch before construction
    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo
    obs = Observer()
    return obs


# ---------------------------------------------------------------------------
# 1. _parse_ts — timestamp normalisation
# ---------------------------------------------------------------------------


def test_parse_ts_int():
    ts = int(time.time()) - 3600
    assert abs(_parse_ts(ts) - ts) < 1


def test_parse_ts_float():
    ts = time.time() - 100.5
    assert abs(_parse_ts(ts) - ts) < 0.01


def test_parse_ts_iso_string():
    # ISO format as emitted by events.py / stability-agent
    from datetime import datetime, timezone
    iso = "2026-06-01T00:03:22Z"
    expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
    result = _parse_ts(iso)
    assert result > 0
    assert isinstance(result, float)
    assert abs(result - expected) < 1


def test_parse_ts_none_returns_zero():
    assert _parse_ts(None) == 0.0


def test_parse_ts_garbage_returns_zero():
    assert _parse_ts("not-a-date") == 0.0


def test_parse_ts_zero_int():
    assert _parse_ts(0) == 0.0


# ---------------------------------------------------------------------------
# 2. Lifecycle: service_healthy event resolves linked incident
# ---------------------------------------------------------------------------


def test_service_healthy_resolves_active_incident(tmp_path):
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-111-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "node": "vps", "service": "outline",
        "status": "active", "trigger_type": "service_unhealthy",
        "started_at": int(time.time()) - 600,
        "last_occurrence": int(time.time()) - 600,
        "occurrence_count": 1, "events": [],
    }
    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })
    assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
    """service_healthy for service A must not touch incident for service B."""
    obs = _make_observer_simple(tmp_path)
    inc_b = "inc-222-vps-supervisor"
    obs.world_state["services"]["vps/supervisor"] = {
        "node": "vps", "service": "supervisor",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_b,
    }
    obs.world_state["incidents"][inc_b] = {
        "id": inc_b, "status": "active",
        "last_occurrence": int(time.time()) - 300,
    }
    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",  # different service
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })
    assert obs.world_state["incidents"][inc_b]["status"] == "active"


# ---------------------------------------------------------------------------
# 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
# ---------------------------------------------------------------------------


def test_prune_resolves_healthy_linked_incident(tmp_path):
    """If a service is healthy but still points at an active incident, resolve it."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-333-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy",  # <-- healthy but incident_id still set
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "started_at": int(time.time()) - 7200,
        "last_occurrence": int(time.time()) - 7200,
    }
    obs._prune_stale_world()
    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
    """Healthy-linked incident with ISO-string last_occurrence must still resolve."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-444-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy", "last_check": None, "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",  # ISO string from events.py
    }
    obs._prune_stale_world()  # must not raise TypeError
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


# ---------------------------------------------------------------------------
# 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
# ---------------------------------------------------------------------------


def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
    """Orphaned active incident older than 5 min must be auto-resolved."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-555-vps-supervisor"
    # No service entry links to this incident
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
        "last_occurrence": int(time.time()) - 400,  # 6.7 min ago
    }
    obs._prune_stale_world()
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
    """Orphaned incident younger than 5 min must stay active (guard against race)."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-666-vps-supervisor"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 60,  # 1 min ago — within guard
    }
    obs._prune_stale_world()
    assert obs.world_state["incidents"][inc_id]["status"] == "active"


def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
    """Orphaned incident with ISO-string last_occurrence must resolve correctly."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-777-vps-outline"
    # ISO timestamp well in the past (2026-06-01)
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",
    }
    obs._prune_stale_world()  # must not raise TypeError
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_does_not_touch_linked_incident(tmp_path):
    """An active incident still linked from a non-healthy service must stay active."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-888-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy",  # <-- still unhealthy
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 3600,
    }
    obs._prune_stale_world()
    assert obs.world_state["incidents"][inc_id]["status"] == "active"


# ---------------------------------------------------------------------------
# 5. 7-day stale incident prune with ISO resolved_at
# ---------------------------------------------------------------------------


def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
    """Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-old-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": "2026-05-01T00:00:00Z",  # >7 days before 2026-06-03
    }
    obs._prune_stale_world()
    assert inc_id not in obs.world_state["incidents"]


def test_prune_keeps_recently_resolved_incident(tmp_path):
    """Resolved incidents within 7 days must be kept."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-recent-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": time.time() - 86400,  # 1 day ago
    }
    obs._prune_stale_world()
    assert inc_id in obs.world_state["incidents"]


@ -14,11 +14,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 # pyyaml : may be needed for reading host config snippets
 RUN pip install --no-cache-dir "docker>=6.0" psutil pyyaml
-RUN useradd -m -u 1000 homelab
 COPY src/ /app/src/
 ENV PYTHONUNBUFFERED=1
-USER homelab
 CMD ["python", "src/node_agent.py"]


@ -2,9 +2,6 @@ services:
   node-agent:
     build: .
     container_name: node-agent
-    user: "1000:1000"
-    group_add:
-      - "999"
     restart: unless-stopped
     environment:


@ -5,8 +5,6 @@ WORKDIR /app
 # No extra dependencies needed beyond standard library for the current script
 # But we might need them if we decide to use libraries later.
-RUN useradd -m -u 1000 homelab
 COPY src/stability_agent.py .
 COPY healthcheck.sh .
 RUN chmod +x healthcheck.sh
@ -14,5 +12,5 @@ RUN chmod +x healthcheck.sh
 # Create the expected directories
 RUN mkdir -p /opt/homelab/state /opt/homelab/events
-USER homelab
+# Run the agent
 CMD ["python", "stability_agent.py"]


@ -2,9 +2,6 @@ services:
   stability-agent:
     build: .
     container_name: stability-agent
-    user: "1000:1000"
-    group_add:
-      - "999"
     restart: unless-stopped
     volumes:
       - /opt/homelab:/opt/homelab