homelab-codex-ws/services/planner-agent/healthcheck.sh
Oskar Kapala ca37fca5ce feat(planner-agent): main loop with LLM routing and HITL action proposals
services/planner-agent/src/planner.py:
- PlannerAgent: async Redis pub/sub on health_events + world_updates
- Pipeline: receive event → cooldown gate → LLMRouter → write pending action
  → emit remediation_started filesystem event
- CooldownTracker: 5-min suppression per svc_key (configurable via env)
- parse_event(): accepts node-agent shape A and world_updates shape B
- PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response
- SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human,
  disk_pressure always notify, confidence<0.7 → requires_human)
- write_pending_action(): atomic tmp→rename write, executor-compatible format
- emit_event(): async wrapper around filesystem event write (no control-plane import)
- _emit_event_sync() reads NODE_NAME at call time (not import) for testability
- Benign events (service_healthy, node_online, ...) silently skipped
- LLM chain failure: no cooldown recorded so next event can retry
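
The tmp→rename write noted for write_pending_action() can be illustrated in shell. This is only a sketch of the general pattern, not the Python code in planner.py: the helper name write_atomic, the file name pending-action.json, and the JSON payload are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of planner.py): write the content to a temp
# file in the target's directory, then rename it into place. A rename within
# one filesystem is atomic, so a reader sees either the old file or the
# complete new one, never a half-written file.
write_atomic() {
    target=$1
    content=$2
    dir=$(dirname "$target")
    # Create the temp file next to the target so mv stays on one filesystem.
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    if printf '%s\n' "$content" > "$tmp"; then
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Demo in a scratch directory; the payload shape is an assumption.
demo_dir=$(mktemp -d)
write_atomic "$demo_dir/pending-action.json" '{"action": "notify"}'
cat "$demo_dir/pending-action.json"
rm -rf "$demo_dir"
```

Creating the temp file in the destination directory matters: a rename across filesystems falls back to copy-and-delete, which loses the atomicity.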

services/planner-agent/tests/test_planner.py (49 tests, 0 network):
- TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence)
- TestHealthEvent, TestActionProposal, TestMapActionToExecutorType
- TestParseEvent: both event shapes, missing fields, timestamp formats
- TestBuildMessages: system prompt rules, payload inclusion
- TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/
  notify proposals, remediation event emission, LLM failure isolation,
  requires_human propagation, cooldown recording, model name in proposal
- TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node
- TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path

services/planner-agent/service.yaml:
  owner_node: solaria, dependencies: [redis, ollama]
services/planner-agent/docker-compose.yml: env + healthcheck
services/planner-agent/Dockerfile: python:3.11-slim
services/planner-agent/healthcheck.sh: heartbeat file age check (300s)
services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 19:11:39 +02:00

#!/bin/sh
# Healthcheck: verify the planner-agent heartbeat is fresh.
# The planner touches /opt/homelab/state/planner-agent.heartbeat
# at the top of every poll cycle (≤5 s intervals).
# This check looks for the file under ${RUNTIME_PATH:-/opt/homelab}/state/.
# We fail if it is older than 300 s (5 min = one full cooldown window).
HEARTBEAT_FILE="${RUNTIME_PATH:-/opt/homelab}/state/planner-agent.heartbeat"
MAX_AGE_SECONDS=300

if [ ! -f "$HEARTBEAT_FILE" ]; then
    echo "FAIL: heartbeat file missing: $HEARTBEAT_FILE"
    exit 1
fi

NOW=$(date +%s)
# stat -c %Y (GNU coreutils) prints the mtime as seconds since the epoch.
FILE_TIME=$(stat -c %Y "$HEARTBEAT_FILE" 2>/dev/null) || {
    echo "FAIL: cannot stat heartbeat file"
    exit 1
}

AGE=$((NOW - FILE_TIME))
if [ "$AGE" -gt "$MAX_AGE_SECONDS" ]; then
    echo "FAIL: heartbeat stale (${AGE}s > ${MAX_AGE_SECONDS}s)"
    exit 1
fi

echo "OK: heartbeat age ${AGE}s"
exit 0
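
The age comparison above can be exercised without a running planner. The following is a minimal sketch, assuming GNU coreutils (stat -c and touch -d with a relative date): the function name heartbeat_is_fresh and the scratch file names are hypothetical, and the function takes the path and threshold as arguments instead of reading RUNTIME_PATH.

```shell
#!/bin/sh
# Hypothetical helper: the same age comparison as the healthcheck, with the
# file path and threshold passed in so it can be tested in isolation.
heartbeat_is_fresh() {
    file=$1
    max_age=$2
    [ -f "$file" ] || return 1
    now=$(date +%s)
    # GNU stat syntax, as in the script above; BSD/macOS would need `stat -f %m`.
    mtime=$(stat -c %Y "$file" 2>/dev/null) || return 1
    # Fresh means the age does not exceed the threshold.
    [ $((now - mtime)) -le "$max_age" ]
}

# Demo in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/fresh.heartbeat"
# GNU touch accepts a relative date string; this backdates the mtime by 10 minutes.
touch -d '10 minutes ago' "$demo_dir/stale.heartbeat"

heartbeat_is_fresh "$demo_dir/fresh.heartbeat" 300 && echo "fresh: OK"
heartbeat_is_fresh "$demo_dir/stale.heartbeat" 300 || echo "stale: FAIL"
heartbeat_is_fresh "$demo_dir/missing.heartbeat" 300 || echo "missing: FAIL"
rm -rf "$demo_dir"
```

Backdating a file with touch covers the stale branch without waiting out the 300 s window, and the missing-file case covers the first guard in the script.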