services/planner-agent/src/planner.py: - PlannerAgent: async Redis pub/sub on health_events + world_updates - Pipeline: receive event → cooldown gate → LLMRouter → write pending action → emit remediation_started filesystem event - CooldownTracker: 5-min suppression per svc_key (configurable via env) - parse_event(): accepts node-agent shape A and world_updates shape B - PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response - SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human, disk_pressure always notify, confidence<0.7 → requires_human) - write_pending_action(): atomic tmp→rename write, executor-compatible format - emit_event(): async wrapper around filesystem event write (no control-plane import) - _emit_event_sync() reads NODE_NAME at call time (not import) for testability - Benign events (service_healthy, node_online, ...) silently skipped - LLM chain failure: no cooldown recorded so next event can retry services/planner-agent/tests/test_planner.py (49 tests, 0 network): - TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence) - TestHealthEvent, TestActionProposal, TestMapActionToExecutorType - TestParseEvent: both event shapes, missing fields, timestamp formats - TestBuildMessages: system prompt rules, payload inclusion - TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/ notify proposals, remediation event emission, LLM failure isolation, requires_human propagation, cooldown recording, model name in proposal - TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node - TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path services/planner-agent/service.yaml: owner_node: solaria, dependencies: [redis, ollama] services/planner-agent/docker-compose.yml: env + healthcheck services/planner-agent/Dockerfile: python:3.11-slim services/planner-agent/healthcheck.sh: heartbeat file age check (300s) services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
29 lines
788 B
Bash
29 lines
788 B
Bash
#!/bin/sh
|
|
# Healthcheck: verify the planner-agent heartbeat is fresh.
|
|
# The planner touches /opt/homelab/state/planner-agent.heartbeat
|
|
# at the top of every poll cycle (≤5 s intervals).
|
|
# We fail if it is older than 300 s (5 min = one full cooldown window).
|
|
|
|
HEARTBEAT_FILE="${RUNTIME_PATH:-/opt/homelab}/state/planner-agent.heartbeat"
|
|
MAX_AGE_SECONDS=300
|
|
|
|
if [ ! -f "$HEARTBEAT_FILE" ]; then
|
|
echo "FAIL: heartbeat file missing: $HEARTBEAT_FILE"
|
|
exit 1
|
|
fi
|
|
|
|
NOW=$(date +%s)
|
|
FILE_TIME=$(stat -c %Y "$HEARTBEAT_FILE" 2>/dev/null) || {
|
|
echo "FAIL: cannot stat heartbeat file"
|
|
exit 1
|
|
}
|
|
AGE=$((NOW - FILE_TIME))
|
|
|
|
if [ "$AGE" -gt "$MAX_AGE_SECONDS" ]; then
|
|
echo "FAIL: heartbeat stale (${AGE}s > ${MAX_AGE_SECONDS}s)"
|
|
exit 1
|
|
fi
|
|
|
|
echo "OK: heartbeat age ${AGE}s"
|
|
exit 0
|