Oskar Kapala
dcacac6965
fix(planner-agent): rename OLLAMA_HOST → OLLAMA_API_BASE (litellm convention)
...
LiteLLM reads OLLAMA_API_BASE, not OLLAMA_HOST.
- llm_router.py: DEFAULT_OLLAMA_HOST → DEFAULT_OLLAMA_API_BASE, param ollama_host → ollama_api_base
- planner.py: env var os.getenv("OLLAMA_HOST") → os.getenv("OLLAMA_API_BASE"), param renamed accordingly
- /opt/homelab/config/planner-agent/.env on SOLARIA updated in-place (not in git)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:34:08 +02:00
Oskar Kapala
5ccdfa0ca6
docs: add planner-agent docs and session summary 2026-05-27
...
- services/planner-agent/README.md: full service doc (what it does,
LLM fallback chain, env vars, deploy steps, local run, redis-cli
end-to-end test, healthcheck)
- README.md: add Agent System section with all agents and their roles
- docs/sessions/2026-05-27-planner-agent.md: session summary (built
files, architectural decisions, problems + solutions, deployment
status, pending work)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 22:35:59 +02:00
Oskar Kapala
ff6fda1f04
planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
...
All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME,
COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local
/opt/homelab/config/planner-agent/.env via env_file.
Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret
injected at runtime by the operator when needed).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 22:27:44 +02:00
Oskar Kapala
ca37fca5ce
feat(planner-agent): main loop with LLM routing and HITL action proposals
...
services/planner-agent/src/planner.py:
- PlannerAgent: async Redis pub/sub on health_events + world_updates
- Pipeline: receive event → cooldown gate → LLMRouter → write pending action
→ emit remediation_started filesystem event
- CooldownTracker: 5-min suppression per svc_key (configurable via env)
- parse_event(): accepts node-agent shape A and world_updates shape B
- PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response
- SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human,
disk_pressure always notify, confidence<0.7 → requires_human)
- write_pending_action(): atomic tmp→rename write, executor-compatible format
- emit_event(): async wrapper around filesystem event write (no control-plane import)
- _emit_event_sync() reads NODE_NAME at call time (not import) for testability
- Benign events (service_healthy, node_online, ...) silently skipped
- LLM chain failure: no cooldown recorded so next event can retry
services/planner-agent/tests/test_planner.py (49 tests, 0 network):
- TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence)
- TestHealthEvent, TestActionProposal, TestMapActionToExecutorType
- TestParseEvent: both event shapes, missing fields, timestamp formats
- TestBuildMessages: system prompt rules, payload inclusion
- TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/
notify proposals, remediation event emission, LLM failure isolation,
requires_human propagation, cooldown recording, model name in proposal
- TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node
- TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path
services/planner-agent/service.yaml:
owner_node: solaria, dependencies: [redis, ollama]
services/planner-agent/docker-compose.yml: env + healthcheck
services/planner-agent/Dockerfile: python:3.11-slim
services/planner-agent/healthcheck.sh: heartbeat file age check (300s)
services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 19:11:39 +02:00
Oskar Kapala
1bbc511bb7
feat(planner-agent): add llm_router.py with local-first fallback chain
...
services/planner-agent/src/llm_router.py:
- LLMRouter: async routing via litellm; chain = Qwen/Ollama → haiku → sonnet
- Timeouts: 8s local, 30s cloud; asyncio.wait_for belt-and-suspenders
- Rejection triggers: timeout, API error, refusal patterns, JSON schema fail
- JSON fence extraction: recovers valid JSON from blocks
- ModelMetrics: per-model success/fallback/error counters + success_rate()
- Redis publish to 'llm_router_metrics' after every call (failure-safe)
- redis_url=None disables Redis (useful in tests / edge nodes)
- context= param adds caller label to all log lines for tracing
services/planner-agent/tests/test_llm_router.py:
- 34 tests, 0 network calls (litellm + Redis fully mocked)
- Covers: primary success, JSON error fallback, refusal fallback,
timeout fallback, API exception fallback, all-fail RuntimeError,
schema validation, fence extraction, metrics recording, Redis publish,
Redis failure isolation
services/planner-agent/requirements.txt:
- litellm>=1.40.0, redis>=5.0.0, jsonschema>=4.21.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:38:06 +02:00