homelab-codex-ws

Author	SHA1	Message	Date
Oskar Kapala	5ccdfa0ca6	docs: add planner-agent docs and session summary 2026-05-27 - services/planner-agent/README.md: full service doc (what it does, LLM fallback chain, env vars, deploy steps, local run, redis-cli end-to-end test, healthcheck) - README.md: add Agent System section with all agents and their roles - docs/sessions/2026-05-27-planner-agent.md: session summary (built files, architectural decisions, problems + solutions, deployment status, pending work) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 22:35:59 +02:00
Oskar Kapala	ff6fda1f04	planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME, COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local /opt/homelab/config/planner-agent/.env via env_file. Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret injected at runtime by the operator when needed). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 22:27:44 +02:00
Oskar Kapala	ca37fca5ce	feat(planner-agent): main loop with LLM routing and HITL action proposals services/planner-agent/src/planner.py: - PlannerAgent: async Redis pub/sub on health_events + world_updates - Pipeline: receive event → cooldown gate → LLMRouter → write pending action → emit remediation_started filesystem event - CooldownTracker: 5-min suppression per svc_key (configurable via env) - parse_event(): accepts node-agent shape A and world_updates shape B - PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response - SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human, disk_pressure always notify, confidence<0.7 → requires_human) - write_pending_action(): atomic tmp→rename write, executor-compatible format - emit_event(): async wrapper around filesystem event write (no control-plane import) - _emit_event_sync() reads NODE_NAME at call time (not import) for testability - Benign events (service_healthy, node_online, ...) silently skipped - LLM chain failure: no cooldown recorded so next event can retry services/planner-agent/tests/test_planner.py (49 tests, 0 network): - TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence) - TestHealthEvent, TestActionProposal, TestMapActionToExecutorType - TestParseEvent: both event shapes, missing fields, timestamp formats - TestBuildMessages: system prompt rules, payload inclusion - TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/notify proposals, remediation event emission, LLM failure isolation, requires_human propagation, cooldown recording, model name in proposal - TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node - TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path services/planner-agent/service.yaml: owner_node: solaria, dependencies: [redis, ollama] services/planner-agent/docker-compose.yml: env + healthcheck services/planner-agent/Dockerfile: python:3.11-slim services/planner-agent/healthcheck.sh: heartbeat file age check (300s) services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 19:11:39 +02:00
Oskar Kapala	1bbc511bb7	feat(planner-agent): add llm_router.py with local-first fallback chain services/planner-agent/src/llm_router.py: - LLMRouter: async routing via litellm; chain = Qwen/Ollama → haiku → sonnet - Timeouts: 8s local, 30s cloud; asyncio.wait_for belt-and-suspenders - Rejection triggers: timeout, API error, refusal patterns, JSON schema fail - JSON fence extraction: recovers valid JSON from fenced code blocks - ModelMetrics: per-model success/fallback/error counters + success_rate() - Redis publish to 'llm_router_metrics' after every call (failure-safe) - redis_url=None disables Redis (useful in tests / edge nodes) - context= param adds caller label to all log lines for tracing services/planner-agent/tests/test_llm_router.py: - 34 tests, 0 network calls (litellm + Redis fully mocked) - Covers: primary success, JSON error fallback, refusal fallback, timeout fallback, API exception fallback, all-fail RuntimeError, schema validation, fence extraction, metrics recording, Redis publish, Redis failure isolation services/planner-agent/requirements.txt: - litellm>=1.40.0, redis>=5.0.0, jsonschema>=4.21.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-27 18:38:06 +02:00
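
Note on 1bbc511bb7: the local-first fallback chain described in that message can be illustrated roughly as follows. This is a minimal sketch, not the contents of llm_router.py. The names `extract_json`, `route`, and the `(name, call, timeout)` chain tuples are hypothetical, and plain async callables stand in for the litellm calls. Refusal-pattern checks, jsonschema validation, and metrics are left out. Each model gets its own `asyncio.wait_for` timeout, and a timeout, an exception, or unparseable output moves on to the next model.

```python
import asyncio
import json
import re
from typing import Awaitable, Callable, List, Optional, Tuple

# Hypothetical stand-in for a model call; the real router goes through litellm.
ModelCall = Callable[[str], Awaitable[str]]

# Build the fence marker programmatically so this listing stays fence-safe.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Optional[dict]:
    """Parse text as a JSON object, falling back to the first fenced block."""
    candidates = [text]
    match = FENCE_RE.search(text)
    if match:
        candidates.append(match.group(1))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


async def route(prompt: str, chain: List[Tuple[str, ModelCall, float]]):
    """Try each (name, call, timeout_seconds) in order; return (name, parsed)."""
    errors = []
    for name, call, timeout in chain:
        try:
            raw = await asyncio.wait_for(call(prompt), timeout=timeout)
        except asyncio.TimeoutError:
            errors.append(f"{name}: timeout")
            continue
        except Exception as exc:  # API errors and similar failures
            errors.append(f"{name}: {exc!r}")
            continue
        parsed = extract_json(raw)
        if parsed is None:
            errors.append(f"{name}: invalid JSON")
            continue
        return name, parsed
    # Mirrors the "all-fail RuntimeError" case listed in the test coverage.
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

With the timeouts from the commit message, a chain would look like `[("local", call_local, 8.0), ("haiku", call_haiku, 30.0), ("sonnet", call_sonnet, 30.0)]`, where the three callables are placeholders for whatever performs the actual model request.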

4 commits
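
Note on ca37fca5ce: the cooldown gate and the atomic tmp→rename write mentioned in that message can be sketched as below. The class and function names are taken from the commit message, but the signatures, the injected `clock` parameter, the `action_id` argument, and the file naming are assumptions made for illustration, not the actual planner.py code.

```python
import json
import os
import tempfile
import time
from pathlib import Path


class CooldownTracker:
    """Suppress repeat proposals for the same service key within a window."""

    def __init__(self, cooldown_seconds: float = 300.0, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests need not sleep (assumption)
        self._last = {}

    def ready(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is None or self._clock() - last >= self.cooldown_seconds

    def record(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def write_pending_action(directory: Path, action_id: str, proposal: dict) -> Path:
    """Write proposal to <directory>/<action_id>.json via a temp file and rename."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{action_id}.json"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(proposal, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Calling `record()` only after a proposal is written would match the commit's note that an LLM chain failure records no cooldown, so the next event can retry.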