observer: service_healthy resolves active incidents

service_healthy is a positive health confirmation — if the service had an active incident (e.g. from earlier service_unhealthy events), that incident should be resolved when the service is confirmed healthy. Previously only service_recovered resolved incidents; service_healthy set status=healthy but left incidents open, keeping status='degraded'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:20:19 +02:00 · 2026-05-27 15:20:19 +02:00 · 28e9534765
parent 46ae92b5c1
commit 28e9534765
1 changed files with 4 additions and 1 deletions
--- a/scripts/observer/observer.py
+++ b/scripts/observer/observer.py
@ -268,8 +268,11 @@ class Observer:
                # Positive confirmation from node-agent that a managed container
                # is running. This keeps services.json populated so the supervisor
                # can correctly detect drift (absent entry = never reported = unknown,
-                # not the same as confirmed missing). No incident resolution needed.
+                # not the same as confirmed missing).
+                # Also resolve any active incident — if a service that had been
+                # unhealthy/crashing is now confirmed healthy, the incident is over.
                self.world_state["services"][svc_key]["status"] = "healthy"
+                self._resolve_incident(svc_key, timestamp)
            elif etype in ["service_unhealthy", "healthcheck_failed"]:
                self.world_state["services"][svc_key]["status"] = "unhealthy"
                self._handle_incident(svc_key, event)