observer: service_healthy resolves active incidents

service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.

Previously only service_recovered resolved incidents. service_healthy
set the service's status to "healthy" but left its incident open, and
the open incident kept the status at 'degraded'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Oskar Kapala 2026-05-27 15:20:19 +02:00
parent 46ae92b5c1
commit 28e9534765


@@ -268,8 +268,11 @@ class Observer:
             # Positive confirmation from node-agent that a managed container
             # is running. This keeps services.json populated so the supervisor
             # can correctly detect drift (absent entry = never reported = unknown,
-            # not the same as confirmed missing). No incident resolution needed.
+            # not the same as confirmed missing).
+            # Also resolve any active incident — if a service that had been
+            # unhealthy/crashing is now confirmed healthy, the incident is over.
             self.world_state["services"][svc_key]["status"] = "healthy"
+            self._resolve_incident(svc_key, timestamp)
         elif etype in ["service_unhealthy", "healthcheck_failed"]:
             self.world_state["services"][svc_key]["status"] = "unhealthy"
             self._handle_incident(svc_key, event)