observer: service_healthy resolves active incidents
service_healthy is a positive health confirmation: if the service has an active incident (e.g. from earlier service_unhealthy events), that incident should be resolved once the service is confirmed healthy.

Previously only service_recovered resolved incidents; service_healthy set status=healthy but left the incident open, which kept status='degraded'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 46ae92b5c1
commit 28e9534765
@@ -268,8 +268,11 @@ class Observer:
                 # Positive confirmation from node-agent that a managed container
                 # is running. This keeps services.json populated so the supervisor
                 # can correctly detect drift (absent entry = never reported = unknown,
-                # not the same as confirmed missing). No incident resolution needed.
+                # not the same as confirmed missing).
+                # Also resolve any active incident — if a service that had been
+                # unhealthy/crashing is now confirmed healthy, the incident is over.
                 self.world_state["services"][svc_key]["status"] = "healthy"
+                self._resolve_incident(svc_key, timestamp)
             elif etype in ["service_unhealthy", "healthcheck_failed"]:
                 self.world_state["services"][svc_key]["status"] = "unhealthy"
                 self._handle_incident(svc_key, event)
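
The behaviour change in this commit can be illustrated with a minimal, self-contained sketch. Only the names `world_state`, `svc_key`, `etype`, `timestamp`, `_handle_incident` and `_resolve_incident` come from the diff; the `MiniObserver` class, the event dictionary layout, the `incidents` map and the `overall_status` rule are assumptions made for illustration and may not match the real Observer.

```python
class MiniObserver:
    """Toy stand-in for Observer; the internals here are assumed, not the real ones."""

    def __init__(self):
        # Hypothetical layout: per-service status plus a map of open incidents.
        self.world_state = {"services": {}, "incidents": {}}

    def _handle_incident(self, svc_key, event):
        # Assumed behaviour: open an incident unless one is already active.
        self.world_state["incidents"].setdefault(
            svc_key, {"opened_at": event["timestamp"], "resolved_at": None}
        )

    def _resolve_incident(self, svc_key, timestamp):
        # Assumed behaviour: close the active incident; a no-op if none is open.
        incident = self.world_state["incidents"].pop(svc_key, None)
        if incident is not None:
            incident["resolved_at"] = timestamp
        return incident

    def handle_event(self, event):
        svc_key = event["service"]
        etype = event["type"]
        timestamp = event["timestamp"]
        self.world_state["services"].setdefault(svc_key, {})
        if etype == "service_healthy":
            self.world_state["services"][svc_key]["status"] = "healthy"
            # The line added by this commit: a healthy report also ends the incident.
            self._resolve_incident(svc_key, timestamp)
        elif etype in ["service_unhealthy", "healthcheck_failed"]:
            self.world_state["services"][svc_key]["status"] = "unhealthy"
            self._handle_incident(svc_key, event)

    def overall_status(self):
        # Assumed rule: any open incident keeps the overall status degraded.
        return "degraded" if self.world_state["incidents"] else "healthy"


obs = MiniObserver()
obs.handle_event({"service": "web", "type": "service_unhealthy", "timestamp": 1})
obs.handle_event({"service": "web", "type": "service_healthy", "timestamp": 2})
```

Under these assumptions, removing the `_resolve_incident` call from the `service_healthy` branch would leave the incident for `web` open after the second event, which is the stale 'degraded' state the commit message describes.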