Orch emits structured logs and KV-backed metrics that can be scraped by any log aggregation or monitoring pipeline to detect when the agent fleet is degraded.
## Multi-Agent Degradation Alert
### What it measures
Every sync tick (~45 s) the engine iterates all configured agents and checks whether each one is degraded. An agent is degraded when:
- It is in agent-level cooldown (e.g. repeated failures, credit exhaustion, silence detection), or
- All of its configured model pools are individually in model-level cooldown (every model across the `simple`, `medium`, `complex`, and `review` tiers is cooled).
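The per-tick check above can be sketched as follows. The data shapes (`model_pools`, `cooled_models`) and the model names are illustrative assumptions, not orch's internal types:

```python
# Sketch of the per-tick degradation predicate described above.
# Data shapes and model names are illustrative assumptions.

TIERS = ["simple", "medium", "complex", "review"]

def is_degraded(agent: dict) -> bool:
    """Degraded when the agent itself is in cooldown, or every model
    in every tier's pool is in model-level cooldown."""
    if agent["in_agent_cooldown"]:
        return True
    pools = agent["model_pools"]      # tier -> list of model names
    cooled = agent["cooled_models"]   # set of model names on cooldown
    return all(
        model in cooled
        for tier in TIERS
        for model in pools.get(tier, [])
    )

agents = {
    "claude": {
        "in_agent_cooldown": False,
        "model_pools": {"simple": ["haiku"], "medium": ["sonnet"],
                        "complex": ["opus"], "review": ["sonnet"]},
        "cooled_models": {"haiku", "sonnet", "opus"},  # every pooled model cooled
    },
}
degraded = [name for name, a in agents.items() if is_degraded(a)]
print(degraded)  # ['claude']
```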
### Metrics
| KV key | Type | Written | Description |
|---|---|---|---|
| `metrics:orch.agents_degraded.count` | gauge | every tick | Number of currently degraded agents (0 when healthy) |
| `metrics:orch.agents_degraded.alert` | flag | every tick | `"1"` when count >= 3, `"0"` otherwise |
Read metrics from the SQLite KV store:
```shell
sqlite3 ~/.orch/orch.db \
  "SELECT key, value FROM kv WHERE key LIKE 'metrics:orch.agents_degraded%';"
```

### Log signal
When count >= 3 a WARN-level structured log is emitted with the following fields:
| Field | Example | Description |
|---|---|---|
| `degraded_count` | `3` | Total degraded agent count |
| `degraded_agents` | `["claude", "codex", "opencode"]` | Names of degraded agents |
| `cooled_models` | `claude:[opus,sonnet]; codex:[o3]` | Per-agent list of individually cooled models |
| `cooldown_reasons` | `claude=silence_agent_cooldown; codex=agent_error` | Per-agent cooldown reason string |
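These structured fields can be consumed programmatically. A minimal, hypothetical consumer sketch that parses one emitted line and splits the `"; "`-separated `agent=reason` pairs (the field values below are illustrative):

```python
import json

# Hypothetical consumer: parse one WARN line emitted by orch and split
# the "; "-separated agent=reason pairs. Field values are illustrative.
line = """{"level": "WARN", "message": "multi-agent degradation detected",
 "degraded_count": 3, "degraded_agents": ["claude", "codex", "opencode"],
 "cooled_models": "claude:[opus,sonnet]",
 "cooldown_reasons": "claude=silence_agent_cooldown; codex=agent_error"}"""

event = json.loads(line)
if event.get("message") == "multi-agent degradation detected":
    # cooldown_reasons is a "; "-separated list of agent=reason pairs
    reasons = dict(pair.split("=", 1)
                   for pair in event["cooldown_reasons"].split("; "))
    print(event["degraded_count"], reasons)
```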
Example log line (JSON mode):
```json
{
  "level": "WARN",
  "message": "multi-agent degradation detected",
  "degraded_count": 3,
  "degraded_agents": ["claude", "codex", "opencode"],
  "cooled_models": "claude:[opus,sonnet]",
  "cooldown_reasons": "claude=silence_agent_cooldown; codex=agent_error; opencode=credit_exhaustion_out_of_credits"
}
```

### Suggested alert thresholds
| Threshold | Severity | Suggested action |
|---|---|---|
| `count >= 1` | Info | No action required; one agent recovering is normal |
| `count >= 2` | Warning | Monitor; tasks will still dispatch but at reduced throughput |
| `count >= 3` | Page / Alert | Orch fires the built-in WARN + alert metric. Investigate cooldown reasons. Consider restarting the service or topping up credits. |
| `count == total` | Critical | All agents degraded; no tasks can be dispatched. Immediate action required. |
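An external monitor can map the degraded count onto these tiers. A minimal sketch, where the thresholds come from the table above but the function itself is hypothetical:

```python
def severity(count: int, total: int) -> str:
    """Map degraded-agent count to the suggested severity tiers above."""
    if total and count == total:
        return "critical"   # no agents left to dispatch tasks
    if count >= 3:
        return "page"       # matches orch's built-in alert threshold
    if count >= 2:
        return "warning"
    if count >= 1:
        return "info"
    return "ok"

print(severity(0, 5), severity(1, 5), severity(3, 5), severity(5, 5))
# ok info page critical
```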
### Cooldown reasons reference
| Reason string | Cause | Typical duration |
|---|---|---|
| `agent_error` | Repeated agent failures | 5 min → 15 min → 45 min → 2 h → 4 h (cap) |
| `model_error` | Specific model failure | 5 min → 15 min → 45 min → 2 h → 4 h (cap) |
| `silence_agent_cooldown` | Agent produced no output | 2 min |
| `silence_detected` | Model silently exited (model-level) | 30 min–4 h |
| `credit_exhaustion_out_of_credits` | Per-model credit exhaustion | 1 h → 3 h → 8 h (cap) |
| `credit_exhaustion_org_level_disabled` | Org billing disabled | 2 h → 6 h → 8 h (cap) |
| `billing_cycle_exhausted` | Monthly billing cycle quota | 24 h (flat, no backoff) |
| `persisted` | Cooldown loaded from previous run | Remaining original duration |
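The escalating schedules above can be modeled as a capped lookup by consecutive-failure count. The sketch below covers only the `agent_error` / `model_error` schedule and is an assumption about the shape, not orch's internal implementation:

```python
# Sketch of the escalating agent_error / model_error schedule above
# (5 min -> 15 min -> 45 min -> 2 h -> 4 h cap). Internals are assumed.

SCHEDULE_MIN = [5, 15, 45, 120, 240]  # minutes; last entry is the cap

def cooldown_minutes(consecutive_failures: int) -> int:
    """Duration for the Nth consecutive failure (1-indexed), capped."""
    idx = min(consecutive_failures, len(SCHEDULE_MIN)) - 1
    return SCHEDULE_MIN[max(idx, 0)]

print([cooldown_minutes(n) for n in range(1, 8)])
# [5, 15, 45, 120, 240, 240, 240]
```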
### PagerDuty / Alertmanager example
For systems that ingest orch's log stream (e.g. via journald, Vector, or Grafana Loki):
```yaml
# Grafana Loki alert rule (example)
- alert: OrchMultiAgentDegradation
  expr: |
    count_over_time(
      {job="orch"} |= "multi-agent degradation detected" [5m]
    ) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "3+ orch agents are simultaneously degraded"
    description: "Check orch logs for the cooldown_reasons field to identify the root cause."
```

Or poll the KV metric directly:
```shell
# Returns "1" when alert is active
sqlite3 ~/.orch/orch.db \
  "SELECT value FROM kv WHERE key = 'metrics:orch.agents_degraded.alert';"
```
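The same poll can be done from Python's standard library; a minimal sketch assuming the `kv(key, value)` schema implied by the queries in this document:

```python
import os
import sqlite3

# Default DB path per the commands shown above.
DB = os.path.expanduser("~/.orch/orch.db")

def alert_active(db_path: str = DB) -> bool:
    """True when the built-in degradation alert flag is set to "1"."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT value FROM kv WHERE key = ?",
            ("metrics:orch.agents_degraded.alert",),
        ).fetchone()
    finally:
        con.close()
    return row is not None and row[0] == "1"
```

A cron job or sidecar can call `alert_active()` on an interval and page when it flips to true.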