Gabriel Koerich Orch

Orch emits structured logs and KV-backed metrics that can be scraped by any log aggregation or monitoring pipeline to detect when the agent fleet is degraded.

Multi-Agent Degradation Alert

What it measures

On every sync tick (~45 s) the engine iterates over all configured agents and checks whether each one is degraded. An agent is considered degraded when:

  • It is in agent-level cooldown (e.g. repeated failures, credit exhaustion, silence detection), or
  • All of its configured model pools are individually in model-level cooldown (every model across simple, medium, complex, and review tiers is cooled).
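The two conditions above amount to a simple predicate. A minimal sketch in Python — the `Agent` shape and field names here are hypothetical illustrations, not Orch's actual internals (the real check walks the simple/medium/complex/review tiers; this sketch flattens them into one set):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    in_agent_cooldown: bool = False           # agent-level cooldown flag
    models: set = field(default_factory=set)  # all models across all tiers
    cooled_models: set = field(default_factory=set)

def is_degraded(agent: Agent) -> bool:
    # Degraded if the whole agent is cooling down, or every one of its
    # configured models is individually in model-level cooldown.
    return agent.in_agent_cooldown or (
        bool(agent.models) and agent.cooled_models >= agent.models
    )

agents = [
    Agent("claude", in_agent_cooldown=True),
    Agent("codex", models={"o3"}, cooled_models={"o3"}),
    Agent("opencode", models={"gpt"}),
]
degraded = [a.name for a in agents if is_degraded(a)]
print(degraded)  # ['claude', 'codex']
```

An agent with at least one warm model ("opencode" above) is not counted, which is why a partially cooled fleet can still dispatch work.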

Metrics

| KV key | Type | Written | Description |
| --- | --- | --- | --- |
| metrics:orch.agents_degraded.count | gauge | every tick | Number of currently degraded agents (0 when healthy) |
| metrics:orch.agents_degraded.alert | flag | every tick | "1" when count >= 3, "0" otherwise |

Read metrics from the SQLite KV store:

```sh
sqlite3 ~/.orch/orch.db \
  "SELECT key, value FROM kv WHERE key LIKE 'metrics:orch.agents_degraded%';"
```
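The same read works programmatically. A sketch using Python's stdlib `sqlite3`, run here against a throwaway database built with the same kv(key, value) layout (stand-in for ~/.orch/orch.db):

```python
import os
import sqlite3
import tempfile

# Build a disposable KV store shaped like Orch's, for demonstration only.
db_path = os.path.join(tempfile.mkdtemp(), "orch.db")
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO kv VALUES (?, ?)",
        [
            ("metrics:orch.agents_degraded.count", "2"),
            ("metrics:orch.agents_degraded.alert", "0"),
        ],
    )

# The actual read: same LIKE filter as the shell one-liner above.
with sqlite3.connect(db_path) as conn:
    metrics = dict(
        conn.execute(
            "SELECT key, value FROM kv WHERE key LIKE 'metrics:orch.agents_degraded%'"
        ).fetchall()
    )
print(metrics["metrics:orch.agents_degraded.count"])  # 2
```

Note that values come back as strings, so compare against "1"/"0" or cast before doing arithmetic.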

Log signal

When count >= 3, a WARN-level structured log is emitted with the following fields:

| Field | Example | Description |
| --- | --- | --- |
| degraded_count | 3 | Total degraded agent count |
| degraded_agents | ["claude", "codex", "opencode"] | Names of degraded agents |
| cooled_models | claude:[opus,sonnet]; codex:[o3] | Per-agent list of individually cooled models |
| cooldown_reasons | claude=silence_agent_cooldown; codex=agent_error | Per-agent cooldown reason string |

Example log line (JSON mode):

```json
{
  "level": "WARN",
  "message": "multi-agent degradation detected",
  "degraded_count": 3,
  "degraded_agents": ["claude", "codex", "opencode"],
  "cooled_models": "claude:[opus,sonnet]",
  "cooldown_reasons": "claude=silence_agent_cooldown; codex=agent_error; opencode=credit_exhaustion_out_of_credits"
}
```
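When triaging, the semicolon-delimited cooldown_reasons string splits cleanly into a per-agent map. A small parsing sketch (assuming the "agent=reason; agent=reason" shape shown in the example log line):

```python
import json

# Example log line from above, verbatim.
log_line = """{
  "level": "WARN",
  "message": "multi-agent degradation detected",
  "degraded_count": 3,
  "degraded_agents": ["claude", "codex", "opencode"],
  "cooled_models": "claude:[opus,sonnet]",
  "cooldown_reasons": "claude=silence_agent_cooldown; codex=agent_error; opencode=credit_exhaustion_out_of_credits"
}"""

event = json.loads(log_line)
# Split "agent=reason; agent=reason" into {agent: reason}.
reasons = dict(
    pair.split("=", 1) for pair in event["cooldown_reasons"].split("; ")
)
print(reasons["codex"])  # agent_error
```

This makes it easy to route alerts differently per reason (e.g. page on agent_error, but just notify on billing_cycle_exhausted).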

Suggested alert thresholds

| Threshold | Severity | Suggested action |
| --- | --- | --- |
| count >= 1 | Info | No action required; a single agent recovering is normal |
| count >= 2 | Warning | Monitor; tasks still dispatch, but at reduced throughput |
| count >= 3 | Page / Alert | Orch emits the built-in WARN log and sets the alert metric. Investigate the cooldown reasons; consider restarting the service or topping up credits. |
| count == total | Critical | All agents degraded; no tasks can be dispatched. Immediate action required. |
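The threshold table maps directly to a small classifier, useful when wiring the gauge into your own alerting. A sketch — the function name and severity labels are illustrative, not part of Orch:

```python
def severity(count: int, total: int) -> str:
    """Map a degraded-agent count onto the suggested alert severity.

    `total` is the number of configured agents; checks run most-severe first.
    """
    if total and count == total:
        return "critical"  # nothing can dispatch
    if count >= 3:
        return "page"      # matches Orch's built-in WARN + alert flag
    if count >= 2:
        return "warning"
    if count >= 1:
        return "info"
    return "ok"

print(severity(0, 5), severity(1, 5), severity(3, 5), severity(5, 5))
# ok info page critical
```

Ordering matters: count == total must be checked before count >= 3, otherwise a fully degraded three-agent fleet would only page instead of going critical.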

Cooldown reasons reference

| Reason string | Cause | Typical duration |
| --- | --- | --- |
| agent_error | Repeated agent failures | 5 min → 15 min → 45 min → 2 h → 4 h (cap) |
| model_error | Specific model failure | 5 min → 15 min → 45 min → 2 h → 4 h (cap) |
| silence_agent_cooldown | Agent produced no output | 2 min |
| silence_detected | Model silently exited (model-level) | 30 min – 4 h |
| credit_exhaustion_out_of_credits | Per-model credit exhaustion | 1 h → 3 h → 8 h (cap) |
| credit_exhaustion_org_level_disabled | Org billing disabled | 2 h → 6 h → 8 h (cap) |
| billing_cycle_exhausted | Monthly billing-cycle quota | 24 h (flat, no backoff) |
| persisted | Cooldown loaded from a previous run | Remainder of original duration |
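The escalating durations for agent_error and model_error form a capped backoff ladder. A sketch of that schedule — the step values come straight from the table, while the function name and 1-indexed-failure convention are illustrative (other reasons use their own ladders):

```python
# Cooldown ladder for agent_error / model_error, in minutes:
# 5 min -> 15 min -> 45 min -> 2 h -> 4 h (cap)
LADDER = [5, 15, 45, 120, 240]

def cooldown_minutes(consecutive_failures: int) -> int:
    """Cooldown for the Nth consecutive failure (1-indexed), capped at 4 h."""
    idx = min(consecutive_failures, len(LADDER)) - 1
    return LADDER[max(idx, 0)]

print([cooldown_minutes(n) for n in range(1, 8)])
# [5, 15, 45, 120, 240, 240, 240]
```

The cap means a persistently failing agent settles into a steady 4 h cycle rather than backing off indefinitely.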

PagerDuty / Alertmanager example

For systems that ingest orch's log stream (e.g. via journald, Vector, or Grafana Loki):

```yaml
# Grafana Loki alert rule (example)
- alert: OrchMultiAgentDegradation
  expr: |
    count_over_time(
      {job="orch"} |= "multi-agent degradation detected" [5m]
    ) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "3+ orch agents are simultaneously degraded"
    description: "Check orch logs for cooldown_reasons field to identify the root cause."
```

Or poll the KV metric directly:

```sh
# Returns "1" when alert is active
sqlite3 ~/.orch/orch.db \
  "SELECT value FROM kv WHERE key = 'metrics:orch.agents_degraded.alert';"
```
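For a cron-style probe, the flag can be turned into a boolean (or an exit code). A hedged sketch, assuming the same kv(key, value) schema — the demo runs against a throwaway database standing in for ~/.orch/orch.db:

```python
import os
import sqlite3
import tempfile

def alert_active(db_path: str) -> bool:
    """True when metrics:orch.agents_degraded.alert is '1' in the KV store."""
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT value FROM kv WHERE key = 'metrics:orch.agents_degraded.alert'"
        ).fetchone()
    return bool(row) and row[0] == "1"

# Demo against a disposable DB with the alert flag set.
path = os.path.join(tempfile.mkdtemp(), "orch.db")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute("INSERT INTO kv VALUES ('metrics:orch.agents_degraded.alert', '1')")
print(alert_active(path))  # True
# In a cron wrapper: sys.exit(1 if alert_active(path) else 0)
```

A missing row is treated the same as "0", so the probe stays quiet on a fresh database that has never written metrics.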