Gabriel Koerich Orch

Daily Review — 2026-06-10

What Shipped (Last 24h)

4 commits landed today, all parser/docs fixes:

CommitDescription
b5c1289afix(parser): add 'healthy' to done status aliases in normalize_status (#3298)
4be9f1eadocs(posts): update daily review for 2026-06-09 evening session (#3294)
79f65f16Self-improvement: debug agent errors and fix root causes (#3293)
e2a8986cbug(parser): normalize_status missing 'NO_SETUPS', 'alerts' (plural), 'not_configured' (#3292)

Service version: v0.80.7 — 3 versions behind v0.80.10 per issue #3297. All parser fixes above are committed to main but not yet deployed. The deployment cycle was not run after any of these merges.

Issues Closed

IssueTitle
#3295bug(parser): normalize_status missing 'healthy' — fixed by b5c1289a
#3291bug(parser): normalize_status missing 'NO_SETUPS', 'alerts', 'not_configured' — fixed by e2a8986c
#3287bug(runner): all-agents-exhausted sets task to 'needs_review' instead of reset
#3286bug(router): router LLM ignores agent-level cooldown in call_router_llm
#3283bug(runner): claude "weekly limit" misclassified as failed
#3281control: split oversized messages into chunks instead of truncating
#3274bug(runner): opencode false-positive rate_limit when agent runs cargo test
#3273bug(parser): normalize_status missing 'noop', 'green', 'waiting', 'alerts_sent'
#3272bug(runner): claude 'session limit' still classified as failed
#3271bug(router): 'ALL AGENTS COOLED' fires when agent is only temporarily cooled
#3268orch commit: generate messages via LLM agent instead of pattern matching
#3267Trim top-level CLI (phase 2): stats group, hide internals
#3264Trim top-level CLI: hide serve, nest cron/events/session
#3263Add /restart command to control plane (Telegram, Discord)
#3259bug(review): empty-branch tasks loop in needs_review
#3256bug(runner): minimax/kimi/glm 429 API errors classified incorrectly

Operational Health

Task Run Summary (Last 24h)

AgentModelSuccessFailedOther
claudesonnet16681 parse_error, 1 rate_limit, 2 blank
opencodenemotron-3-ultra-free28121 rate_limit, 1 timeout
opencodedeepseek-v4-flash-free2131 timeout
opencodemimo-v2.5-free1221 parse_error
opencodenorth-mini-code-free111 parse_error, 1 rate_limit
kimiopus41 rate_limit
minimaxopus4
codexgpt-5.51

Total dispatches (24h): 350 · 127 PR creates · 122 review decisions · 134 review starts · 42 errors · 21 reroutes · 3 timeouts.

Claude/sonnet dominated the workload — 166 successes = 47% of all dispatches. Codex/kimi/minimax remain degraded, pushing extra load onto claude and opencode free-tier.

Agent Pool Health

AgentStatusCooldown RemainingReason
codexDegraded + cooled~1h29mPersisted (billing/usage limit)
kimiDegraded + cooled~13h54mPersisted (billing cycle)
kimi:haikuCooled~13h43mPersisted
minimaxDegraded + cooled~13h31mPersisted (429 code 2056)
claude:haikuCooled~1d14hPersisted
opencode/mimo-v2.5-freeCooled~8h39mAgent error
opencodeRecoveredDegraded flag cleared at 22:57 UTC

Effective routing pool: claude/sonnet (primary workhorse) + opencode free-tier (nemotron-3, deepseek-v4-flash, north-mini-code).

Codex recovery imminent — cooldown expires in ~1.5h. Watch for gpt-5.5 routing to resume correctly.

Key Error Patterns

  1. Service deployment lag (open: #3297) — v0.80.7 is 3 versions behind v0.80.10. Parser fixes for healthy, NO_SETUPS, alerts, not_configured are merged but undeployed. Live service still treats these statuses as unknown, causing false task failures on every health check run. This is the highest-priority operational issue.

  2. Review rebroadcast → Blocked escalation (open: #3296)sync.rs:1382 escalates to Blocked after 5 review refires regardless of whether the escalation is due to agent cooldown. Tasks 153022 and 153015 were blocked while agents would have recovered in 2h. Fix proposed: check agent cooldown before escalating; reset refire counter instead of blocking if agents are temporarily unavailable.

  3. Claude/sonnet failure rate (8/174 = 5%) — Down from 13% in last session. Still handling tasks designed for codex/kimi/minimax. Worth monitoring if failure rate climbs as codex recovers and more complex tasks route through.

  4. Router LLM pool timeout — Committed fix (#3289) eliminates 45s wasted timeout on minimax/haiku during routing. Not yet deployed. Each routing tick that hits the timed-out pool loses 45s before falling back to weighted round-robin. The tick watchdog fired at 79s on this very task's dispatch (23:01 UTC: minimax/haiku → timeout → fallback → claude/sonnet route).

  5. Watchdog stall at task creation — Same pattern as yesterday: daily-review task creation (23:00 UTC) triggered a routing tick that stalled 79s due to router LLM pool timeout. Will be resolved by deploying #3289.

  6. Error log empty/opt/homebrew/var/log/orch.error.log is 0B. No startup panics or unhandled errors in the current service instance.

Stuck / Blocked Tasks

Currently In Progress

TaskTitleAgentStatus
internal:153102Daily review (this task)claude/sonnetin_progress
internal:153103Daily evening retrospectiveclaude/sonnetin_progress
internal:153104Hyperliquid: owner position state monitornew (queued)

Stale Blocked (Not Operational)

  • 148985 — Research: Anthropic prompt framework (37d, blocked, needs human review)
  • 149038 — Research: Monitor USDPT on Solana (36d, blocked, needs human review)
  • ~30+ bean/oblivion security audit findings — Blocked since April. Need human triage to close or retry.

Routing Accuracy

  • LLM routing: Degraded until deployment. Live service still attempts cooled minimax/haiku in LLM pool, timing out at 45s. Fix #3289 merged, undeployed.
  • Weighted round-robin fallback: Working correctly. Selects claude/sonnet (weight 0.2) after LLM pool failure.
  • Cooldown system: Correctly persisting codex/kimi/minimax cooldowns. Opencode degraded flag auto-cleared at 22:57 UTC when rate limit count dropped.
  • Parser normalize_status: healthy alias fix committed (b5c1289a) — live service will continue failing health-check tasks until deployed.

Priorities for Tomorrow (2026-06-11)

  1. DEPLOY IMMEDIATELY — Run the full deployment cycle to pick up v0.80.10 (or latest):

    git push origin main          # ensure nothing local is ahead
    brew update && brew upgrade orch
    brew services restart orch
    orch -V                        # verify version

    This unblocks: parser healthy/NO_SETUPS/alerts fixes, router LLM pool cooldown check (#3289), all-agents-exhausted reset (#3290).

  2. Monitor codex recovery — Cooldown expires in ~1.5h from now (~00:30 UTC Jun 11). Verify gpt-5.5 routes correctly. Watch routing weight restoration.

  3. Fix #3296 (review rebroadcast → Blocked)src/engine/sync.rs:1382 should check agent cooldown before escalating. Tasks blocked Jun 10 while agents would recover in 2h. This is a recurring correctness bug.

  4. Monitor kimi/minimax — Both on ~14h cooldowns. When they recover, verify LLM pool picks them back up and routing weights restore.

  5. Triage stale bean/oblivion blocked tasks — ~30 security audit findings from April sitting blocked. Not actionable by agents — needs human decision to close or retry.


Prepared by internal:153102 (routed claude/sonnet via weighted round-robin fallback after minimax/haiku LLM pool timed out at 45s).

← All updates