Gabriel Koerich Orch

Evening Retrospective — 2026-05-30

What Happened Today

Code Changes (4 fixes merged → v0.73.17 and v0.73.18)

Commit   | PR    | Version  | Description
38957922 | #3213 | v0.73.17 | fix(parser): add missing status aliases (changes_made, acknowledged, flat)
e045bcec | #3214 | v0.73.17 | fix(engine): recover stuck in-progress tasks from inactive repos
dcebd594 | #3217 | v0.73.18 | fix(runner): treat ModelUnavailable 'not supported' as permanently gone (7d cooldown)
9f353ee4 | #3216 | v0.73.18 | feat(engine): auto-upgrade service via brew when newer release detected

Two releases shipped back-to-back this evening. All four issues that were open this morning are now closed.

Highlight: Auto-Upgrade Feature (#3216)

The most consequential change today is not a bug fix: it is the auto-upgrade feature, which eliminates the deployment lag that has been the recurring root cause of problems for the past week. When engine.auto_upgrade: true (the default), the hourly version check now runs brew upgrade orch and sends SIGTERM to itself so that launchd restarts the service on the new binary. No operator intervention is required.

This closes the loop on a pattern that has left critical bug fixes undeployed for 5+ days: the fixes for #3200 (false version reporting), #3205 (codex network blocked), #3204 (ThinkingBlockConflict unclassified), and #3197 (arg-parse masking) all spent 1–5 days merged in code but not running in the service. Once the operator upgrades to v0.73.18 and restarts, the auto-upgrade feature will prevent this class of problem going forward.
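The auto-upgrade flow described in this section can be sketched roughly as follows. This is an illustration only, not the engine's actual code: the function names (parse_version, maybe_auto_upgrade, run_brew_upgrade, terminate_self) are hypothetical, and the sketch assumes that versions are plain dotted integers and that a failed brew upgrade leaves the service running on the old binary.

```python
import os
import signal
import subprocess
from typing import Callable, Sequence, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.73.18' or 'v0.73.18' into (0, 73, 18) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def run_brew_upgrade(cmd: Sequence[str]) -> int:
    """Run the upgrade command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def terminate_self() -> None:
    """Send SIGTERM to this process so the supervisor (launchd) can restart it."""
    os.kill(os.getpid(), signal.SIGTERM)


def maybe_auto_upgrade(
    running: str,
    latest: str,
    enabled: bool = True,  # mirrors engine.auto_upgrade: true (the default)
    upgrade: Callable[[Sequence[str]], int] = run_brew_upgrade,
    restart: Callable[[], None] = terminate_self,
) -> bool:
    """Return True when an upgrade ran and a restart was requested."""
    if not enabled:
        return False
    if parse_version(latest) <= parse_version(running):
        return False  # already current
    if upgrade(["brew", "upgrade", "orch"]) != 0:
        return False  # assumption: stay on the old binary if brew fails
    restart()
    return True
```

Comparing parsed tuples rather than raw strings matters here: as strings, "0.73.9" sorts after "0.73.18", which would hide a newer release. The upgrade and restart steps are injected so the decision logic can be exercised without touching brew or signals.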

Fix: Status Aliases (PR #3213, closes #3210)

changes_made, acknowledged, and flat are now normalized to done. Singular and common variants (change_made, no_action, no_action_needed) were added preemptively. The three original aliases caused 5 wasted runs in the past 7 days. The fix eliminates unnecessary retries for agents that use domain language in their status field.
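A minimal sketch of this kind of alias normalization is shown below. It is not the parser's real code: normalize_status and the two tables are hypothetical names, the sketch assumes the preemptive variants also map to done, and the whitespace and case folding is an added assumption.

```python
from typing import Optional

# Aliases named in this report. Assumption: the preemptively added
# variants map to "done" as well.
STATUS_ALIASES = {
    "changes_made": "done",
    "acknowledged": "done",
    "flat": "done",
    "change_made": "done",
    "no_action": "done",
    "no_action_needed": "done",
}

# Assumption: the real parser accepts more canonical statuses than this.
CANONICAL_STATUSES = {"done"}


def normalize_status(raw: str) -> Optional[str]:
    """Map an agent-supplied status to a canonical one, or None if unrecognized."""
    key = raw.strip().lower().replace("-", "_").replace(" ", "_")
    if key in CANONICAL_STATUSES:
        return key
    return STATUS_ALIASES.get(key)
```

With this table, normalize_status("changes_made") returns "done", while a domain term such as funding_rate returns None and would still need a prompt-level fix rather than another alias.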

Fix: Zombie Tasks from Inactive Repos (PR #3214, closes #3212)

tick_recover_stuck_tasks was project-scoped — it only evaluated tasks for the repo currently being ticked. Tasks from projects removed from the active tick loop (e.g., gabrielkoerich/oblivion) were stuck in in_progress indefinitely with no recovery path. Tasks 140644 and 140647 from oblivion had been stuck since April 13, 2026 (47 days). The fix adds a global scan for orphaned in-progress tasks on every tick.
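The difference between the project-scoped scan and the global scan can be illustrated with a small sketch. The Task fields, the find_stuck_tasks helper, the one-hour threshold, and the example/active-repo name are all hypothetical; only the task ID, repo name, and dates for the oblivion task come from this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class Task:
    id: int
    repo: str
    status: str
    updated_at: datetime


def find_stuck_tasks(
    tasks: Iterable[Task],
    now: datetime,
    stuck_after: timedelta,
    repo: Optional[str] = None,
) -> List[Task]:
    """Return in-progress tasks idle for at least stuck_after.

    With repo set, this mimics the old project-scoped behaviour;
    with repo=None it scans every task, including inactive repos.
    """
    stuck = []
    for task in tasks:
        if task.status != "in_progress":
            continue
        if repo is not None and task.repo != repo:
            continue
        if now - task.updated_at >= stuck_after:
            stuck.append(task)
    return stuck


now = datetime(2026, 5, 30, tzinfo=timezone.utc)
tasks = [
    Task(140644, "gabrielkoerich/oblivion", "in_progress",
         datetime(2026, 4, 13, tzinfo=timezone.utc)),
    Task(1, "example/active-repo", "in_progress", now - timedelta(minutes=5)),
]

# Scoped to the active repo, the oblivion task is never examined.
scoped = find_stuck_tasks(tasks, now, timedelta(hours=1), repo="example/active-repo")
# The global scan finds it regardless of which repo is being ticked.
global_scan = find_stuck_tasks(tasks, now, timedelta(hours=1))
```

Here scoped is empty, while global_scan contains task 140644, which is the orphaned case the fix targets.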

Fix: ModelUnavailable 'not supported' → 7d Cooldown (PR #3217, closes #3215)

is_permanently_gone only matched "not found". The string "not supported" (emitted when an account type mismatches a model, e.g., gpt-5.2-codex on a ChatGPT plan) fell through to the transient 5min→4h backoff. gpt-5.2-codex accumulated 6 failed runs over 40 days, retrying every ~4 hours. Such errors now route to record_model_permanently_gone (4h→7d), just like "not found".
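The classification change can be sketched as below. Only the two marker strings and the 5min→4h and 4h→7d ranges come from this report; the case-insensitive substring match, the doubling schedule, and the next_cooldown helper are assumptions made for illustration.

```python
from datetime import timedelta

# "not supported" is the marker added by the fix; "not found" was already matched.
PERMANENT_MARKERS = ("not found", "not supported")


def is_permanently_gone(error_message: str) -> bool:
    """True when the provider error indicates the model will not come back soon."""
    text = error_message.lower()  # assumption: matching is case-insensitive
    return any(marker in text for marker in PERMANENT_MARKERS)


def next_cooldown(error_message: str, consecutive_failures: int) -> timedelta:
    """Pick a cooldown from the permanent (4h→7d) or transient (5min→4h) range."""
    if is_permanently_gone(error_message):
        start, cap = timedelta(hours=4), timedelta(days=7)
    else:
        start, cap = timedelta(minutes=5), timedelta(hours=4)
    # Assumption: the cooldown doubles per consecutive failure, up to the cap.
    # The exponent is bounded so the multiplication cannot overflow timedelta.
    steps = min(max(consecutive_failures - 1, 0), 16)
    return min(start * (2 ** steps), cap)
```

Under these assumptions, a "not supported" error starts at a 4h cooldown and reaches the 7d cap after repeated failures, instead of cycling through the transient range indefinitely.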

Service State (23:01 UTC)

CLI:     0.73.13
Service: 3467   0.73.16  ✗ mismatch — service needs upgrade
Latest:  0.73.18  ⚠  upgrade available

The auto-upgrade feature ships in v0.73.18, but the service is still on v0.73.16. The operator needs to do one manual upgrade to v0.73.18, after which the service will keep itself up to date:

brew update && brew upgrade orch && brew services restart orch
orch version  # expect: CLI and Service on 0.73.18, PID-bound

Agent/Model Health (Last 12h)

Agent    | Model                  | Outcome | Count
claude   | sonnet                 | success | 41
claude   | haiku                  | success | 28
codex    | gpt-5.3-codex          | success | 18
opencode | deepseek-v4-flash-free | success | 16
claude   | opus                   | success | 12
opencode | mimo-v2.5-free         | success | 6
kimi     | opus                   | success | 3
claude   | sonnet                 | failed  | 5
codex    | gpt-5.3-codex          | failed  | 3
claude   | haiku                  | failed  | 2
codex    | gpt-5.3-codex          | blocked | 2
glm      | opus                   | failed  | 1
codex    | gpt-5.2-codex          | failed  | 1

Key observations:

  • kimi returned (3 successes) — cleared from cooldown as predicted at ~21:00 UTC.
  • claude remains strong: sonnet 89% (41/46), haiku 93% (28/30), opus 100% (12/12).
  • codex recovery is continuing: 18 successes vs. 3 failures. The remaining failures are network-dependent jobs (Hyperliquid sync, Twitter trending) that appear to be pre-fix dispatches; the #3206 fix has been deployed since v0.73.15, so new dispatches should work.
  • glm:opus credit exhaustion (1 failure, 429 Insufficient balance) — glm re-entered a 1d23h cooldown. This is a provider billing issue, not a code bug.
  • gpt-5.2-codex: 1 final failure before the "not supported" fix kicks in — will now get 7d cooldown from next failure.
  • funding_rate unrecognized status (1 haiku failure) — not a normalizable alias; agent is emitting domain analysis as status. Needs prompt-level fix, not a parser fix.
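The per-model rates can be recomputed from the outcome table. The sketch below is a minimal illustration, assuming that a blocked run counts against the success rate (the report does not say how blocked runs are scored); success_rates is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Rows copied from the Agent/Model Health table: (agent, model, outcome, count).
ROWS = [
    ("claude", "sonnet", "success", 41),
    ("claude", "haiku", "success", 28),
    ("codex", "gpt-5.3-codex", "success", 18),
    ("opencode", "deepseek-v4-flash-free", "success", 16),
    ("claude", "opus", "success", 12),
    ("opencode", "mimo-v2.5-free", "success", 6),
    ("kimi", "opus", "success", 3),
    ("claude", "sonnet", "failed", 5),
    ("codex", "gpt-5.3-codex", "failed", 3),
    ("claude", "haiku", "failed", 2),
    ("codex", "gpt-5.3-codex", "blocked", 2),
    ("glm", "opus", "failed", 1),
    ("codex", "gpt-5.2-codex", "failed", 1),
]


def success_rates(rows: Iterable[Tuple[str, str, str, int]]) -> Dict[str, float]:
    """Return successes / total runs for each agent:model pair."""
    totals = defaultdict(lambda: {"success": 0, "total": 0})
    for agent, model, outcome, count in rows:
        key = f"{agent}:{model}"
        totals[key]["total"] += count
        if outcome == "success":
            totals[key]["success"] += count
    return {key: v["success"] / v["total"] for key, v in totals.items()}


rates = success_rates(ROWS)
for key, rate in sorted(rates.items()):
    print(f"{key}: {rate:.0%}")
```

Keying on agent:model keeps claude:opus separate from glm:opus and kimi:opus, which share a model name but have very different outcomes in this window.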

Active Cooldowns (23:01 UTC)

Key                                | Remaining | Reason
glm                                | 1d23h     | credit exhaustion (recurring)
glm:opus                           | 3h3m      | persisted
minimax                            | 1d23h     | re-entered cooldown during day
opencode:github-copilot/gpt-5-mini | 4d22h     | persisted

The morning report expected kimi, minimax, and glm to clear at ~21:00 UTC. kimi cleared as expected. minimax and glm re-entered cooldown during the day; both are credit/billing issues with their respective providers. This is a recurring pattern for glm (4th time this month) and for minimax.

What Went Well

  • Four real fixes merged and shipped in two releases in a single day — all four morning open issues resolved.
  • The auto-upgrade feature is the right architectural solution: it turns a recurring manual-intervention problem into a zero-touch operational baseline.
  • kimi returned cleanly without immediately re-entering cooldown.
  • Zombie task recovery from inactive repos closes a 47-day-old stuck state.
  • The "not supported" cooldown fix stops gpt-5.2-codex from wasting dispatches indefinitely.
  • Zero open GitHub issues entering tonight.

What Failed and Why

Problem                            | Root Cause                                             | Status
minimax re-entered cooldown        | Credit/billing issue at provider                       | Auto-clears; recurring, track frequency
glm re-entered cooldown            | Insufficient balance (429), 4th time this month        | Recurring billing issue; operator may need to recharge
Codex network jobs still failing   | Pre-fix dispatches + possibly some full-auto mode jobs | Monitor new dispatches post-0.73.16; may need codex sandbox config check
funding_rate unrecognized status   | Agent (haiku) used domain term as status               | Prompt tuning needed; not a parser issue
internal:149337 still blocked      | SSH key not loaded (Day 20)                            | OPERATOR: ssh-add required
CLI at 0.73.13, service at 0.73.16 | Auto-upgrade not yet deployed                          | Operator: one manual upgrade

Routing Accuracy

Routing is functioning correctly. The available pool is healthy for claude and opencode. Once deployed, the auto-upgrade feature will keep the running service current, so new model and routing fixes take effect automatically. The is_permanently_gone extension means future "not supported" model errors will enter the 7d cooldown path rather than retrying every 4h.

No routing-level bugs identified today.

Priorities for Tomorrow

CRITICAL (operator)

  1. Upgrade CLI and service to v0.73.18 — activates auto-upgrade, closes the deployment lag permanently:

    brew update && brew upgrade orch && brew services restart orch
    orch version   # expect: CLI and Service on 0.73.18, PID-bound
  2. Unblock internal:149337 (Day 20):

    ssh-add ~/.ssh/default_id_ed25519
    orch task unblock all

Monitoring

  1. Verify codex network access for new dispatches — confirm Hyperliquid sync and Twitter trending jobs succeed on new dispatches post-0.73.16. A continuing failure warrants checking the codex sandbox mode setting for these jobs.

  2. Watch minimax/glm cooldown re-entry pattern — glm has entered credit exhaustion 4 times this month. If it continues daily, the provider should be deprioritized in routing or the operator should recharge the account.

  3. funding_rate status from haiku — a single occurrence, but worth watching. If it recurs, the agent prompt for that task type needs to be explicit: "respond with status: done, no_changes_needed, etc." rather than letting the agent put domain output in the status field.

Maintenance

  1. Prune dead opencode model entries from ~/.orch/config.yml (github-copilot/gpt-5.3, github-copilot/claude-opus-4.6) — reduces router WARN noise every tick.

  2. Verify auto-upgrade activates after the manual upgrade to v0.73.18 — check logs for auto_upgrade: running brew upgrade orch message within the first hour of operation.


Prepared by Orch automation (internal:151123)
