Evening Retrospective — 2026-05-29
What Happened Today
Code Changes (3 fixes merged)
| Commit | PR | Description |
|---|---|---|
| 63d623b2 | #3203 | fix(version): PID-bind service.version (closes #3200) |
| 3c31f524 | #3206 | fix(runner): move codex -c flags after exec (closes #3205) |
| 11d623cb | #3207 | fix(runner): classify claude opus 400 thinking-block conflict as ThinkingBlockConflict (closes #3204) |
Critical Unblock: Service Restarted — Day 5 Streak Broken
The most urgent item from five consecutive morning reviews was finally resolved: the service was restarted and upgraded. The engine is no longer running Cellar/orch/0.73.8; it is now on 0.73.15.
Current state (23:01 UTC):
CLI: 0.73.13
Service: 40050 0.73.15 ✗ mismatch — CLI is behind, service is ahead
Latest: 0.73.16 ⚠ upgrade available

The PID-bound format in the service line confirms the new fix is live: 40050\t0.73.15 is the PID-stamped version, not the stale static file. The old "✓ in sync" lie is gone; the CLI now correctly reports a mismatch.
Fixes unblocked by the restart:
- #3190 (codex --ask-for-approval removal): now deployed; codex runs should no longer produce fake $0, 8-second "successes".
- #3198 (CLI arg-parse masking): now deployed; clap errors from any agent CLI will correctly fail instead of being promoted to needs_review.
Fix: Codex -c Flags After exec (PR #3206, closes #3205)
-c sandbox_workspace_write.network_access=true placed before the exec subcommand was silently ignored by the top-level codex CLI — it was parsed at the wrong level and never reached the exec sandbox initialiser. Result: codex workspace-write sandbox tasks had outbound network blocked despite the config promising otherwise. Fix moves both -c overrides to appear after exec. Regression assertions added to build_command_codex so this can't regress silently.
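
The ordering rule can be illustrated with a minimal Python sketch. This is not the project's build_command_codex; the helper names `build_codex_argv` and `overrides_follow_exec` are hypothetical, and only the one override named above is used because the second is not spelled out here.

```python
from typing import List


def build_codex_argv(prompt: str, overrides: List[str]) -> List[str]:
    """Build an argv list with every -c override placed after `exec`.

    Hypothetical helper for illustration only.
    """
    argv = ["codex", "exec"]
    for override in overrides:
        # Each override travels as its own `-c key=value` pair, after the subcommand.
        argv += ["-c", override]
    argv.append(prompt)
    return argv


def overrides_follow_exec(argv: List[str]) -> bool:
    """Regression-style check: no -c flag may appear before `exec`."""
    exec_index = argv.index("exec")
    return all(i > exec_index for i, arg in enumerate(argv) if arg == "-c")


NETWORK_OVERRIDE = "sandbox_workspace_write.network_access=true"

fixed = build_codex_argv("summarise the repo", [NETWORK_OVERRIDE])
broken = ["codex", "-c", NETWORK_OVERRIDE, "exec", "summarise the repo"]

assert overrides_follow_exec(fixed)
assert not overrides_follow_exec(broken)  # the pre-fix ordering is rejected
```

An assertion of this shape is one way a test could catch the flag drifting back in front of the subcommand.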
Fix: ThinkingBlockConflict Error Class (PR #3207, closes #3204)
Claude opus was encountering transient 400 errors during multi-turn tool-use loops:
"thinking or redacted_thinking blocks in the latest assistant message cannot be modified."
These errors were previously classified as AgentFailed, which incremented failure_count:claude:opus and fed into exponential backoff, risking premature cooldown of the highest-tier model. The new ThinkingBlockConflict variant maps to RetryableError::ModelUnavailable in the failover handler, which skips record_agent_failure_with_message and set_agent_cooldown while still re-routing. A fixture and 4 unit tests were added.
This explains the 4 claude:opus failures noted in the morning review — they were transient CLI conflicts, not model capacity issues.
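
The behaviour described above can be sketched in Python as follows. This is an illustrative model, not the runner's code: `ErrorClass`, `classify`, and `Failover` are hypothetical names, and the matching rule (HTTP 400 plus a substring of the quoted message) is an assumption about how such a classifier might work.

```python
from enum import Enum, auto
from typing import Dict, List


class ErrorClass(Enum):
    AGENT_FAILED = auto()
    THINKING_BLOCK_CONFLICT = auto()


# Substring of the 400 response quoted above (assumed to be a stable marker).
CONFLICT_MARKER = "blocks in the latest assistant message cannot be modified"


def classify(status: int, message: str) -> ErrorClass:
    """Separate the transient thinking-block conflict from real agent failures."""
    if status == 400 and CONFLICT_MARKER in message:
        return ErrorClass.THINKING_BLOCK_CONFLICT
    return ErrorClass.AGENT_FAILED


class Failover:
    """Toy failover handler: always re-routes, but only penalises real failures."""

    def __init__(self) -> None:
        self.failure_count: Dict[str, int] = {}
        self.rerouted: List[str] = []

    def handle(self, agent: str, error: ErrorClass) -> None:
        if error is not ErrorClass.THINKING_BLOCK_CONFLICT:
            # Only genuine failures feed the backoff counter.
            self.failure_count[agent] = self.failure_count.get(agent, 0) + 1
        self.rerouted.append(agent)  # re-routing happens in both cases


handler = Failover()
error = classify(
    400,
    "thinking or redacted_thinking blocks in the latest assistant message cannot be modified.",
)
handler.handle("claude:opus", error)
assert handler.failure_count.get("claude:opus", 0) == 0
assert handler.rerouted == ["claude:opus"]
```

The design point is that the task is still moved to another agent, while the counter that drives cooldown is left untouched.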
Fix: orch version PID-Binding (PR #3203, closes #3200)
The false "✓ in sync" issue documented over the past 5 days is now fixed and deployed. service.version is written as {pid}\t{version}, and orch version verifies that the owning PID is a live orch serve process before trusting the recorded version. A transient or failed-restart binary can no longer stamp the file while an older engine keeps running. Guards on the writer and on cleanup enforce mutual exclusion by PID ownership.
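
A minimal Python sketch of the read side, assuming a POSIX host. The function names are hypothetical, and the sketch only checks that the PID exists; it does not verify that the process is actually orch serve, which the real check is described as doing.

```python
import os
from typing import Callable, Optional, Tuple


def parse_service_version(text: str) -> Optional[Tuple[int, str]]:
    """Parse 'pid<TAB>version'; return None for any other shape."""
    pid_text, sep, version = text.strip().partition("\t")
    if not sep or not version:
        return None  # e.g. a bare version string with no PID stamp
    if not (pid_text.isascii() and pid_text.isdigit()) or int(pid_text) <= 0:
        return None
    return int(pid_text), version


def pid_is_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 tests for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def trusted_version(
    text: str, is_alive: Callable[[int], bool] = pid_is_alive
) -> Optional[str]:
    """Return the recorded version only if its owning PID is still running."""
    parsed = parse_service_version(text)
    if parsed is None:
        return None
    pid, version = parsed
    return version if is_alive(pid) else None


# The liveness probe is injectable so the logic can be exercised without real PIDs.
assert trusted_version("40050\t0.73.15", is_alive=lambda pid: True) == "0.73.15"
assert trusted_version("40050\t0.73.15", is_alive=lambda pid: False) is None
assert trusted_version("0.73.15", is_alive=lambda pid: True) is None
```

Returning None for an unstamped or orphaned file lets the caller report "unknown" rather than a false match.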
Active Cooldowns (23:01 UTC)
| Key | Remaining | Reason |
|---|---|---|
| glm | 21h55m | persisted |
| kimi | 21h18m | persisted |
| minimax | 21h51m | persisted |
| opencode:github-copilot/gpt-5-mini | 5d22h | persisted |
The kimi/minimax/glm triple degradation continues. All three expire at about 21:00 UTC tomorrow (2026-05-30); they have been in cooldown since yesterday. opencode:deepseek-v4-flash-free is no longer listed; it cleared as predicted.
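
The expiry times follow from adding each remaining duration to the 23:01 UTC snapshot. The Python sketch below works that arithmetic through; `parse_remaining` is a hypothetical helper, and it assumes the table values are exact rather than rounded.

```python
import re
from datetime import datetime, timedelta, timezone

# Accepts forms such as "21h55m" or "5d22h" (days, hours, minutes all optional).
_REMAINING = re.compile(r"^(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?$")


def parse_remaining(text: str) -> timedelta:
    """Convert a compact duration like '21h55m' into a timedelta."""
    match = _REMAINING.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


snapshot = datetime(2026, 5, 29, 23, 1, tzinfo=timezone.utc)
cooldowns = {
    "glm": "21h55m",
    "kimi": "21h18m",
    "minimax": "21h51m",
    "opencode:github-copilot/gpt-5-mini": "5d22h",
}
expiries = {key: snapshot + parse_remaining(rem) for key, rem in cooldowns.items()}

for key, when in expiries.items():
    print(f"{key}: {when:%Y-%m-%d %H:%M} UTC")
```

Under that assumption, kimi, minimax, and glm expire between 20:19 and 20:56 UTC on 2026-05-30, and the gpt-5-mini entry expires at 21:01 UTC on 2026-06-04.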
Open Issues
None. The issue board is completely clean — no open issues in the repository.
What Went Well
- The five-day critical unblock happened: service is now on 0.73.15, codex is unblocked.
- Three meaningful fixes merged in one day, all with tests and proper error classification.
- orch version now shows truthful output: the 40050\t0.73.15 PID-bound format confirms the fix is live.
- The ThinkingBlockConflict fix prevents a real correctness hazard: opus getting wrongly cooled for a transient multi-turn loop error.
- Zero open issues on the board.
What Failed and Why
| Problem | Root Cause | Status |
|---|---|---|
| kimi/minimax/glm still cooled | Billing cycle exhaustion / agent errors from prior days | Auto-clears ~21:00 UTC tomorrow |
| internal:149337 blocked (19 days) | SSH agent not loaded; no ssh-add | OPERATOR: ssh-add required |
| CLI at 0.73.13, service at 0.73.15 | Service upgraded further than CLI | Operator: brew upgrade orch on CLI machine |
| Latest 0.73.16 not deployed | Operator hasn't run final upgrade step | brew update && brew upgrade orch && brew services restart orch |
Routing Accuracy
The routing system is functioning correctly. The available pool is recovering: codex is unblocked (restart deployed), deepseek-v4-flash-free cleared, claude remains strong. The kimi/minimax/glm pool will return tomorrow. ThinkingBlockConflict classification prevents false opus penalties.
No routing-level bugs identified. The pre-emptive routability checks correctly skip cooled agents.
Priorities for Tomorrow
CRITICAL (operator)
Unblock internal:149337 (Day 19):

    ssh-add ~/.ssh/default_id_ed25519
    orch task unblock all

Complete the upgrade to 0.73.16:

    brew update && brew upgrade orch && brew services restart orch
    orch version   # verify: PID-bound output, CLI and Service in sync
Monitoring
- Verify codex recovery post-restart: runs should show non-zero cost, runtime >30s, real work product. A $0/8s run is still the fake-success pattern (but should no longer occur with 0.73.15 deployed).
- kimi/minimax/glm cooldown expiry (~21:00 UTC): confirm they re-enter routing pool cleanly without immediately re-entering cooldown.
- Watch claude:opus ThinkingBlockConflict rate — new error class. If it occurs frequently, it warrants investigation of the multi-turn loop logic.
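
The codex check in the first item can be expressed as a small heuristic. The Python sketch below is illustrative only: `RunSummary` and `looks_healthy` are hypothetical names, the thresholds are taken from the bullet above, and "real work product" is not something this check can assess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    cost_usd: float
    runtime_s: float


MIN_RUNTIME_S = 30.0  # threshold from the monitoring note above


def looks_healthy(run: RunSummary) -> bool:
    """A run passes only with non-zero cost and a runtime above the threshold."""
    return run.cost_usd > 0 and run.runtime_s > MIN_RUNTIME_S


# The $0/8s fake-success pattern fails the check.
assert not looks_healthy(RunSummary(cost_usd=0.0, runtime_s=8.0))
# Made-up numbers for a plausible real run.
assert looks_healthy(RunSummary(cost_usd=0.42, runtime_s=95.0))
```

A run that fails this check is only a candidate for inspection; a short, cheap run can still be legitimate.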
Maintenance
- Prune dead opencode model entries from ~/.orch/config.yml (github-copilot/gpt-5.3, github-copilot/claude-opus-4.6); this reduces router WARN noise every tick.
Prepared by Orch automation (internal:150908)