Evening Retrospective — 2026-05-29

What Happened Today

Code Changes (3 fixes merged)

Commit	PR	Description
`63d623b2`	#3203	fix(version): PID-bind service.version — closes #3200
`3c31f524`	#3206	fix(runner): move codex -c flags after exec — closes #3205
`11d623cb`	#3207	fix(runner): classify claude opus 400 thinking-block conflict as ThinkingBlockConflict — closes #3204

Critical Unblock: Service Restarted — Day 5 Streak Broken

The most urgent item from five consecutive morning reviews was finally resolved: the service was restarted and upgraded. The engine is no longer running Cellar/orch/0.73.8; it is now on 0.73.15.

Current state (23:01 UTC):

CLI:     0.73.13
Service: 40050   0.73.15  ✗ mismatch — CLI is behind, service is ahead
Latest:  0.73.16  ⚠  upgrade available

The PID-bound format in the service line confirms the new fix is live: 40050\t0.73.15 is the PID-stamped version, not the stale static file. The false "✓ in sync" report is gone, and the CLI now correctly reports a mismatch.

Fixes unblocked by the restart:

#3190 (codex --ask-for-approval removal) — now deployed; codex runs should no longer produce fake $0 8-second "successes"
#3198 (CLI arg-parse masking) — now deployed; clap errors from any agent CLI will correctly fail instead of being promoted to needs_review

Fix: Codex -c Flags After exec (PR #3206, closes #3205)

-c sandbox_workspace_write.network_access=true, when placed before the exec subcommand, was silently ignored by the top-level codex CLI: it was parsed at the wrong level and never reached the exec sandbox initialiser. As a result, codex workspace-write sandbox tasks had outbound network blocked despite the config promising otherwise. The fix moves both -c overrides to appear after exec. Regression assertions were added to build_command_codex so the ordering can't regress silently.
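The ordering rule can be illustrated with a minimal Python sketch. It is not the project's code: the function signature, the prompt argument, and the overrides_follow_exec helper are assumptions made for illustration, and only the one override named above is shown.

```python
from typing import List, Sequence


def build_command_codex(prompt: str, overrides: Sequence[str]) -> List[str]:
    """Build a codex argv with every -c override placed after `exec`.

    Hypothetical signature: the real build_command_codex is not shown in
    this report, so the parameters here are illustrative only.
    """
    argv = ["codex", "exec"]
    for override in overrides:
        argv += ["-c", override]
    argv.append(prompt)
    return argv


def overrides_follow_exec(argv: Sequence[str]) -> bool:
    """Regression check: no -c flag may appear before the exec subcommand."""
    exec_index = list(argv).index("exec")
    return all(i > exec_index for i, arg in enumerate(argv) if arg == "-c")


fixed = build_command_codex(
    "run the task",
    ["sandbox_workspace_write.network_access=true"],
)
assert overrides_follow_exec(fixed)

# The pre-fix ordering, with -c ahead of exec, fails the same check.
broken = ["codex", "-c", "sandbox_workspace_write.network_access=true", "exec", "run the task"]
assert not overrides_follow_exec(broken)
```

Asserting on argument positions rather than on the full command string keeps the check valid if other flags are added later.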

Fix: ThinkingBlockConflict Error Class (PR #3207, closes #3204)

Claude opus was encountering transient 400 errors during multi-turn tool-use loops:

"thinking or redacted_thinking blocks in the latest assistant message cannot be modified."

These errors were previously classified as AgentFailed, which incremented failure_count:claude:opus and fed into exponential backoff, risking premature cooldown of the highest-tier model. The new ThinkingBlockConflict variant maps to RetryableError::ModelUnavailable in the failover handler, which skips record_agent_failure_with_message and set_agent_cooldown while still re-routing. A fixture and 4 unit tests were added.
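A rough Python sketch of the classification and its effect on failure accounting follows. The class and method names (ErrorClass, FailoverState, classify, handle) are hypothetical stand-ins rather than the engine's own types, and backoff and cooldown logic is left out.

```python
from enum import Enum, auto


class ErrorClass(Enum):
    AGENT_FAILED = auto()
    THINKING_BLOCK_CONFLICT = auto()


# Substring of the 400 response quoted above.
THINKING_CONFLICT_MARKER = (
    "thinking or redacted_thinking blocks in the latest assistant message "
    "cannot be modified"
)


def classify(status: int, message: str) -> ErrorClass:
    """Separate the transient thinking-block 400 from real agent failures."""
    if status == 400 and THINKING_CONFLICT_MARKER in message:
        return ErrorClass.THINKING_BLOCK_CONFLICT
    return ErrorClass.AGENT_FAILED


class FailoverState:
    """Hypothetical stand-in for the failover handler's bookkeeping."""

    def __init__(self) -> None:
        self.failure_count: dict = {}
        self.rerouted: list = []

    def handle(self, agent_key: str, error_class: ErrorClass) -> None:
        # Only genuine failures count toward backoff and cooldown.
        if error_class is ErrorClass.AGENT_FAILED:
            self.failure_count[agent_key] = self.failure_count.get(agent_key, 0) + 1
        # Both classes still re-route the task to another agent.
        self.rerouted.append(agent_key)


state = FailoverState()
error = classify(400, "thinking or redacted_thinking blocks in the latest "
                      "assistant message cannot be modified.")
state.handle("claude:opus", error)
assert state.failure_count == {}          # no penalty recorded
assert state.rerouted == ["claude:opus"]  # task was still re-routed
```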

This explains the 4 claude:opus failures noted in the morning review — they were transient CLI conflicts, not model capacity issues.

Fix: orch version PID-Binding (PR #3203, closes #3200)

The false "✓ in sync" issue documented over the past 5 days is now fixed and deployed. service.version is written as {pid}\t{version}, and orch version verifies the owning PID is an alive orch serve process before trusting the recorded version. A transient/failed-restart binary can no longer stamp the file while an older engine keeps running. Guards on the writer and on cleanup enforce mutual exclusion by PID ownership.
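The read side of this scheme might look roughly like the Python sketch below. The function names are hypothetical, the liveness probe is POSIX-only, and it checks only that the PID exists. Confirming that the process is really orch serve, as the fix does, would need an extra check that is not shown here.

```python
import os
from typing import Callable, Optional, Tuple


def parse_service_version(text: str) -> Tuple[Optional[int], str]:
    """Split a `{pid}<TAB>{version}` record; a bare version has no PID."""
    stripped = text.strip()
    pid_part, sep, version = stripped.partition("\t")
    if sep and pid_part.isdigit():
        return int(pid_part), version
    return None, stripped


def pid_is_alive(pid: int) -> bool:
    """POSIX probe: signal 0 tests for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def trusted_version(
    text: str, is_live_server: Callable[[int], bool] = pid_is_alive
) -> Optional[str]:
    """Return the recorded version only if its owning PID passes the check."""
    pid, version = parse_service_version(text)
    if pid is None or not is_live_server(pid):
        return None
    return version


# A record whose owner is alive is trusted; a stale or unbound one is not.
assert trusted_version("40050\t0.73.15", lambda pid: True) == "0.73.15"
assert trusted_version("40050\t0.73.15", lambda pid: False) is None
assert trusted_version("0.73.8", lambda pid: True) is None
```

Passing the liveness check in as a parameter keeps the parsing logic testable without a running service.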

Active Cooldowns (23:01 UTC)

Key	Remaining	Reason
glm	21h55m	persisted
kimi	21h18m	persisted
minimax	21h51m	persisted
opencode:github-copilot/gpt-5-mini	5d22h	persisted

The kimi/minimax/glm triple degradation continues. All three expire by about 21:00 UTC tomorrow (2026-05-30); they have been in cooldown since yesterday. opencode:deepseek-v4-flash-free is no longer listed; it cleared as predicted.
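The expiry estimates follow from adding each remaining duration to the 23:01 UTC snapshot. The short Python sketch below works this out. The parse_remaining helper is an assumption for illustration, and because the table shows whole minutes, the results are approximate to the minute.

```python
import re
from datetime import datetime, timedelta, timezone

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def parse_remaining(text: str) -> timedelta:
    """Parse durations such as '21h55m' or '5d22h' into a timedelta."""
    parts = re.findall(r"(\d+)([dhm])", text)
    return timedelta(seconds=sum(int(n) * UNIT_SECONDS[u] for n, u in parts))


snapshot = datetime(2026, 5, 29, 23, 1, tzinfo=timezone.utc)
remaining = {
    "glm": "21h55m",
    "kimi": "21h18m",
    "minimax": "21h51m",
    "opencode:github-copilot/gpt-5-mini": "5d22h",
}

for key, text in remaining.items():
    expiry = snapshot + parse_remaining(text)
    print(f"{key}: {expiry:%Y-%m-%d %H:%M} UTC")

# glm: 2026-05-30 20:56 UTC
# kimi: 2026-05-30 20:19 UTC
# minimax: 2026-05-30 20:52 UTC
# opencode:github-copilot/gpt-5-mini: 2026-06-04 21:01 UTC
```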

Open Issues

None. The issue board is completely clean — no open issues in the repository.

What Went Well

The five-day critical unblock happened: service is now on 0.73.15, codex is unblocked.
Three meaningful fixes merged in one day, all with tests and proper error classification.
orch version now shows truthful output — the 40050\t0.73.15 PID-bound format confirms the fix is live.
The ThinkingBlockConflict fix prevents a real correctness hazard: opus getting wrongly cooled for a transient multi-turn loop error.
Zero open issues on the board.

What Failed and Why

Problem	Root Cause	Status
kimi/minimax/glm still cooled	Billing cycle exhaustion / agent errors from prior days	Auto-clears ~21:00 UTC tomorrow
internal:149337 blocked (19 days)	SSH agent not loaded; no `ssh-add`	Operator: `ssh-add` required
CLI at 0.73.13, service at 0.73.15	Service upgraded further than CLI	Operator: `brew upgrade orch` on CLI machine
Latest 0.73.16 not deployed	Operator hasn't run final upgrade step	Operator: `brew update && brew upgrade orch && brew services restart orch`

Routing Accuracy

The routing system is functioning correctly. The available pool is recovering: codex is unblocked (restart deployed), deepseek-v4-flash-free cleared, claude remains strong. The kimi/minimax/glm pool will return tomorrow. ThinkingBlockConflict classification prevents false opus penalties.

No routing-level bugs identified. The pre-emptive routability checks correctly skip cooled agents.

Priorities for Tomorrow

CRITICAL (operator)

Unblock internal:149337 (Day 19):

ssh-add ~/.ssh/default_id_ed25519
orch task unblock all

Complete the upgrade to 0.73.16:

brew update && brew upgrade orch && brew services restart orch
orch version   # verify: PID-bound output, CLI and Service in sync

Monitoring

Verify codex recovery post-restart — runs should show non-zero cost, runtime >30s, real work product. A $0/8s run is still the fake-success pattern (but should no longer occur with 0.73.15 deployed).
kimi/minimax/glm cooldown expiry (~21:00 UTC) — confirm they re-enter routing pool cleanly without immediately re-entering cooldown.
Watch claude:opus ThinkingBlockConflict rate — new error class. If it occurs frequently, it warrants investigation of the multi-turn loop logic.
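The codex recovery check above can be expressed as a small heuristic. The Python sketch below is only an illustration: RunSummary and its fields are hypothetical, and it conservatively flags a run when either signal is off, which is an assumption rather than a rule stated in this report. It cannot judge whether a run produced real work product.

```python
from dataclasses import dataclass

# Threshold taken from the monitoring note: healthy runs last longer than 30s.
MIN_RUNTIME_SECONDS = 30.0


@dataclass
class RunSummary:
    """Hypothetical record of one codex run; field names are illustrative."""
    cost_usd: float
    runtime_seconds: float


def looks_like_fake_success(run: RunSummary) -> bool:
    """Flag runs with zero cost or a runtime at or below the threshold."""
    return run.cost_usd == 0 or run.runtime_seconds <= MIN_RUNTIME_SECONDS


assert looks_like_fake_success(RunSummary(cost_usd=0.0, runtime_seconds=8.0))
assert not looks_like_fake_success(RunSummary(cost_usd=0.42, runtime_seconds=95.0))
```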

Maintenance

Prune dead opencode model entries from ~/.orch/config.yml (github-copilot/gpt-5.3, github-copilot/claude-opus-4.6) — reduces router WARN noise every tick.

Prepared by Orch automation (internal:150908)
