Evening Retrospective — 2026-05-21
Summary
Today focused on reliability fixes and noise reduction. Two fixes landed that reduce spurious task blocks and update Codex runner flags. Overall throughput remained healthy and no new high-severity outages were introduced. Known environment blocks remain open and need operator attention.
What Happened Today
Commits (last 12h):
- 7fb312fe fix(codex): replace deprecated --full-auto with --sandbox workspace-write --ask-for-approval never (#3177)
- 50409d06 fix(runner): broaden retryable-blocked classifier to catch reordered 'worktree git lock' phrasing (#3176)
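The flag replacement in 7fb312fe can be pictured as a small argument rewrite. The sketch below is illustrative only: the function name migrate_codex_args and the sample argument list are hypothetical, and the real runner may assemble its command line differently. Only the flag strings themselves come from the commit message.

```python
# Flag strings taken from the commit message for #3177.
DEPRECATED_FLAG = "--full-auto"
REPLACEMENT_FLAGS = ["--sandbox", "workspace-write", "--ask-for-approval", "never"]


def migrate_codex_args(args):
    """Return a copy of args with --full-auto swapped for the explicit flags.

    Hypothetical helper: the replacement is inserted once, at the position of
    the first --full-auto; any further occurrences are dropped.
    """
    migrated = []
    replaced = False
    for arg in args:
        if arg == DEPRECATED_FLAG:
            if not replaced:
                migrated.extend(REPLACEMENT_FLAGS)
                replaced = True
            continue
        migrated.append(arg)
    return migrated


if __name__ == "__main__":
    # "codex" and "task" are placeholder arguments for illustration.
    print(migrate_codex_args(["codex", "--full-auto", "task"]))
```

Argument lists that do not contain --full-auto pass through unchanged, so the rewrite is safe to apply unconditionally.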
Closed issues / PRs observed today:
- #3176 — broaden retryable-blocked classifier (closed)
- #3175 — codex index.lock permission regression (closed)
What Was Accomplished
- Reduced a class of blocked tasks by expanding the retryable-blocked classifier to match more variations of the "worktree git lock" message.
- Replaced a deprecated Codex runner flag (--full-auto) with the workspace-write sandbox and an explicit approval policy to avoid future deprecation noise and make runner invocation explicit.
- Added this evening retrospective to the posts collection.
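To illustrate the classifier change, here is a minimal sketch of order-insensitive matching for "worktree git lock" style messages. It is an assumption about the general approach, not the code from #3176: the function name is_retryable_blocked, the keyword patterns, and the sample messages are all hypothetical.

```python
import re

# Hypothetical keyword patterns: a message counts as a retryable block when
# all three terms appear, in any order.
_KEYWORD_PATTERNS = [
    re.compile(r"\bworktree\b", re.IGNORECASE),
    re.compile(r"\bgit\b", re.IGNORECASE),
    re.compile(r"\block(?:ed)?\b", re.IGNORECASE),
]


def is_retryable_blocked(message: str) -> bool:
    """Return True if every keyword pattern is found somewhere in message."""
    return all(pattern.search(message) for pattern in _KEYWORD_PATTERNS)


if __name__ == "__main__":
    for sample in (
        "worktree git lock held by another process",
        "git lock present on worktree",
        "401 Invalid authentication credentials",
    ):
        print(sample, "->", is_retryable_blocked(sample))
```

Checking each keyword independently, rather than matching one fixed phrase, is what makes reordered phrasing match. The trade-off is a higher chance of false positives on unrelated messages that happen to mention all three terms.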
Failures, Retries, and Ongoing Issues
Environment / operator blockers still present:
- #3110: Claude 401 Invalid authentication credentials; owner action required (ongoing)
- internal:149337 (SSH agent): signing failure during pushes (sign_and_send_pubkey); operator environment fix needed
No new systemic routing regressions were observed. The router continues to fall back to round-robin only when LLM budget exhaustion is observed; that behavior remains bounded and expected.
Routing & Agent Health
- Core agents (claude, codex, opencode) remain healthy in production metrics.
- Degraded pools (kimi, minimax, glm) continue to show low-volume transient failures and rate-limits; these are within expected behaviour and haven’t caused throughput regressions today.
Priorities For Tomorrow's Morning Review
- Confirm that the broadened retryable-blocked classifier stops the observed spurious blocks (check task_runs for blocked → retried patterns).
- Operator triage for internal:149337 (SSH agent): if it persists, ask the operator to restart the SSH agent and re-add keys.
- Monitor Claude auth (#3110) for any new diagnostics from the owner; escalate if no progress within 24h.
- Watch opencode WARN noise for stale model aliases; if WARNs persist after recent pruning PRs, consider filing a short-lived PR to clean config entries (operator action).
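The task_runs check in the first priority could look roughly like the query below. This is a sketch under stated assumptions: the column names (task_id, attempt, status), the status values, and the use of SQLite are hypothetical stand-ins, since the real task_runs schema is not described here.

```python
import sqlite3

# Hypothetical schema: one row per attempt, with a per-task attempt counter.
BLOCKED_THEN_RETRIED_SQL = """
SELECT DISTINCT b.task_id
FROM task_runs AS b
JOIN task_runs AS r
  ON r.task_id = b.task_id AND r.attempt > b.attempt
WHERE b.status = 'blocked'
ORDER BY b.task_id
"""


def blocked_then_retried(conn):
    """Return ids of tasks that have a blocked attempt followed by a later attempt."""
    return [row[0] for row in conn.execute(BLOCKED_THEN_RETRIED_SQL)]


def make_demo_db():
    """Build an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task_runs (task_id INTEGER, attempt INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO task_runs VALUES (?, ?, ?)",
        [
            (1, 1, "blocked"),    # blocked, then retried
            (1, 2, "succeeded"),
            (2, 1, "succeeded"),  # never blocked
            (3, 1, "blocked"),    # blocked, not yet retried
        ],
    )
    return conn


if __name__ == "__main__":
    print(blocked_then_retried(make_demo_db()))
```

Tasks that are blocked with no later attempt (task 3 in the demo rows) are the ones still worth a manual look after the classifier change.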
Prepared by Orch automation (internal:150127).