Evening Retrospective — 2026-04-18
Today focused on stabilizing routing and database decode-path failures introduced earlier this week. The team merged a series of fixes that reduced silent failures and made the router and store more robust.
What we did
- Merged multiple bugfixes addressing DB decode masking, routing continuation failures, and worktree cleanup timeouts. Key fixes included explicit propagation of decode errors (avoid silently-corrupted Task objects), adding timeouts to git worktree prune/pull, and preventing silent re-route fallbacks when route-store updates fail.
- Drain logic for routing weight signals was fixed so routing learns from outcomes instead of dropping signals early.
- A targeted set of fixes cleans up malformed rate-limit error blobs in `task_runs.error`, making error reasons visible again.
- Morning review and retrospective posts for Apr 18 were prepared and posted.
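The timeout fix for git worktree prune/pull follows a common pattern: bound every external command so a hung process cannot stall cleanup. The sketch below only illustrates that pattern in Python and is not the project's implementation. The helper name `run_with_timeout` and its return shape are assumptions made for this example.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, detail); a hung command is killed after timeout_s."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no process lingers.
        return False, f"timed out after {timeout_s}s"
    except OSError as exc:
        # The binary is missing or cannot be executed.
        return False, f"failed to start: {exc}"
    if result.returncode != 0:
        # Surface the real error text instead of silently continuing.
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, result.stdout.strip()
```

A caller could invoke it as `run_with_timeout(["git", "worktree", "prune"], timeout_s=30)`, where the 30-second budget is purely illustrative. It would then log the returned detail on failure rather than blocking the cleanup pass.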
What went well
- Routing reliability improved for several models: claude/sonnet and minimax/opus both show higher success rates in the latest 12h window.
- Rate-limit sanitization reduced noisy error storage and lowered the actionable error count (error events / dispatch).
- The engine tick and task activity remained consistent; throughput stayed stable while the error rate dropped slightly compared to yesterday.
What failed or needs attention
- glm/opus regressed in the last 12h: its success rate fell to ~69%, driven by repeated rate limits (5 rate-limit events in 16 runs). This is a new pattern and requires investigation: either the model is being throttled more aggressively, or our client-side retry/cooldown handling should be more conservative.
- nemotron parser errors persist (2 parse errors in 6 recent runs, ~33%). Investigate whether the parser is fragile for this model's output or the model is emitting non-conforming responses.
- Service/CLI version mismatch persists (CLI 0.69.28 vs Service 0.69.40). This is the sixth consecutive day the CLI has lagged the running service; the recommended operational fix remains running `brew upgrade orch && brew services restart orch` on the operator host, or automating a daily upgrade check per the monitoring playbook.
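The percentages quoted above can be checked with a short calculation. This sketch assumes the 5 rate-limit events were the only failures among the 16 glm/opus runs, which is consistent with the reported ~69% but is not stated outright. The function name `success_rate` is hypothetical.

```python
def success_rate(failures, runs):
    """Fraction of runs that did not fail; rejects empty or inconsistent windows."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")
    return (runs - failures) / runs


# Assumption: rate limits account for every glm/opus failure in the window.
glm_success = success_rate(failures=5, runs=16)        # 11 of 16 runs succeeded
nemotron_error = 1 - success_rate(failures=2, runs=6)  # 2 of 6 runs failed to parse

print(f"glm/opus success rate: {glm_success:.1%}")
print(f"nemotron parse-error rate: {nemotron_error:.1%}")
```

With samples this small (16 and 6 runs), a single additional failure moves the rate by several points, so these figures are better read as signals to investigate than as stable measurements.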
Routing accuracy and agent health
- codex (gpt-5.3-codex) remains solid at 100% for the sampled window and continues to be a reliable fallback for LLM-based routing.
- claude/sonnet improved significantly since yesterday and is back to healthy levels (~89%).
- glm/opus needs attention due to rate limiting — consider applying a higher cooldown for repeated rate limits or temporarily deprioritizing glm in the routing pool until the root cause is found.
- github-copilot models other than gpt-5-mini are still failing; the cooldown machinery correctly excludes them.
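One way to apply a higher cooldown for repeated rate limits is to lengthen the cooldown with each consecutive rate_limit outcome and reset it on any other outcome. The sketch below is a minimal illustration under assumed parameters, not a description of the existing cooldown machinery. The base of 60 seconds, the doubling factor, the 30-minute cap, and the names `cooldown_seconds` and `RateLimitTracker` are all hypothetical.

```python
def cooldown_seconds(consecutive_rate_limits, base_s=60.0, factor=2.0, cap_s=1800.0):
    """Cooldown that grows with each consecutive rate limit, up to cap_s."""
    if consecutive_rate_limits <= 0:
        return 0.0
    # Clamp the exponent so a very long streak cannot overflow the float.
    exponent = min(consecutive_rate_limits - 1, 32)
    return min(cap_s, base_s * factor ** exponent)


class RateLimitTracker:
    """Tracks consecutive rate_limit outcomes per model (hypothetical helper)."""

    def __init__(self):
        self.streak = {}

    def record(self, model, outcome):
        """Record one run outcome and return the cooldown to apply, in seconds."""
        if outcome == "rate_limit":
            self.streak[model] = self.streak.get(model, 0) + 1
        else:
            # Any other outcome ends the streak and clears the cooldown.
            self.streak[model] = 0
        return cooldown_seconds(self.streak[model])
```

Resetting on the first non-rate-limited run keeps a briefly throttled model from being deprioritized longer than necessary, while the cap bounds how long a persistently throttled model stays out of the pool.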
Performance and bottlenecks
- No major service tick stalls were observed today; tick cycle logs are clean.
- The notable bottleneck is external model rate limits (glm and some opencode models). These manifest as rate_limit outcomes and reduce effective throughput for affected models.
Learnings and prompts
- Error visibility matters: obscured or raw api_retry blobs in `task_runs.error` hide root causes and slow debugging. The recent fix to sanitize and surface the real error string was high leverage.
- Same-length diffing for stream capture reduces noisy output broadcasts; continue to validate `orch stream --pipe` behavior in real use.
Actionable priorities for tomorrow (morning review)
- Investigate glm/opus rate-limit regression: gather last 50 glm run artifacts (stdout/stderr/output.json) and look for consistent retry headers, throttling responses, or client-side misclassification. If it's model-side throttling, increase model cooldown on rate_limit events. If client-side, harden retry/backoff.
- Triage nemotron parse errors: inspect `task_runs.output` for nemotron failures and compare against parser expectations. If parser fixes are needed, file a small parser bug with examples.
- Fix the version mismatch operationally: run `brew upgrade orch && brew services restart orch` on the operator host, and evaluate automating a daily upgrade tick to avoid a recurring mismatch.
- Assign and complete the cleanup timeout issue (git prune/pull timeouts) if still unassigned; it has a clear root cause in cleanup.rs and is medium complexity.
- Confirm `orch stream --pipe` behavior with one or two live sessions to ensure same-length diffing behaves as intended in real usage.
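If the daily upgrade check is automated, its core is a numeric comparison of the two version strings. The sketch below assumes both versions are plain dotted integers such as 0.69.28 and that they have already been obtained by some other means. How to query them is not covered here, and the function names are hypothetical.

```python
def parse_version(text):
    """Turn '0.69.28' into (0, 69, 28) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def cli_lags_service(cli_version, service_version):
    """True when the CLI is older than the running service."""
    return parse_version(cli_version) < parse_version(service_version)


# Values taken from today's mismatch report.
print(cli_lags_service("0.69.28", "0.69.40"))  # True
```

Comparing parsed tuples rather than raw strings matters here: as strings, "0.69.9" sorts after "0.69.40", which would hide a real lag.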
Issues filed
- No new GitHub issues were filed during this retrospective. Existing problems are tracked in #2762 (glm investigation) and #2746 (cleanup timeouts) and remain the source of truth.
Prepared by Orch automation (internal:146229).