Evening Retrospective — 2026-04-18

2026-04-18

Today focused on stabilizing routing and fixing the database decode-path failures introduced earlier this week. The team merged a series of fixes that reduced silent failures and made the router and store more robust.

What we did

Merged multiple bugfixes addressing DB decode masking, routing continuation failures, and worktree cleanup timeouts. Key fixes included explicit propagation of decode errors (avoid silently-corrupted Task objects), adding timeouts to git worktree prune/pull, and preventing silent re-route fallbacks when route-store updates fail.
Drain logic for routing weight signals was fixed so routing learns from outcomes instead of dropping signals early.
A targeted set of fixes cleans up malformed rate-limit error blobs in task_runs.error, making error reasons visible again.
Morning review and retrospective posts for Apr 18 were prepared and posted.
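
The decode-error fix above can be pictured with a small sketch. This is not the actual store code: the Task fields, the row shape, and the DecodeError name are all hypothetical and chosen only for illustration. The point is that a malformed row raises an error instead of producing a Task filled with defaults.

```python
from dataclasses import dataclass


class DecodeError(Exception):
    """Raised when a stored row cannot be decoded into a Task (hypothetical name)."""


@dataclass
class Task:
    # Hypothetical two-field schema, used only for this illustration.
    id: int
    status: str


def decode_task(row: dict) -> Task:
    """Decode a row into a Task, propagating failures instead of masking them."""
    try:
        task_id = row["id"]
        status = row["status"]
    except KeyError as exc:
        # Propagate the failure; substituting a default here is what would
        # produce a silently corrupted Task.
        raise DecodeError(f"missing column {exc.args[0]!r}") from exc
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(task_id, bool) or not isinstance(task_id, int):
        raise DecodeError(f"column 'id' is not an integer: {task_id!r}")
    if not isinstance(status, str):
        raise DecodeError(f"column 'status' is not a string: {status!r}")
    return Task(id=task_id, status=status)
```

Callers then have to handle DecodeError explicitly, which keeps a bad row visible at the point where it is read.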

What went well

Routing reliability improved for several models: claude/sonnet and minimax/opus both show higher success rates in the latest 12h window.
Rate-limit sanitization reduced noisy error storage and lowered the actionable error rate (error events per dispatch).
The engine tick and task activity remained consistent; throughput stayed stable while the error rate dropped slightly compared to yesterday.

What failed or needs attention

glm/opus regressed in the last 12h: the success rate fell to ~69%, driven by repeated rate limits (5 rate-limit events in 16 runs). This is a new pattern and requires investigation: either the model is being throttled more aggressively, or our client-side retry/cooldown handling should be more conservative.
nemotron parser errors persist (2 parse errors in 6 recent runs, ~33%). Investigate whether the parser is fragile for this model's output or the model is emitting non-conforming responses.
Service/CLI version mismatch persists (CLI 0.69.28 vs Service 0.69.40). This is the sixth consecutive day the CLI has lagged the running service. The recommended operational fix remains running brew upgrade orch && brew services restart orch on the operator host, or automating a daily upgrade check per the monitoring playbook.
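
A daily upgrade check would need to decide whether the CLI lags the service. The sketch below shows only that comparison, assuming both versions are plain dotted integers like the ones reported above. How the two version strings are obtained is left out on purpose, and the function names are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '0.69.28' into (0, 69, 28)."""
    return tuple(int(part) for part in text.strip().split("."))


def cli_lags_service(cli: str, service: str) -> bool:
    """Return True when the CLI version is older than the service version."""
    # Compare numerically, component by component. A plain string comparison
    # would rank '0.69.9' above '0.69.40', which is wrong.
    return parse_version(cli) < parse_version(service)


# Versions taken from today's report.
needs_upgrade = cli_lags_service("0.69.28", "0.69.40")
```

When the check returns True, the operator (or an automated job) would run the brew commands mentioned above.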

Routing accuracy and agent health

codex (gpt-5.3-codex) remains solid at 100% for the sampled window and continues to be a reliable fallback for LLM-based routing.
claude/sonnet improved significantly since yesterday and is back to healthy levels (~89%).
glm/opus needs attention due to rate limiting: consider applying a higher cooldown for repeated rate limits, or temporarily deprioritizing glm in the routing pool until the root cause is found.
github-copilot models other than gpt-5-mini continue to fail; the cooldown machinery correctly excludes them.
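
One way to apply "a higher cooldown for repeated rate limits" is to double the cooldown for each consecutive rate_limit outcome, up to a cap. The sketch below is an illustration only, not the existing cooldown machinery; the function name, the 60-second base, and the one-hour cap are assumptions.

```python
def cooldown_seconds(
    consecutive_rate_limits: int,
    base: float = 60.0,   # assumed base cooldown, not a value from the service
    cap: float = 3600.0,  # assumed upper bound of one hour
) -> float:
    """Cooldown that doubles with each consecutive rate limit, up to a cap."""
    if consecutive_rate_limits <= 0:
        return 0.0
    # Bound the exponent so a very long streak cannot overflow a float.
    exponent = min(consecutive_rate_limits - 1, 32)
    return min(cap, base * 2 ** exponent)
```

With these assumed defaults, a first rate limit cools the model down for 60 seconds, a third consecutive one for 240 seconds, and long streaks settle at the cap. Resetting the streak counter after a successful run would let the model return to the pool at full priority.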

Performance and bottlenecks

No major service tick stalls were observed today; tick cycle logs are clean.
The notable bottleneck is external model rate limits (glm and some opencode models). These manifest as rate_limit outcomes and reduce effective throughput for affected models.
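
To make the effect of rate_limit outcomes on the reported percentages concrete, here is a small tally. It assumes the 11 glm/opus runs that were not rate limited all succeeded, which matches the ~69% figure but is not stated outright in the data above. The "success" label is also an assumption; only rate_limit appears in the report.

```python
from collections import Counter


def success_rate(outcomes: list[str]) -> float:
    """Fraction of runs whose outcome is 'success' (assumed label)."""
    if not outcomes:
        raise ValueError("no runs in window")
    return Counter(outcomes)["success"] / len(outcomes)


# glm/opus, last 12h: 16 runs, 5 of them rate limited (from the report).
glm_outcomes = ["rate_limit"] * 5 + ["success"] * 11
glm_rate = success_rate(glm_outcomes)  # 11 / 16 = 0.6875, reported as ~69%
```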

Learnings and prompts

Error visibility matters: obscured or raw api_retry blobs in task_runs.error hide root causes and slow debugging. The recent fix to sanitize and surface the real error string was high leverage.
Same-length diffing for stream capture reduces noisy output broadcasts; continue to validate orch stream --pipe behavior in real use.

Actionable priorities for tomorrow (morning review)

Investigate glm/opus rate-limit regression: gather last 50 glm run artifacts (stdout/stderr/output.json) and look for consistent retry headers, throttling responses, or client-side misclassification. If it's model-side throttling, increase model cooldown on rate_limit events. If client-side, harden retry/backoff.
Triage nemotron parse errors: inspect task_runs.output for nemotron failures and compare against parser expectations. If parser fixes are needed, file a small parser bug with examples.
Fix the version mismatch operationally: run brew upgrade orch && brew services restart orch on the operator host, and evaluate automating a daily upgrade tick to avoid recurring mismatch.
Assign and complete the cleanup timeout issue (git prune/pull timeouts) if still unassigned; it has a clear root cause in cleanup.rs and is medium complexity.
Confirm orch stream --pipe behavior with one or two live sessions to ensure same-length diffing behaves as intended in real usage.
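
For the cleanup timeout item, the general technique is to bound every external git call with a deadline so that a hung command cannot stall cleanup. The sketch below is a language-neutral illustration in Python, not the code in cleanup.rs; the helper name and the timeout values are assumptions.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> tuple[bool, str]:
    """Run a command and return (ok, detail); never block past timeout_s."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False, f"timed out after {timeout_s}s: {' '.join(cmd)}"
    if done.returncode != 0:
        return False, done.stderr.strip() or f"exit code {done.returncode}"
    return True, done.stdout


if __name__ == "__main__":
    # Stand-in for a hung command, so the demo does not need a git repository.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
    # In a real cleanup step the call might look like this (30s is an assumption):
    # run_with_timeout(["git", "worktree", "prune"], 30.0)
```

Returning a failure detail rather than raising keeps the timeout visible to the caller, which can then log it and move on to the next worktree.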

Issues filed

No new GitHub issues were filed during this retrospective. Existing problems are tracked in #2762 (glm investigation) and #2746 (cleanup timeouts) and remain the source of truth.

Prepared by Orch automation (internal:146229).
