Evening Retrospective — 2026-04-19

Today's work focused on stabilizing decode-path failures, improving rate-limit visibility, and continuing the GLM investigation started yesterday. Bugfixes landed across the store, router, and runner that surface errors instead of failing silently.

What we did

Merged fixes that propagate DB decode errors instead of silently falling back to defaults, so corrupted Task objects no longer go unnoticed and failures become observable (#2809, #2770, #2795).
Sanitized rate-limit error storage and improved summarization so task_runs.error contains the real error reason instead of raw api_retry blobs (#2840, #2802).
Hardened fallback JSON extraction and parser edge cases (quoted JSON string extraction) so that malformed output is less likely to be accepted as a successful parse (#2820, #2821).
Added a Zola docs check to CI, plus operational logging improvements that surface .orch.yml parse failures and token overflow warnings (#2823, #2826, #2819).
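
The decode-error change above can be illustrated with a minimal sketch. The Status enum, the DecodeError type, and both function names are hypothetical; the real Task schema and store code are not shown in this update. The sketch only contrasts a lossy decode that defaults with one that returns the failure to the caller.

```rust
use std::fmt;

// Hypothetical status column; the real Task schema is not shown in this update.
#[derive(Debug, PartialEq)]
enum Status {
    Pending,
    Running,
    Done,
}

// Error carrying the column name and the raw value that failed to decode.
#[derive(Debug, PartialEq)]
struct DecodeError {
    column: &'static str,
    raw: String,
}

impl fmt::Display for DecodeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "cannot decode column `{}` from {:?}", self.column, self.raw)
    }
}

impl std::error::Error for DecodeError {}

// Propagating shape: the caller receives the failure and can log or surface it.
fn decode_status(raw: &str) -> Result<Status, DecodeError> {
    match raw {
        "pending" => Ok(Status::Pending),
        "running" => Ok(Status::Running),
        "done" => Ok(Status::Done),
        other => Err(DecodeError {
            column: "status",
            raw: other.to_string(),
        }),
    }
}

// Lossy shape: an unknown value silently becomes Pending, hiding corrupt rows.
fn decode_status_lossy(raw: &str) -> Status {
    decode_status(raw).unwrap_or(Status::Pending)
}
```

With the lossy variant a misspelled value such as "dnoe" is indistinguishable from a genuinely pending task; with the propagating variant it becomes an error string that can be stored and investigated.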

What went well

Routing reliability increased for several high-use models: claude/sonnet and minimax/opus show higher success rates in the 24h window; codex remains a reliable fallback.
Version sync issue resolved: CLI and service are both at 0.69.49 today, removing a long-standing operational nuisance.
Error visibility improvements are high-leverage: making real error strings visible reduced investigation time and surfaced true root causes.

What failed or needs attention

glm/opus is severely rate-limited: 0% success in the 24h window (all 7 runs hit rate limits). Investigation continues (#2789, #2762); we are collecting raw run artifacts to determine whether this is model-side throttling or client-side retry behavior.
Nemotron parse errors remain: a small but persistent fraction of runs fail to parse. We need sample outputs to reproduce the parser fragility and fix it.
The cleanup timeout issue (git prune/pull timeouts) remains unassigned (#2746). It has a clear root cause in cleanup.rs but needs an owner.
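
The cleanup timeout item (#2746) comes down to bounding how long a child process may run. Below is a minimal sketch using only the Rust standard library. The helper name run_with_timeout and the 20 ms poll interval are assumptions made for illustration; the actual code in cleanup.rs is not shown in this update and may be structured differently (for example, around an async runtime).

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Run `cmd` and wait at most `timeout` for it to finish.
// Returns Ok(Some(status)) if the process exited in time,
// Ok(None) if it was killed after the deadline passed.
fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd.stdin(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait never blocks; it reports an exit status once one is available.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // The child may have exited since the last poll, so a kill error is ignored.
            let _ = child.kill();
            // Reap the process so it does not linger as a zombie.
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

A caller could pass something like Command::new("git").arg("pull") with a deadline of its choosing and treat Ok(None) as a timeout outcome to record, rather than letting the cleanup step hang.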

Routing accuracy and agent health

codex/gpt-5.3-codex: solid fallback, ~94% success.
minimax/opus: improved to ~96% success — currently one of the healthiest models.
claude/sonnet: healthy (~85%) after recent fixes.
glm/opus: critical — fully rate-limited and currently on cooldown (cooldown recorded ~4d+).
opencode/nemotron: moderate instability; parser fragility accounts for most failures.

Performance and bottlenecks

No service tick stalls observed today; engine tick cycles remain healthy.
External model rate limits are the dominant bottleneck (glm and some opencode models). These show up as rate_limit outcomes and reduce effective throughput for affected models.
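
To make the throughput effect concrete, per-model success rate and rate-limit share can be tallied from run outcomes. This is a sketch under stated assumptions: the Outcome enum, the Tally struct, and tally_by_model are hypothetical names, and the real outcome records are not shown in this update.

```rust
use std::collections::BTreeMap;

// Hypothetical run outcomes; only the rate_limit outcome is named in this update.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Success,
    RateLimit,
    Failed,
}

#[derive(Debug, Default, PartialEq)]
struct Tally {
    runs: u32,
    success: u32,
    rate_limited: u32,
}

impl Tally {
    // Fraction of runs that succeeded; 0.0 when there were no runs.
    fn success_rate(&self) -> f64 {
        if self.runs == 0 { 0.0 } else { self.success as f64 / self.runs as f64 }
    }

    // Fraction of runs that ended in a rate limit; 0.0 when there were no runs.
    fn rate_limit_share(&self) -> f64 {
        if self.runs == 0 { 0.0 } else { self.rate_limited as f64 / self.runs as f64 }
    }
}

// Group (model, outcome) pairs into one tally per model.
fn tally_by_model(runs: &[(&str, Outcome)]) -> BTreeMap<String, Tally> {
    let mut out = BTreeMap::new();
    for (model, outcome) in runs {
        let tally: &mut Tally = out.entry(model.to_string()).or_default();
        tally.runs += 1;
        match outcome {
            Outcome::Success => tally.success += 1,
            Outcome::RateLimit => tally.rate_limited += 1,
            Outcome::Failed => {}
        }
    }
    out
}
```

Seven rate-limited runs for one model, as reported for glm/opus today, yield a success rate of 0.0 and a rate-limit share of 1.0 under this tally.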

Learnings and prompts

Surface-only fixes (making errors visible) often have the highest ROI: once we stopped masking decode errors and raw api_retry blobs, the true failure modes became actionable.
Routing weight signals must be drained reliably after a tick; fixes to the weight signal drain reduced noisy routing behavior.
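
One way to drain queued signals without blocking the tick loop is sketched below. It assumes, purely for illustration, that weight signals arrive over a std::sync::mpsc channel and adjust a per-model weight by a delta; the WeightSignal type and drain_weight_signals function are hypothetical, and the router's real mechanism is not described in this update.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;

// Hypothetical signal: adjust one model's routing weight by `delta`.
struct WeightSignal {
    model: String,
    delta: f64,
}

// Apply every signal queued so far and return how many were consumed.
// `try_iter` yields only already-queued messages and never blocks,
// so calling this after each tick cannot stall the tick loop.
fn drain_weight_signals(
    rx: &Receiver<WeightSignal>,
    weights: &mut HashMap<String, f64>,
) -> usize {
    let mut applied = 0;
    for signal in rx.try_iter() {
        *weights.entry(signal.model).or_insert(0.0) += signal.delta;
        applied += 1;
    }
    applied
}
```

Returning the count makes it easy to log how many signals each tick consumed, which helps confirm that nothing is left queued between ticks.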

Actionable priorities for tomorrow (morning review)

Continue GLM investigation: collect the last 50 glm run artifacts and analyze retry headers and throttling responses. Decide whether to increase model cooldown or temporarily deprioritize glm in routing.
Assign #2746 (cleanup timeouts) and implement the fix in cleanup.rs (add proper timeouts / robust process handling for git prune/pull operations).
Capture nemotron failure samples and file a concise parser bug with examples to guide a targeted parser fix.
Confirm orch stream --pipe behavior with a live session to validate same-length diffing and reduce noisy broadcasts.
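
For the GLM item above, one input to the cooldown decision is the wait the server itself requests. The sketch below assumes the collected artifacts include raw Retry-After header values; whether the GLM endpoint actually sends that header is not confirmed in this update. Both function names are hypothetical. Only the delta-seconds form of Retry-After is parsed; the HTTP-date form is deliberately skipped.

```rust
use std::time::Duration;

// Parse the delta-seconds form of a Retry-After value, e.g. "120".
// The HTTP-date form is not handled and yields None.
fn parse_retry_after_secs(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

// Longest server-requested wait across the samples, never below `floor`.
// Unparseable samples are ignored; with no usable samples the floor is returned.
fn suggest_cooldown(samples: &[&str], floor: Duration) -> Duration {
    samples
        .iter()
        .filter_map(|value| parse_retry_after_secs(value))
        .max()
        .map_or(floor, |longest| longest.max(floor))
}
```

Taking the maximum is a conservative choice; a median or percentile may fit better if a few outliers dominate the 50 collected samples.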

Issues

No new GitHub issues filed from this retrospective. Existing tracking issues remain:
- #2789 — Collect GLM failing run artifacts (in progress)
- #2762 — GLM failure-rate investigation (parent)
- #2746 — git prune/pull timeout in cleanup.rs (unassigned)

Prepared by Orch automation (internal:146446).
