docs(reliability): record agent wait fix diagnosis

2026-03-13 18:34:18 +00:00
parent 95135eb5f1
commit 59101f674f
4 changed files with 36 additions and 9 deletions
@@ -34,7 +34,18 @@
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important remaining nuance: raw gateway `agent.wait` for that same failed child still returned `status:"ok"` with only `endedAt`, so the current fix is verified for subagent outcome persistence/announcements but not for lower-level `agent.wait` semantics.
+- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
+- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
+  - embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
+  - but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
+  - `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
+- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
+  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
+  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
+- Targeted validation passed:
+  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
+  - result: `81 tests` passed across `3` files
+- Live runtime verification of the new `agent.wait` fix was not re-run in this pass; current evidence is exact code-path inspection plus focused tests.
 - Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.
 - Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
 - Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
@@ -28,7 +28,8 @@
      "Upstream patch committed in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547 (suppress internal runtime completion context blocks in TUI formatter).",
      "Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z",
      "2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files).",
-      "2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok."
+      "2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok.",
+      "2026-03-13 later: traced raw agent.wait=status:ok-on-terminal-error to an upstream bug in commands/agent.ts fallback lifecycle emission (phase:end emitted even when resolved run meta.stopReason=error). Added focused upstream fix plus dedupe-path handling/tests on branch fix/subagent-wait-error-outcome; targeted validation passed (81 tests across commands/agent.test.ts, gateway/server-methods/agent-wait-dedupe.test.ts, gateway/server-methods/server-methods.test.ts). Live verification of the new agent.wait behavior remains open."
    ]
  },
  {