docs(reliability): record live agent.wait blocker evidence

This commit is contained in:
zap
2026-03-13 18:56:52 +00:00
parent 0c25426974
commit f2b99841af
4 changed files with 68 additions and 10 deletions


@@ -45,7 +45,34 @@
- Targeted validation passed:
- `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
- result: `81 tests` passed across `3` files
- Live runtime verification of the new `agent.wait` fix was not re-run in this pass; the evidence at that point was direct code-path inspection plus the focused tests above.
- Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does **not** hold on the live direct gateway path.
- first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as a transport sanity check, not as proof of the canonical semantics
- stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch:
- session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572`
- run id: `gwc-live-agent-wait-gpt53-1773427893583`
- `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}`
- last assistant still recorded `stopReason:"error"` with `context_length_exceeded`
- decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with:
- `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
- gateway log confirmed default model `openai-codex/gpt-5.3-codex`
- session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
- run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
- payload chars: `880150`
- start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
- `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
- same session's terminal assistant message still recorded:
- `provider:"openai-codex"`
- `model:"gpt-5.3-codex"`
- `stopReason:"error"`
- `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
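The divergence this repro captures can be expressed as a tiny check. All type shapes and names below (`WaitResult`, `AssistantMessage`, `waitStatusDiverges`) are assumptions for illustration, not the gateway's real types:

```typescript
// Assumed shapes for illustration only -- not the gateway's actual types.
type WaitResult = { runId: string; status: "ok" | "error"; endedAt?: number };
type AssistantMessage = { stopReason: string; errorMessage?: string };

// True when agent.wait reported success but the session's terminal assistant
// message recorded a provider error -- the mismatch captured in this repro.
function waitStatusDiverges(
  wait: WaitResult,
  lastAssistant: AssistantMessage,
): boolean {
  return wait.status === "ok" && lastAssistant.stopReason === "error";
}

// The live source-gateway repro, expressed in these assumed shapes:
const wait: WaitResult = {
  runId: "gwc-live-agent-wait-gpt53-source-1773427981614",
  status: "ok",
  endedAt: 1773427984243,
};
const lastAssistant: AssistantMessage = {
  stopReason: "error",
  errorMessage: "Codex error: context_length_exceeded",
};
console.log(waitStatusDiverges(wait, lastAssistant)); // true
```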
- Fast source inspection after that live repro points to the most likely remaining gap:
- `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded`
- `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
- `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"`
- so the likeliest still-open live bug is an inner lifecycle emitter that marks terminal assistant/provider failures as `end`, early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path can intervene
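The suspected race above can be sketched as a self-contained simulation. `resolveWaitStatus` and its phase mapping are illustrative assumptions, not the actual code in `src/commands/agent.ts` or `src/gateway/server-methods/agent-job.ts`:

```typescript
// Self-contained simulation of the suspected race; names and logic are
// assumptions for illustration, not the real agent/agent-job implementation.
type Phase = "start" | "end" | "error";

function resolveWaitStatus(
  innerPhases: Phase[],
  runFailed: boolean,
): "ok" | "error" {
  let lifecycleEnded = false;
  let terminal: "ok" | "error" | null = null;

  const emit = (phase: Phase) => {
    if (phase === "end" || phase === "error") lifecycleEnded = true;
    // agent-job.ts-style mapping: the first terminal phase wins, and
    // phase:"end" is treated as terminal status:"ok".
    if (terminal === null && phase === "end") terminal = "ok";
    else if (terminal === null && phase === "error") terminal = "error";
  };

  // Inner lifecycle callbacks fire first; here one reports "end" even
  // though the run actually failed.
  for (const phase of innerPhases) emit(phase);

  // The corrective phase:"error" emission is gated on !lifecycleEnded,
  // so once any inner callback has ended the lifecycle it never fires.
  if (runFailed && !lifecycleEnded) emit("error");

  return terminal ?? "error";
}

// A failed run whose inner lifecycle emitted "end" still resolves ok:
console.log(resolveWaitStatus(["start", "end"], true)); // "ok"
// Without the premature "end", the corrective path catches the failure:
console.log(resolveWaitStatus(["start"], true)); // "error"
```

Under these assumptions the fix would need either the inner emitters to stop reporting failed runs as `end`, or the corrective path to run regardless of `lifecycleEnded`.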
- Net status at end of this pass:
- subagent persistence/announcement fix: live-verified
- raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` passed regression (`src/auto-reply/reply/commands.test.ts`, 44 tests) but were intentionally left out of scope for this focused reliability pass.
- Will also explicitly requested that zap keep a light watch on active subagents and check whether they look stuck, rather than assuming they are fine until completion.
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.