From 95135eb5f13089eb746199e43ae93750547bfa4b Mon Sep 17 00:00:00 2001
From: zap <zap@local>
Date: Fri, 13 Mar 2026 18:12:11 +0000
Subject: [PATCH] docs(subagents): record live gpt-5.4 failure verification

---
 HANDOFF.md                  | 19 +++++++++++--------
 WIP.subagent-reliability.md | 20 ++++++++++++++++++--
 memory/2026-03-13.md        | 14 +++++++++++++-
 memory/tasks.json           |  3 ++-
 4 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/HANDOFF.md b/HANDOFF.md
index 2f8c97c..3d87bf3 100644
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -4,7 +4,7 @@
 Immediate baton-pass for the next fresh implementation session.
 
 ## Current objective
-Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The current target is to verify the newly landed upstream fix for subagent error/outcome handling and then continue on any remaining real runtime failures.
+Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is now captured; the remaining question is whether lower-level `agent.wait` semantics also need fixing or whether the issue is sufficiently solved at the subagent registry / completion layer.
 
 ## Use these state files first
 1. `WIP.subagent-reliability.md` — canonical state for this pass
@@ -27,15 +27,18 @@ Investigate and improve subagent / ACP delegation reliability with evidence-firs
 - An upstream patch for that error/outcome handling now exists in `external/openclaw-upstream` on branch `fix/subagent-wait-error-outcome` with targeted tests passing.
 
 ## Highest-priority next actions
-1. The success side is now verified on a real fresh `gpt-5.4` subagent run.
-2. Find and execute the smallest safe controlled-failure repro on a valid model/runtime (`gpt-5.4` preferred) so we can confirm:
-   - a failing child run is stored as `error` rather than `ok`
-   - a successful child run stores a useful frozen result / announcement payload
-3. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent reporting bug.
-4. If another core bug appears, continue in `external/openclaw-upstream/` on a focused branch with targeted validation.
+1. Treat the live `gpt-5.4` failure repro as proven for subagent persistence/announcement handling:
+   - run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
+   - child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
+   - `runs.json` now stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
+2. Decide whether raw gateway `agent.wait` should also report `status: "error"` for terminal assistant errors. Current live evidence for the same failed child:
+   - `agent.wait` returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}`
+3. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent outcome bug.
+4. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
 5. Update WIP + memory + tasks before ending.
 
 ## Success criteria
-- Real-run verification of the new error/outcome fix.
+- Real-run verification of the new error/outcome fix. ✅ done for subagent persistence/announcement handling.
 - Clear separation between resolved reporting bug(s) and any still-open ACP/runtime failures.
+- Explicit decision on whether raw `agent.wait` behavior is acceptable or requires a follow-up fix.
 - State files updated with paths, commands, and outcomes.
diff --git a/WIP.subagent-reliability.md b/WIP.subagent-reliability.md
index e5919a1..8e32230 100644
--- a/WIP.subagent-reliability.md
+++ b/WIP.subagent-reliability.md
@@ -61,8 +61,24 @@ This is the highest-leverage remaining open reliability item because it affects
   - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
 - A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
   - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
-- Remaining gap: we still need a controlled failure-path verification on a valid model/runtime so we can confirm failed child runs persist/announce as `error` rather than fake `ok`.
-- Next step: continue in a fresh `gpt-5.4` subagent session, find the smallest safe controlled-failure repro that does not depend on unavailable GLM-5 access, run it, and update WIP/HANDOFF with exact evidence.
+- Live failure-path verification on a valid working model/runtime is now complete on `gpt-5.4`.
+  - spawned child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
+  - requester session: `agent:main:subagent-reliability-failure-hex-1773425126098`
+  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
+  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
+  - terminal child assistant message (transcript line 6) recorded:
+    - `provider: "openai-codex"`
+    - `model: "gpt-5.4"`
+    - `stopReason: "error"`
+    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
+  - matching `~/.openclaw/subagents/runs.json` record now correctly persisted:
+    - `outcome.status: "error"`
+    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
+    - `endedReason: "subagent-error"`
+    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
+- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
+- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out of scope for this reliability pass: the new tool-only log behavior has no dedicated coverage, and including it would muddy the focused branch.
+- Next step if continuing core work: decide whether `agent.wait` itself should report terminal assistant errors as `status: "error"`, or whether the current contract is acceptable now that subagent registry persistence/announcements are fixed.
 
 ## Constraints
 - Prefer evidence over theory.
diff --git a/memory/2026-03-13.md b/memory/2026-03-13.md
index 9105d49..dca8a31 100644
--- a/memory/2026-03-13.md
+++ b/memory/2026-03-13.md
@@ -23,6 +23,18 @@
 - Targeted validation passed:
   - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
   - result: `50 tests` passed across `3` files
-- Follow-up still needed: rerun a real delegated subagent using a known-working model entitlement (`gpt-5.4` preferred for now) to verify successful runs leave a useful frozen result and failed runs now persist as `error`.
+- Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`.
+- Real failure-path verification later also passed on a valid `gpt-5.4` runtime by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload.
+  - child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
+  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
+  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
+  - transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"`
+  - matching `~/.openclaw/subagents/runs.json` now correctly stored:
+    - `outcome.status: "error"`
+    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
+    - `endedReason: "subagent-error"`
+    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
+- Important remaining nuance: raw gateway `agent.wait` for that same failed child still returned `status:"ok"` with only `endedAt`, so the current fix is verified for subagent outcome persistence/announcements but not for lower-level `agent.wait` semantics.
+- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out of scope for this focused reliability pass.
 - Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
 - Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
diff --git a/memory/tasks.json b/memory/tasks.json
index 1f618f5..33807ab 100644
--- a/memory/tasks.json
+++ b/memory/tasks.json
@@ -27,7 +27,8 @@
       "Patch timestamp: 2026-03-04T22:31:50Z",
       "Upstream patch committed in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547 (suppress internal runtime completion context blocks in TUI formatter).",
       "Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z",
-      "2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files). Still needs live rerun with a known-working model such as gpt-5.4."
+      "2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files).",
+      "2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok."
     ]
   },
   {