docs(subagents): record live gpt-5.4 failure verification

2026-03-13 18:12:11 +00:00
parent 08c1981faa
commit 95135eb5f1
4 changed files with 44 additions and 12 deletions
--- a/WIP.subagent-reliability.md
+++ b/WIP.subagent-reliability.md
@@ -61,8 +61,24 @@ This is the highest-leverage remaining open reliability item because it affects
  - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
 - A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
  - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
- Remaining gap: we still need a controlled failure-path verification on a valid model/runtime so we can confirm failed child runs persist/announce as `error` rather than fake `ok`.
- Next step: continue in a fresh `gpt-5.4` subagent session, find the smallest safe controlled-failure repro that does not depend on unavailable GLM-5 access, run it, and update WIP/HANDOFF with exact evidence.
+- Live failure-path verification on a valid working model/runtime is now complete on `gpt-5.4`.
+  - spawned child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
+  - requester session: `agent:main:subagent-reliability-failure-hex-1773425126098`
+  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
+  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
+  - terminal child assistant message (transcript line 6) recorded:
+    - `provider: "openai-codex"`
+    - `model: "gpt-5.4"`
+    - `stopReason: "error"`
+    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
+  - matching `~/.openclaw/subagents/runs.json` record now correctly persisted:
+    - `outcome.status: "error"`
+    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
+    - `endedReason: "subagent-error"`
+    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
+- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
+- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`). It is still out of scope for this reliability pass: the new tool-only log behavior has no dedicated coverage, and including it would muddy the focused branch.
+- Next step if continuing core work: decide whether `agent.wait` itself should report terminal assistant errors as `status: "error"`, or whether the current contract is acceptable now that subagent registry persistence/announcements are fixed.

 ## Constraints
 - Prefer evidence over theory.