docs(wip): record success probe and next failure-path pass

2026-03-13 16:40:06 +00:00
parent 5dbbc30834
commit 08c1981faa
2 changed files with 26 additions and 11 deletions
--- a/WIP.subagent-reliability.md
+++ b/WIP.subagent-reliability.md
@@ -49,7 +49,20 @@ This is the highest-leverage remaining open reliability item because it affects
  - `pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`50 tests` across `3` files).
 - E2E-style `subagent-announce.format.e2e.test.ts` coverage was updated but the normal Vitest include rules exclude `*.e2e.test.ts`; direct `pnpm test -- --run ...e2e...` confirms exclusion rather than executing that file.
- Next step after this patch: rerun a real subagent with a known-working model (`gpt-5.4` or another actually entitled model) and confirm `runs.json` stores `error` on terminal assistant failure and a useful frozen result on success.
+- Tried to take over live verification directly in the main session on 2026-03-13:
+  - confirmed upstream branch `fix/subagent-wait-error-outcome` is present with commit `2a2ed0d6f`
+  - confirmed normal packaged gateway was healthy before attempting runtime verification
+  - first direct hot-swap attempt was interrupted while stopping the gateway; systemd restored the packaged gateway cleanly
+  - no patched upstream gateway was left running after that attempt
+- Current state: the upstream patch is committed and the targeted tests pass.
+- Real subagent success verification now completed on `gpt-5.4`:
+  - run id: `23750d80-b481-4f50-b219-cc9245be405f`
+  - child session: `agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509`
+  - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
+- A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
+  - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
+- Remaining gap: we still need a controlled failure-path verification on a valid model/runtime so we can confirm that failed child runs are persisted and announced as `error` rather than a false `ok`.
+- Next step: continue in a fresh `gpt-5.4` subagent session, find the smallest safe controlled-failure repro that does not depend on unavailable GLM-5 access, run it, and update WIP/HANDOFF with exact evidence.

 ## Constraints
 - Prefer evidence over theory.