swarm-zap/WIP.subagent-reliability.md

# WIP.subagent-reliability.md

## Status
Status: `open`
Owner: `zap`
Opened: `2026-03-13`

## Purpose
Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

## Why now
This is the highest-leverage remaining open reliability item because it affects trust in delegation and the usability of fresh implementation runs.

## Related tasks
- `task-20260304-2215-subagent-reliability` — in progress
- `task-20260304-211216-acp-claude-codex` — open

## Known context
- Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
- Upstream patch exists in `external/openclaw-upstream` on branch `fix/tui-hide-internal-runtime-context` commit `0f66a4547`.
- User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
- Fresh-session implementation discipline and monitoring thresholds were already documented locally.

## Goals for this pass
1. Establish the current failure modes with concrete evidence.
2. Separate ACP-specific failures from generic subagent/session issues.
3. Determine what is already fixed versus still broken.
4. Produce a concrete recommendation and, if feasible in one pass, implement the highest-confidence fix.
5. Update task/memory state with evidence before ending.

## Suggested investigation plan
1. Review current OpenClaw docs and local memory around subagent/ACP failures.
2. Reproduce or inspect recent failures using session/task evidence instead of guessing.
3. Check current runtime status / relevant logs / known local patches.
4. If the issue is in OpenClaw core, work in `external/openclaw-upstream/` on a focused branch.
5. Validate with the smallest reliable reproduction possible.

## Evidence gathered so far
- Fresh subagent run failed immediately when an explicit `glm-5` choice resolved into the Z.AI provider path before any useful task execution.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified that Z.AI auth does exist, but this account is not entitled for `glm-5`.
- Root cause for this immediate repro is therefore best described as a provider/model entitlement mismatch caused by the explicit spawn model choice, not missing auth propagation between agents.
- A later "corrected" run using `litellm/glm-5` also did not succeed: child transcript `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` contains repeated assistant `stopReason:"error"` entries with `429 ... subscription plan does not yet include access to GLM-5`, while `~/.openclaw/subagents/runs.json` recorded that run (`776a8b51-6fdc-448e-83bc-55418814a05b`) as `outcome.status: "ok"` with `frozenResultText: null`.
- This separates the problems:
  - ACP/operator/model-selection issue: explicit `glm-5` → `zai/glm-5` without auth (already understood).
  - Generic subagent completion/reporting issue: terminal assistant errors can still be stored/announced as successful completion with no frozen result.
- Implemented upstream patch on branch `fix/subagent-wait-error-outcome` in `external/openclaw-upstream` so subagent completion paths inspect the latest assistant terminal message and treat terminal assistant errors as `outcome.status: "error"` rather than `ok`.
- Validation completed for targeted non-E2E coverage:
  - `pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`50 tests` across `3` files).
- E2E-style `subagent-announce.format.e2e.test.ts` coverage was updated but the normal Vitest include rules exclude `*.e2e.test.ts`; direct `pnpm test -- --run ...e2e...` confirms exclusion rather than executing that file.
- Tried to take over live verification directly in the main session on 2026-03-13:
  - confirmed upstream branch `fix/subagent-wait-error-outcome` is present with commit `2a2ed0d6f`
  - confirmed normal packaged gateway was healthy before attempting runtime verification
  - first direct hot-swap attempt was interrupted at gateway stop time; systemd restored the packaged gateway cleanly
  - no patched upstream gateway was left running after that attempt
- Current state: upstream patch + targeted tests are real.
- Real subagent success verification now completed on `gpt-5.4`:
  - run id: `23750d80-b481-4f50-b219-cc9245be405f`
  - child session: `agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509`
  - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
- A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
  - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
- Live failure-path verification on a valid working model/runtime is now complete on `gpt-5.4`.
  - spawned child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - requester session: `agent:main:subagent-reliability-failure-hex-1773425126098`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - terminal child assistant message (transcript line 6) recorded:
    - `provider: "openai-codex"`
    - `model: "gpt-5.4"`
    - `stopReason: "error"`
    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - matching `~/.openclaw/subagents/runs.json` record now correctly persisted:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
- Follow-up code inspection on 2026-03-13 found that the `agent.wait` mismatch is a real upstream bug, not intentional layering:
  - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` already treats terminal assistant `stopReason:"error"` as lifecycle `phase:"error"`.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also interprets resolved agent RPC payloads with `result.meta.stopReason:"error"` as terminal `status:"error"` (and `aborted:true` as `timeout`).
  - but `src/commands/agent.ts` still had a fallback path that unconditionally emitted lifecycle `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`.
  - because `waitForAgentJob` gives lifecycle errors a retry grace window, that fallback `end` could overwrite the earlier failed state and make raw `agent.wait` resolve `status:"ok"` for a terminal assistant/provider error.
- Implemented the smallest focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` (with extracted terminal error text) when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired.
  - `src/commands/agent.test.ts` adds coverage for that fallback path.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` + `agent-wait-dedupe.test.ts` cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to `error`/`timeout` instead of staying `ok`.
- Targeted validation for this follow-up passed:
  - `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`81 tests` across `3` files).
- Remaining open item: no second live hot-swap/runtime repro was attempted in this pass, so the new `agent.wait` fix is validated by exact code-path inspection plus focused tests, not yet by another live gateway run.
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.

## Constraints
- Prefer evidence over theory.
- Do not claim a fix without concrete validation.
- Keep the main session clean; use this file as the canonical baton.

## Success criteria
- Clear diagnosis of the current reliability problem(s).
- At least one of:
  - implemented fix with validation, or
  - sharply scoped next fix plan with exact evidence and files.
- `memory/2026-03-13.md` (or current daily note), `memory/tasks.json`, and this WIP updated.