WIP.subagent-reliability.md

Status

- Status: open
- Owner: zap
- Opened: 2026-03-13

Purpose

Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

Why now

This is the highest-leverage remaining open reliability item because it affects trust in delegation and the usability of fresh implementation runs.

task-20260304-2215-subagent-reliability — in progress
task-20260304-211216-acp-claude-codex — open

Known context

Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
Upstream patch exists in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547.
User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
Fresh-session implementation discipline and monitoring thresholds were already documented locally.

Goals for this pass

Establish the current failure modes with concrete evidence.
Separate ACP-specific failures from generic subagent/session issues.
Determine what is already fixed versus still broken.
Produce a concrete recommendation and, if feasible in one pass, implement the highest-confidence fix.
Update task/memory state with evidence before ending.

Suggested investigation plan

Review current OpenClaw docs and local memory around subagent/ACP failures.
Reproduce or inspect recent failures using session/task evidence instead of guessing.
Check current runtime status / relevant logs / known local patches.
If the issue is in OpenClaw core, work in external/openclaw-upstream/ on a focused branch.
Validate with the smallest reliable reproduction possible.

Evidence gathered so far

Fresh subagent run failed immediately, before any useful task execution, when an explicit glm-5 choice resolved to the Z.AI provider path.
Current installed agent auth profile keys inspected in agent stores include openai-codex:default, litellm:default, and github-copilot:github.
Will clarified that Z.AI auth does exist, but this account is not entitled for glm-5.
Root cause for this immediate repro is therefore best described as a provider/model entitlement mismatch caused by the explicit spawn model choice, not missing auth propagation between agents.
A later "corrected" run using litellm/glm-5 also failed. The child transcript ~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl contains repeated assistant stopReason:"error" entries with 429 ... subscription plan does not yet include access to GLM-5, yet ~/.openclaw/subagents/runs.json recorded that run (776a8b51-6fdc-448e-83bc-55418814a05b) as outcome.status: "ok" with frozenResultText: null.
This separates the problems:
- ACP/operator/model-selection issue: explicit glm-5 → zai/glm-5 on an account that has Z.AI auth but no glm-5 entitlement (already understood).
- Generic subagent completion/reporting issue: terminal assistant errors can still be stored/announced as successful completion with no frozen result.
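The generic reporting mismatch above can be checked mechanically. The following is a minimal sketch, assuming each transcript line is a flat JSON object with a `role` field; only `stopReason`, `errorMessage`, `outcome.status`, and `frozenResultText` are field names taken from these notes, and every function and type name is hypothetical rather than part of OpenClaw.

```typescript
// Assumed shapes: the flat entry layout and `role` are guesses, not the real schema.
interface TranscriptEntry {
  role?: string;
  stopReason?: string;
  errorMessage?: string;
}

interface RunRecord {
  runId: string;
  outcome: { status: "ok" | "error" | "timeout"; error?: string };
  frozenResultText: string | null;
}

// Parse JSONL text, skipping blank or malformed lines.
function parseTranscript(jsonl: string): TranscriptEntry[] {
  const entries: TranscriptEntry[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      entries.push(JSON.parse(trimmed) as TranscriptEntry);
    } catch {
      // Ignore lines that are not valid JSON.
    }
  }
  return entries;
}

// Return the last assistant entry, or undefined if there is none.
function lastAssistant(entries: TranscriptEntry[]): TranscriptEntry | undefined {
  for (let i = entries.length - 1; i >= 0; i--) {
    const entry = entries[i];
    if (entry && entry.role === "assistant") return entry;
  }
  return undefined;
}

// Flag a run whose transcript ended in an assistant error but whose record says "ok".
function findMisreportedRun(record: RunRecord, jsonl: string): string | null {
  const last = lastAssistant(parseTranscript(jsonl));
  if (last !== undefined && last.stopReason === "error" && record.outcome.status === "ok") {
    const detail = last.errorMessage ?? "(no message)";
    return `run ${record.runId}: recorded ok, transcript ended with error: ${detail}`;
  }
  return null;
}
```

A check like this only compares the two artifacts; it says nothing about which layer produced the wrong status.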
Implemented upstream patch on branch fix/subagent-wait-error-outcome in external/openclaw-upstream so subagent completion paths inspect the latest assistant terminal message and treat terminal assistant errors as outcome.status: "error" rather than ok.
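As a rough illustration of the behavior the patch aims for (not the patch itself), the sketch below derives a persisted outcome from the terminal assistant message. The `endedReason: "subagent-error"` value and the practice of storing the error text in `frozenResultText` mirror the persisted record quoted later in these notes; the `text` field, the fallback error string, and the function name are assumptions made for the example.

```typescript
// Hypothetical shapes; field names mirror the notes, the function is illustrative only.
interface TerminalAssistantMessage {
  stopReason?: string;
  errorMessage?: string;
  text?: string; // assumed name for the final assistant text
}

interface SubagentOutcome {
  outcome: { status: "ok" | "error"; error?: string };
  endedReason?: string;
  frozenResultText: string | null;
}

function deriveSubagentOutcome(terminal: TerminalAssistantMessage | undefined): SubagentOutcome {
  if (terminal !== undefined && terminal.stopReason === "error") {
    // A terminal assistant error must not be reported as a successful completion.
    const error = terminal.errorMessage ?? "unknown assistant error";
    return {
      outcome: { status: "error", error },
      endedReason: "subagent-error",
      frozenResultText: error,
    };
  }
  // Assumption: a missing or non-error terminal message is treated as success.
  const text = terminal !== undefined && terminal.text !== undefined ? terminal.text : null;
  return { outcome: { status: "ok" }, frozenResultText: text };
}
```

The key design point is that the status comes from the last assistant message rather than from the mere fact that the run ended.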
Validation completed for targeted non-E2E coverage:
- pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts
- result: passed (50 tests across 3 files).
E2E-style subagent-announce.format.e2e.test.ts coverage was updated, but the normal Vitest include rules exclude *.e2e.test.ts, so that file has not been executed; a direct pnpm test -- --run ...e2e... invocation confirmed the exclusion instead of running the file.
Tried to take over live verification directly in the main session on 2026-03-13:
- confirmed upstream branch fix/subagent-wait-error-outcome is present with commit 2a2ed0d6f
- confirmed normal packaged gateway was healthy before attempting runtime verification
- first direct hot-swap attempt was interrupted at gateway stop time; systemd restored the packaged gateway cleanly
- no patched upstream gateway was left running after that attempt
State after that attempt: the upstream patch and its targeted tests exist and pass, but live runtime verification was still pending.
Real subagent success verification now completed on gpt-5.4:
- run id: 23750d80-b481-4f50-b219-cc9245be405f
- child session: agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509
- result: successful completion with a real final child result (SUCCESS-PROBE-OK)
A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
- killed/failed run id: 4965775c-4764-41e9-a77a-692f1ab4c2fd
Live failure-path verification on a valid working model/runtime is now complete on gpt-5.4:
- spawned child run: b50cb91f-6219-44f7-9d2f-a1264ac7ceaf
- requester session: agent:main:subagent-reliability-failure-hex-1773425126098
- child session: agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594
- child transcript: ~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl
- terminal child assistant message (transcript line 6) recorded:
  - provider: "openai-codex"
  - model: "gpt-5.4"
  - stopReason: "error"
  - errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"
- matching ~/.openclaw/subagents/runs.json record now correctly persisted:
  - outcome.status: "error"
  - outcome.error: "Codex error: {...context_length_exceeded...}"
  - endedReason: "subagent-error"
  - frozenResultText: "Codex error: {...context_length_exceeded...}"
Important nuance from the same live repro: raw gateway agent.wait still returned {"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881} for that failed child. So the current fix is verified for persisted/announced subagent outcomes, but not for the lower-level agent.wait RPC semantics.
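The disagreement between the RPC result and the persisted record can be expressed as a small consistency check. This is a sketch only: the `WaitResult` fields match the JSON quoted above, while the `PersistedRun` shape is reduced to the one field needed here and the function name is hypothetical.

```typescript
interface WaitResult {
  runId: string;
  status: string;
  endedAt?: number;
}

interface PersistedRun {
  runId: string;
  outcome: { status: string };
}

// True when both objects describe the same run but report different terminal statuses.
function waitDisagreesWithRecord(wait: WaitResult, run: PersistedRun): boolean {
  return wait.runId === run.runId && wait.status !== run.outcome.status;
}

// Values taken from the live repro described above.
const observedWait: WaitResult = {
  runId: "b50cb91f-6219-44f7-9d2f-a1264ac7ceaf",
  status: "ok",
  endedAt: 1773425130881,
};
const observedRecord: PersistedRun = {
  runId: "b50cb91f-6219-44f7-9d2f-a1264ac7ceaf",
  outcome: { status: "error" },
};
const mismatchObserved = waitDisagreesWithRecord(observedWait, observedRecord); // true
```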
Follow-up code inspection on 2026-03-13 found that the agent.wait mismatch is a real upstream bug, not intentional layering:
- src/agents/pi-embedded-subscribe.handlers.lifecycle.ts already treats terminal assistant stopReason:"error" as lifecycle phase:"error".
- src/gateway/server-methods/agent-wait-dedupe.ts now also interprets resolved agent RPC payloads with result.meta.stopReason:"error" as terminal status:"error" (and aborted:true as timeout).
- but src/commands/agent.ts still had a fallback path that unconditionally emitted lifecycle phase:"end" whenever no inner lifecycle callback was observed, even if the resolved run result carried meta.stopReason:"error".
- because waitForAgentJob gives lifecycle errors a retry grace window, that fallback end could overwrite the earlier failed state and make raw agent.wait resolve status:"ok" for a terminal assistant/provider error.
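The overwrite described in the list above can be reproduced with a toy model. This is a deliberately simplified sketch, not the real waitForAgentJob: it assumes an error is held provisionally until a grace window elapses and that a later "end" event settles the wait as ok. All names are hypothetical.

```typescript
type Phase = "start" | "end" | "error";
type WaitStatus = "pending" | "ok" | "error";

// Toy tracker: an error is held as provisional so that a retry can supersede it.
class ToyWaitTracker {
  private provisionalError: string | null = null;
  private settled: WaitStatus = "pending";

  onLifecycle(phase: Phase, error?: string): void {
    if (this.settled !== "pending") return;
    if (phase === "error") {
      // Held during the grace window instead of settling immediately.
      this.provisionalError = error ?? "error";
    } else if (phase === "end") {
      // A later "end" discards the held error and settles as ok.
      this.provisionalError = null;
      this.settled = "ok";
    } else {
      // "start": a retry began, so the held error no longer applies.
      this.provisionalError = null;
    }
  }

  // Called when the grace window elapses without a superseding event.
  onGraceElapsed(): void {
    if (this.settled === "pending" && this.provisionalError !== null) {
      this.settled = "error";
    }
  }

  getStatus(): WaitStatus {
    return this.settled;
  }
}

// Feed a sequence of lifecycle events, then let the grace window elapse.
function replay(events: Array<{ phase: Phase; error?: string }>): WaitStatus {
  const tracker = new ToyWaitTracker();
  for (const event of events) tracker.onLifecycle(event.phase, event.error);
  tracker.onGraceElapsed();
  return tracker.getStatus();
}

// Inner error followed by the unconditional fallback "end": the failure is lost.
const buggySequence = replay([{ phase: "error", error: "context_length_exceeded" }, { phase: "end" }]); // "ok"
// Inner error followed by a fallback "error": the failure survives.
const fixedSequence = replay([{ phase: "error", error: "context_length_exceeded" }, { phase: "error" }]); // "error"
```

Under these assumptions the grace window itself is harmless; the problem is the unconditional "end" arriving inside it.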
Implemented the smallest focused upstream fix on branch fix/subagent-wait-error-outcome:
- src/commands/agent.ts now emits lifecycle phase:"error" (with extracted terminal error text) when a resolved run stops with meta.stopReason:"error" and no inner lifecycle callback fired.
- src/commands/agent.test.ts adds coverage for that fallback path.
- src/gateway/server-methods/agent-wait-dedupe.ts + agent-wait-dedupe.test.ts cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to error/timeout instead of staying ok.
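The two mappings described in the list above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code on the branch: the notes confirm `meta.stopReason:"error"` and `aborted:true`, but placing `aborted` and `errorMessage` inside `meta`, the precedence of timeout over error, and all function names are guesses.

```typescript
// Assumed payload shape; only meta.stopReason and an aborted flag are named in the notes.
interface RunResultMeta {
  stopReason?: string;
  aborted?: boolean;
  errorMessage?: string;
}

interface RunResult {
  meta?: RunResultMeta;
}

type LifecycleEvent = { phase: "end" } | { phase: "error"; error: string };

// Fallback emitted only when no inner lifecycle callback fired for the run.
function fallbackLifecycleEvent(result: RunResult): LifecycleEvent {
  const meta = result.meta;
  if (meta !== undefined && meta.stopReason === "error") {
    return { phase: "error", error: meta.errorMessage ?? "terminal assistant error" };
  }
  return { phase: "end" };
}

type TerminalStatus = "ok" | "error" | "timeout";

// Map a resolved agent RPC payload to a terminal wait status.
function terminalStatusFromPayload(result: RunResult): TerminalStatus {
  const meta = result.meta;
  if (meta !== undefined && meta.aborted === true) return "timeout"; // assumed precedence
  if (meta !== undefined && meta.stopReason === "error") return "error";
  return "ok";
}
```

Keeping both mappings keyed on the same `meta` fields means the lifecycle path and the dedupe snapshot path cannot disagree about whether a run failed.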
Targeted validation for this follow-up passed:
- pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts
- result: passed (81 tests across 3 files).
Remaining open item: no second live hot-swap/runtime repro was attempted in this pass, so the new agent.wait fix is validated by exact code-path inspection plus focused tests, not yet by another live gateway run.
Side assessment on unrelated dirty upstream work: the /subagents log UX diff in src/auto-reply/reply/commands-subagents/action-log.ts + shared.ts is logically coherent and passed pnpm test -- --run src/auto-reply/reply/commands.test.ts (44 tests), but it is still out-of-scope for this reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.

Constraints

Prefer evidence over theory.
Do not claim a fix without concrete validation.
Keep the main session clean; use this file as the canonical baton.

Success criteria

Clear diagnosis of the current reliability problem(s).
At least one of:
- implemented fix with validation, or
- sharply scoped next fix plan with exact evidence and files.
memory/2026-03-13.md (or current daily note), memory/tasks.json, and this WIP updated.
