# WIP.subagent-reliability.md

## Status

- Status: `follow-up`
- Owner: `zap`
- Opened: `2026-03-13`
- Last updated: `2026-03-13`

## Purpose

Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

## Current state

- The core reliability thread tracked in this WIP is now **fixed and live-verified** on `external/openclaw-upstream` branch `fix/subagent-wait-error-outcome`.
- Verified fixed:
  - subagent persistence / announcement handling for terminal assistant-provider failures
  - raw `agent.wait` semantics for the live direct gateway path
- Key upstream commits on this branch:
  - `2a2ed0d6f`: `fix(subagents): derive outcome from terminal assistant errors`
  - `5a328d22b`: `fix(agent): surface terminal run errors in wait semantics`
  - `f9a78e8f7`: `fix(gateway): honor terminal assistant errors in live wait path`
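
The first verified fix above can be illustrated with a minimal sketch. It assumes a simplified transcript shape: `TranscriptMessage`, `SubagentOutcome`, and `deriveOutcome` are hypothetical names chosen for illustration, and only the field names (`stopReason`, `errorMessage`, `status`, `frozenResultText`) come from the evidence recorded in this file. The upstream implementation may be structured quite differently.

```typescript
// Hypothetical shapes: field names mirror the transcript and runs.json
// entries quoted in this file, but the real upstream types may differ.
interface TranscriptMessage {
  role: "assistant" | "user" | "tool";
  stopReason?: string;
  errorMessage?: string;
  text?: string;
}

interface SubagentOutcome {
  status: "ok" | "error";
  error?: string;
  frozenResultText: string | null;
}

// Look at the latest assistant message and let a terminal error win over "ok".
function deriveOutcome(transcript: TranscriptMessage[]): SubagentOutcome {
  const last = [...transcript].reverse().find((m) => m.role === "assistant");
  if (last?.stopReason === "error") {
    const error = last.errorMessage ?? "unknown assistant error";
    // Assumption: the error text doubles as the frozen result, as in the
    // runs.json record captured during live failure-path verification.
    return { status: "error", error, frozenResultText: error };
  }
  return { status: "ok", frozenResultText: last?.text ?? null };
}

const failed = deriveOutcome([
  { role: "user", text: "probe" },
  { role: "assistant", stopReason: "error", errorMessage: "context_length_exceeded" },
]);
console.log(failed.status, failed.frozenResultText);
```

The point of the sketch is the ordering: the terminal assistant message is inspected before any success outcome is recorded, so a run cannot be stored as `ok` with a `null` frozen result when its last assistant message is an error.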

## Why this file is still open

- The broader delegation reliability task is not fully done yet.
- Remaining follow-up work is now narrower:
  1. ACP-specific Claude/Codex runtime failures / final live OpenClaw ACP validation
  2. optional separate `/subagents log` UX cleanup
  3. push/PR the focused upstream reliability branch when desired

## Related tasks

- `task-20260304-2215-subagent-reliability` — in progress
- `task-20260304-211216-acp-claude-codex` — open

## Known context

- Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
- Upstream patch exists in `external/openclaw-upstream` on branch `fix/tui-hide-internal-runtime-context` commit `0f66a4547`.
- User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
- Fresh-session implementation discipline and monitoring thresholds were already documented locally.

## Immediate baton

- Do **not** reopen the solved `agent.wait` investigation unless a fresh repro appears.
- If this project is resumed next, start with **real OpenClaw ACP-path validation** of the new acpx JSON-RPC error handling (or capture a fresh Claude/Codex end-to-end repro if ACP still is not configured here).
- Treat the historical `acpx exited with code 1/5` note as unresolved-but-unreproduced; do not spend more time on it without fresh evidence.
- Treat `/subagents log` UX edits as a separate branch/pass so they do not muddy the reliability fix branch.
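
For reference, what would count as a "fresh repro" of the solved `agent.wait` problem can be sketched as a small status rule. This is a hedged illustration only: `ResolvedMeta`, `deriveWaitStatus`, and `isFreshRepro` are hypothetical names, the precedence of `aborted` over `stopReason` is an assumption, and only the values (`phase:"end"`, `phase:"error"`, `stopReason:"error"`, `aborted:true`, and the `ok`/`error`/`timeout` statuses) are taken from the evidence in this file.

```typescript
// Hypothetical types; the names are illustrative, not the upstream API.
type TerminalPhase = "end" | "error";
type WaitStatus = "ok" | "error" | "timeout";

interface ResolvedMeta {
  stopReason?: string;
  aborted?: boolean;
}

// A lifecycle "end" alone is not trusted: resolved result metadata that
// reports a terminal error (or an abort) overrides it.
function deriveWaitStatus(lastPhase: TerminalPhase, meta?: ResolvedMeta): WaitStatus {
  if (meta?.aborted === true) return "timeout"; // assumed precedence
  if (meta?.stopReason === "error" || lastPhase === "error") return "error";
  return "ok";
}

// A fresh repro is a run whose transcript ended in an assistant error
// while agent.wait still reported "ok".
function isFreshRepro(transcriptStopReason: string, waitStatus: WaitStatus): boolean {
  return transcriptStopReason === "error" && waitStatus === "ok";
}

// Shape of the original bug: fallback "end" plus stopReason "error".
console.log(deriveWaitStatus("end", { stopReason: "error" }));
```

Under this rule, the pre-fix live result recorded below (transcript `stopReason: "error"` with wait `status:"ok"`) is a repro, and the post-fix result (`status:"error"`) is not.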

## Evidence gathered so far

- Fresh subagent run failed immediately when an explicit `glm-5` choice resolved into the Z.AI provider path before any useful task execution.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified that Z.AI auth does exist, but this account is not entitled for `glm-5`.
- Root cause for this immediate repro is therefore best described as a provider/model entitlement mismatch caused by the explicit spawn model choice, not missing auth propagation between agents.
- A later "corrected" run using `litellm/glm-5` also did not succeed: child transcript `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` contains repeated assistant `stopReason:"error"` entries with `429 ... subscription plan does not yet include access to GLM-5`, while `~/.openclaw/subagents/runs.json` recorded that run (`776a8b51-6fdc-448e-83bc-55418814a05b`) as `outcome.status: "ok"` with `frozenResultText: null`.
- This separates the problems:
  - ACP/operator/model-selection issue: explicit `glm-5` → `zai/glm-5`, for which this account has Z.AI auth but no `glm-5` entitlement (already understood).
  - Generic subagent completion/reporting issue: terminal assistant errors can still be stored/announced as successful completion with no frozen result.
- Implemented upstream patch on branch `fix/subagent-wait-error-outcome` in `external/openclaw-upstream` so subagent completion paths inspect the latest assistant terminal message and treat terminal assistant errors as `outcome.status: "error"` rather than `ok`.
- Validation completed for targeted non-E2E coverage:
  - `pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`50 tests` across `3` files).
- E2E-style `subagent-announce.format.e2e.test.ts` coverage was updated but the normal Vitest include rules exclude `*.e2e.test.ts`; direct `pnpm test -- --run ...e2e...` confirms exclusion rather than executing that file.
- Tried to take over live verification directly in the main session on 2026-03-13:
  - confirmed upstream branch `fix/subagent-wait-error-outcome` is present with commit `2a2ed0d6f`
  - confirmed normal packaged gateway was healthy before attempting runtime verification
  - first direct hot-swap attempt was interrupted at gateway stop time; systemd restored the packaged gateway cleanly
  - no patched upstream gateway was left running after that attempt
- State at that point: the upstream patch and its targeted tests existed and passed, but live runtime verification was still outstanding.
- Real subagent success verification now completed on `gpt-5.4`:
  - run id: `23750d80-b481-4f50-b219-cc9245be405f`
  - child session: `agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509`
  - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
- A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
  - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
- Live failure-path verification on a valid working model/runtime is now complete on `gpt-5.4`.
  - spawned child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - requester session: `agent:main:subagent-reliability-failure-hex-1773425126098`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - terminal child assistant message (transcript line 6) recorded:
    - `provider: "openai-codex"`
    - `model: "gpt-5.4"`
    - `stopReason: "error"`
    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - matching `~/.openclaw/subagents/runs.json` record now correctly persisted:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
- Follow-up code inspection on 2026-03-13 found that the `agent.wait` mismatch is a real upstream bug, not intentional layering:
  - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` already treats terminal assistant `stopReason:"error"` as lifecycle `phase:"error"`.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also interprets resolved agent RPC payloads with `result.meta.stopReason:"error"` as terminal `status:"error"` (and `aborted:true` as `timeout`).
  - but `src/commands/agent.ts` still had a fallback path that unconditionally emitted lifecycle `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`.
  - because `waitForAgentJob` gives lifecycle errors a retry grace window, that fallback `end` could overwrite the earlier failed state and make raw `agent.wait` resolve `status:"ok"` for a terminal assistant/provider error.
- Implemented the smallest focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` (with extracted terminal error text) when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired.
  - `src/commands/agent.test.ts` adds coverage for that fallback path.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` + `agent-wait-dedupe.test.ts` cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to `error`/`timeout` instead of staying `ok`.
- Targeted validation for this follow-up passed:
  - `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`81 tests` across `3` files).
- Follow-up live runtime verification on 2026-03-13 showed the current `agent.wait` fix did **not** close the live path yet.
  - patched gateway launched directly from source on loopback with channels skipped:
    - command: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - log evidence: `2026-03-13T18:52:10.743+00:00 [gateway] agent model: openai-codex/gpt-5.3-codex`
  - live repro used a fresh default-model session and an oversized in-memory payload over `GatewayClient` (not CLI argv):
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - wait result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
  - same session's terminal assistant message still recorded a real provider failure:
    - `provider: "openai-codex"`
    - `model: "gpt-5.3-codex"`
    - `stopReason: "error"`
    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - earlier temporary gateway runs reinforced the same mismatch:
    - stale dist gateway repro run `gwc-live-agent-wait-gpt53-1773427893583` also returned `status:"ok"` while transcript stopReason remained `error`
    - temp `gpt-5.4` session repro on the same temp gateway returned `status:"error"`, but only because that runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; that is useful as transport sanity, but **not** the canonical live semantics proof
- The final focused live-fix pass on 2026-03-13 closed the remaining `agent.wait` bug.
  - root cause confirmed: the live direct gateway path could receive an inner `agent_end` event carrying a terminal assistant error without a preceding `message_end`, which left stale/empty assistant state and still emitted lifecycle `phase:"end"`
  - upstream fix extends the embedded subscribe lifecycle handler to recover the terminal assistant from `agent_end.messages` or the session transcript when state is stale, then emit lifecycle `phase:"error"` with a friendly error string instead of `end`
  - upstream fix also updates the direct gateway `agent` RPC handler to observe lifecycle events for the run and derive the final RPC payload/terminal status from observed lifecycle + resolved result metadata, instead of blindly caching `status:"ok"` when the outer RPC resolves
  - files changed for the final fix:
    - `src/agents/pi-embedded-subscribe.e2e-harness.ts`
    - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts`
    - `src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts`
    - `src/agents/pi-embedded-subscribe.handlers.ts`
    - `src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts`
    - `src/gateway/server-methods/agent.ts`
    - `src/gateway/server-methods/server-methods.test.ts`
- Final targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `108 tests` passed across `5` files
- Final decisive live source-gateway repro after the fix:
  - gateway launch: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured`
  - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
  - session key: `agent:main:subagent:agent-wait-gpt53-live-source-fixed2-1773429512008`
  - final `agent` response with `expectFinal: true` returned:
    - `finalStatus: "error"`
    - `finalSummary: "LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."`
  - matching `agent.wait` returned:
    - `{"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}`
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this focused reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
- ACP follow-up pass on 2026-03-13 found a **new live-reproducible runtime bug** in the bundled `extensions/acpx` layer:
  - current host state does **not** expose a global `acpx` binary on PATH, but the bundled plugin-local runtime exists and works at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` does not contain an explicit `acp` block or enabled `acpx` plugin entry, so this pass used the smallest direct runtime repro path instead of a full `sessions_spawn(runtime:"acp")` OpenClaw run
  - live direct Codex repro now succeeds:
    - command: bundled `acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ending with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - live direct Claude repro does **not** crash, but returns top-level JSON-RPC auth errors and still exits 0:
    - command: bundled `acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process `exit=0`
  - source inspection showed `extensions/acpx/src/runtime-internals/events.ts` ignored that top-level JSON-RPC error shape during prompt streaming, so `runtime.runTurn()` could silently treat Claude auth failure as success (`done`) when no typed `error` event or non-zero exit was emitted
- Implemented the smallest focused upstream runtime fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now maps those JSON-RPC errors into ACP runtime `type:"error"` events instead of dropping them
  - regression coverage added:
    - `extensions/acpx/src/runtime-internals/events.test.ts`: top-level JSON-RPC prompt error parsing
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`: mock prompt path for clean-exit JSON-RPC auth error
    - `extensions/acpx/src/runtime.test.ts`: `runTurn()` emits error and does **not** emit `done` for the Claude-style auth failure shape
- Targeted validation for the ACP follow-up fix passed:
  - `cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `3` files passed, `22` tests passed
- Current interpretation of the old Claude/Codex ACP bug after this pass:
  - historical notes still say `Claude: acpx exited with code 1`, `Codex: acpx exited with code 5`
  - those exact exit-code crashes were **not** reproduced today
  - current live state is narrower and better understood:
    - Codex ACP path works directly
    - Claude ACP path currently fails for auth, and OpenClaw previously mishandled that failure shape in the acpx runtime layer
- Remaining open ACP follow-up after this fix:
  - validate the patched runtime through the real OpenClaw ACP path (`sessions_spawn(runtime:"acp")`) once ACP is explicitly enabled/configured here, or whenever a fresh end-to-end repro is available
  - only reopen the historical `acpx exited with code 1/5` line if a fresh repro appears
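
The behavior that the remaining ACP-path validation needs to confirm can be sketched as follows. This is an illustrative stand-in, not the upstream `toAcpxErrorEvent()` / `parseControlJsonError()` code: `RuntimeErrorEvent`, `parseJsonRpcErrorLine`, and `turnSucceeded` are hypothetical names, and the only grounded inputs are the JSON-RPC lines and the `exit=0` observation recorded above.

```typescript
// Hypothetical helper types and functions for illustration only.
interface RuntimeErrorEvent {
  type: "error";
  code: number | null;
  message: string;
}

// Recognize a top-level JSON-RPC 2.0 error response on a single stdout line.
function parseJsonRpcErrorLine(line: string): RuntimeErrorEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not JSON: leave it to other parsers
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as Record<string, unknown>;
  const error = record.error;
  if (record.jsonrpc !== "2.0" || typeof error !== "object" || error === null) {
    return null;
  }
  const { code, message } = error as Record<string, unknown>;
  return {
    type: "error",
    code: typeof code === "number" ? code : null,
    message: typeof message === "string" ? message : "unknown JSON-RPC error",
  };
}

// A turn only counts as done if no error line was seen, even on exit code 0.
function turnSucceeded(stdoutLines: string[], exitCode: number): boolean {
  const sawError = stdoutLines.some((l) => parseJsonRpcErrorLine(l) !== null);
  return exitCode === 0 && !sawError;
}

const claudeLine =
  '{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}';
console.log(parseJsonRpcErrorLine(claudeLine), turnSucceeded([claudeLine], 0));
```

An end-to-end check through `sessions_spawn(runtime:"acp")` would want to see the same result at the OpenClaw level: a Claude auth failure surfaces as an error outcome even though the `acpx` process exits 0.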

## Constraints

- Prefer evidence over theory.
- Do not claim a fix without concrete validation.
- Keep the main session clean; use this file as the canonical baton.

## Success criteria

- Clear diagnosis of the current reliability problem(s).
- At least one of:
  - implemented fix with validation, or
  - sharply scoped next fix plan with exact evidence and files.
- `memory/2026-03-13.md` (or current daily note), `memory/tasks.json`, and this WIP updated.