Files
swarm-zap/memory/2026-03-13.md

53 lines
6.2 KiB
Markdown

# 2026-03-13
## Subagent reliability investigation
- Fresh implementation subagent launch for subagent/ACP reliability failed immediately before doing any task work.
- Failure mode: delegated run was spawned with model `glm-5`, which resolved to provider model `zai/glm-5`.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for `glm-5`.
- Verified by inspecting agent auth profile keys under:
- `/home/openclaw/.openclaw/agents/*/agent/auth-profiles.json`
- Relevant OpenClaw docs confirm:
- subagent spawns inherit caller model when `sessions_spawn.model` is omitted
- provider/model auth errors like `No API key found for provider "zai"` occur when a provider model is selected without matching auth
- multi-agent auth is per-agent via `~/.openclaw/agents/<agentId>/agent/auth-profiles.json`
- Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
- Corrective action: retry fresh delegation with `litellm/glm-5` (the intended medium-tier routed model for delegated implementation work in this setup).
- Will explicitly requested on 2026-03-13 to use `gpt-5.4` for subagents for now while debugging delegation reliability.
- New evidence from the corrected run: `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` shows repeated assistant `stopReason:"error"` entries with `429 ... GLM-5 not included in current subscription plan`, but `~/.openclaw/subagents/runs.json` recorded run `776a8b51-6fdc-448e-83bc-55418814a05b` as `outcome.status: "ok"` and `frozenResultText: null`.
- That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result.
- Implemented upstream fix on branch `external/openclaw-upstream@fix/subagent-wait-error-outcome`:
- added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
- subagent registry now downgrades `agent.wait => ok` to `error` when the child session's terminal assistant message is actually an error
- subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful `(no output)` completions
- Targeted validation passed:
- `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
- result: `50 tests` passed across `3` files
- Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`.
- Real failure-path verification later also passed on valid `gpt-5.4` by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload.
- child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
- child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
- child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
- transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"`
- matching `~/.openclaw/subagents/runs.json` now correctly stored:
- `outcome.status: "error"`
- `outcome.error: "Codex error: {...context_length_exceeded...}"`
- `endedReason: "subagent-error"`
- `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
- embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
- but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
- `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
- `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
- `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
- Targeted validation passed:
- `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
- result: `81 tests` passed across `3` files
- Live runtime verification of the new `agent.wait` fix was not re-run in this pass; current evidence is exact code-path inspection plus focused tests.
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
- Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.