WIP.subagent-reliability.md

Status

Status: open
Owner: zap
Opened: 2026-03-13

Purpose

Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

Why now

This is the highest-leverage remaining open reliability item because it affects trust in delegation and the usability of fresh implementation runs.

task-20260304-2215-subagent-reliability — in progress
task-20260304-211216-acp-claude-codex — open

Known context

Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
Upstream patch exists in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context, commit 0f66a4547.
User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
Fresh-session implementation discipline and monitoring thresholds were already documented locally.

Goals for this pass

Establish the current failure modes with concrete evidence.
Separate ACP-specific failures from generic subagent/session issues.
Determine what is already fixed versus still broken.
Produce a concrete recommendation and, if feasible in one pass, implement the highest-confidence fix.
Update task/memory state with evidence before ending.

Suggested investigation plan

Review current OpenClaw docs and local memory around subagent/ACP failures.
Reproduce or inspect recent failures using session/task evidence instead of guessing.
Check current runtime status / relevant logs / known local patches.
If the issue is in OpenClaw core, work in external/openclaw-upstream/ on a focused branch.
Validate with the smallest reliable reproduction possible.

Evidence gathered so far

Fresh subagent run failed immediately with provider auth error for zai before any task execution.
Current installed agent auth profiles include openai-codex:default, litellm:default, and github-copilot:github; there is no zai profile configured.
Root cause for this immediate repro appears to be an incorrect explicit spawn model choice (glm-5 alias → zai/glm-5) rather than missing auth propagation between agents.
Next step once the model-selection issue is confirmed: per Will's instruction, prefer gpt-5.4 for fresh subagent reliability/debug passes for now, and keep separating real runtime issues from operator/config mistakes.
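
The evidence above suggests a cheap preflight check before spawning: resolve the requested model alias, then confirm an auth profile exists for the resulting provider. The sketch below is illustrative only and is not OpenClaw's API. `preflight_spawn_model`, `PreflightResult`, and the alias table are hypothetical names, and it assumes auth profile IDs take the form `provider:name`, which is inferred from the profile names listed above. A `config` verdict marks an operator/config mistake rather than a runtime failure.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data mirroring the evidence in this note.
MODEL_ALIASES = {"glm-5": "zai/glm-5"}
AUTH_PROFILES = ["openai-codex:default", "litellm:default", "github-copilot:github"]


@dataclass
class PreflightResult:
    requested: str
    resolved: str
    provider: Optional[str]
    verdict: str  # "ok", "config", or "unknown"
    detail: str


def preflight_spawn_model(requested, aliases, auth_profiles):
    """Classify a spawn model choice before any subagent is launched."""
    # Expand the alias if one exists; otherwise keep the name as given.
    resolved = aliases.get(requested, requested)

    # Without a "provider/model" prefix the provider cannot be determined here.
    if "/" not in resolved:
        return PreflightResult(requested, resolved, None, "unknown",
                               "no provider prefix; cannot check auth")

    provider = resolved.split("/", 1)[0]
    # Assumption: profile IDs look like "provider:name".
    configured = {profile.split(":", 1)[0] for profile in auth_profiles}

    if provider not in configured:
        return PreflightResult(requested, resolved, provider, "config",
                               f"no auth profile for provider '{provider}'")
    return PreflightResult(requested, resolved, provider, "ok", "")


if __name__ == "__main__":
    print(preflight_spawn_model("glm-5", MODEL_ALIASES, AUTH_PROFILES))
```

Running the check on `glm-5` with the data above yields a `config` verdict for provider `zai`, which matches the diagnosis that the immediate failure was a model-selection mistake rather than missing auth propagation between agents.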

Constraints

Prefer evidence over theory.
Do not claim a fix without concrete validation.
Keep the main session clean; use this file as the canonical baton.

Success criteria

Clear diagnosis of the current reliability problem(s).
At least one of:
- implemented fix with validation, or
- sharply scoped next fix plan with exact evidence and files.
memory/2026-03-13.md (or current daily note), memory/tasks.json, and this WIP updated.
