WIP.subagent-reliability.md

Status

  • Status: open
  • Owner: zap
  • Opened: 2026-03-13

Purpose

Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

Why now

This is the highest-leverage remaining open reliability item because it affects trust in delegation and the usability of fresh implementation runs.

  • task-20260304-2215-subagent-reliability — in progress
  • task-20260304-211216-acp-claude-codex — open

Known context

  • Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
  • An upstream patch exists in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context (commit 0f66a4547).
  • User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
  • Fresh-session implementation discipline and monitoring thresholds were already documented locally.

Goals for this pass

  1. Establish the current failure modes with concrete evidence.
  2. Separate ACP-specific failures from generic subagent/session issues.
  3. Determine what is already fixed versus still broken.
  4. Produce a concrete recommendation and, if feasible in one pass, implement the highest-confidence fix.
  5. Update task/memory state with evidence before ending.
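
One way to keep goals 2 and 3 honest is to bucket each observed failure before drawing conclusions. The sketch below is only an illustration: the `Failure` fields, the bucket names, and the auth-before-task heuristic are all assumptions rather than a real log schema or OpenClaw API. An auth error can also be a genuine propagation bug, so the config bucket means "check configuration first", not "not a bug".

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical bucket names matching the split this pass is trying to establish.
CONFIG = "operator/config (check first)"
ACP = "acp-specific"
GENERIC = "generic subagent/session"


@dataclass(frozen=True)
class Failure:
    """One observed failure. Field names are assumptions, not a real schema."""
    runtime: str        # "acp" or "subagent"
    before_task: bool   # True if the run failed before any task execution
    message: str        # short description copied from the evidence


def classify(failure: Failure) -> str:
    """Assign a bucket; likely config mistakes are separated out first
    so they do not inflate the runtime-failure counts."""
    if failure.before_task and "auth" in failure.message.lower():
        return CONFIG
    return ACP if failure.runtime == "acp" else GENERIC


def tally(failures: list[Failure]) -> Counter:
    """Count failures per bucket."""
    return Counter(classify(f) for f in failures)


if __name__ == "__main__":
    observed = [
        Failure("subagent", True, "provider auth error for zai"),
    ]
    for bucket, count in tally(observed).items():
        print(f"{bucket}: {count}")
```

Recording each failure this way also gives the final task/memory update a concrete table to cite instead of a summary from recollection.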

Suggested investigation plan

  1. Review current OpenClaw docs and local memory around subagent/ACP failures.
  2. Reproduce or inspect recent failures using session/task evidence instead of guessing.
  3. Check current runtime status / relevant logs / known local patches.
  4. If the issue is in OpenClaw core, work in external/openclaw-upstream/ on a focused branch.
  5. Validate with the smallest reliable reproduction possible.

Evidence gathered so far

  • A fresh subagent run failed immediately with a provider auth error for zai, before any task execution.
  • Current installed agent auth profiles include openai-codex:default, litellm:default, and github-copilot:github; there is no zai profile configured.
  • Root cause for this immediate repro appears to be an incorrect explicit spawn model choice (glm-5 alias → zai/glm-5) rather than missing auth propagation between agents.
  • Next step after confirming the model-selection issue: prefer gpt-5.4 for fresh subagent reliability/debug passes for now, per Will's instruction, and continue separating real runtime issues from operator/config mistakes.
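
The failure described above (an alias resolving to a provider with no auth profile) could be caught before a spawn is attempted. The following is a minimal sketch, assuming profile IDs take the `provider:profile` form listed above and that models resolve to a `provider/model` string. `MODEL_ALIASES`, `resolve_model`, `configured_providers`, and `preflight` are hypothetical names for illustration, not OpenClaw APIs, and only the glm-5 mapping comes from the evidence.

```python
"""Pre-spawn check: does the chosen model's provider have an auth profile?

Hypothetical sketch; none of these names are OpenClaw APIs.
"""

# Auth profile IDs as listed in the evidence ("provider:profile").
AUTH_PROFILES = ["openai-codex:default", "litellm:default", "github-copilot:github"]

# Assumed alias table; only the glm-5 mapping is taken from the evidence.
MODEL_ALIASES = {"glm-5": "zai/glm-5"}


def resolve_model(model: str) -> str:
    """Expand an alias to its 'provider/model' form; pass other names through."""
    return MODEL_ALIASES.get(model, model)


def configured_providers(profiles: list[str]) -> set[str]:
    """Extract the provider part of each 'provider:profile' ID."""
    return {p.split(":", 1)[0] for p in profiles}


def preflight(model: str, profiles: list[str]) -> tuple[bool, str]:
    """Return (ok, reason) for a spawn request without contacting any provider."""
    resolved = resolve_model(model)
    if "/" not in resolved:
        return False, f"{model!r} does not resolve to a provider/model pair"
    provider = resolved.split("/", 1)[0]
    if provider not in configured_providers(profiles):
        return False, (
            f"{model!r} resolves to {resolved!r}, "
            f"but no auth profile exists for {provider!r}"
        )
    return True, f"{resolved!r} has an auth profile"


if __name__ == "__main__":
    ok, reason = preflight("glm-5", AUTH_PROFILES)
    print(ok, reason)
```

A check like this only rules out the operator/config class of failure; a run that passes it and still fails is a better candidate for a real runtime issue.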

Constraints

  • Prefer evidence over theory.
  • Do not claim a fix without concrete validation.
  • Keep the main session clean; use this file as the canonical baton.

Success criteria

  • Clear diagnosis of the current reliability problem(s).
  • At least one of:
    • implemented fix with validation, or
    • sharply scoped next fix plan with exact evidence and files.
  • memory/2026-03-13.md (or current daily note), memory/tasks.json, and this WIP updated.