Files
swarm-zap/HANDOFF.md

5.5 KiB
Raw Permalink Blame History

HANDOFF.md

Purpose

Immediate baton-pass for the next fresh implementation session.

Current objective

Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The subagent persistence/announcement fix and the raw agent.wait semantics fix are now both live-verified on branch fix/subagent-wait-error-outcome; the next work should stay tightly scoped to ACP-specific Claude/Codex follow-up. This pass already narrowed that thread to a real bundled-acpx parser bug for Claude-style JSON-RPC auth failures and landed a focused fix/tests. The remaining work is end-to-end OpenClaw ACP-path validation (or a fresh repro of the older exit-code crash notes) plus normal commit/push/PR cleanup when desired.

Use these state files first

  1. WIP.subagent-reliability.md — canonical state for this pass
  2. memory/tasks.json — task tracking for reliability items
  3. memory/2026-03-04-subagent-delegation.md — earlier delegation context
  4. memory/2026-03-13.md if present, otherwise append todays evidence there
  5. external/openclaw-upstream/ — for any core-runtime fix work
  • task-20260304-2215-subagent-reliability — in progress
  • task-20260304-211216-acp-claude-codex — open

Known truths

  • TUI noise suppression was already patched locally and upstreamed earlier.
  • User still wants actual subagent reliability improved, not just UI noise hidden.
  • Historical ACP notes included Claude: acpx exited with code 1 and Codex: acpx exited with code 5, but those exact crashes were not reproduced in the latest pass.
  • Fresh-session implementation discipline is now the expected approach for non-trivial work.
  • One explicit failure mode is already understood: requesting glm-5 can route into an unavailable GLM-5 provider/entitlement path in this setup.
  • A deeper bug was also identified and fixed earlier: a subagent run could finish with terminal assistant errors yet still be recorded as successful with no frozen result.
  • Current host state for ACP follow-up:
    • bundled plugin-local acpx exists and runs
    • ~/.openclaw/openclaw.json currently has no explicit acp block / enabled acpx plugin entry, so this pass used the smallest direct acpx repro path instead of a full OpenClaw ACP session
  • New confirmed acpx/runtime bug from this pass:
    • Codex direct acpx path works
    • Claude direct acpx path returns top-level JSON-RPC auth errors (Authentication required) and exits 0
    • extensions/acpx/src/runtime-internals/events.ts previously dropped that JSON-RPC error shape during prompt streaming, so OpenClaw could falsely treat the turn as successful
  • A focused upstream fix for that runtime bug now exists on fix/subagent-wait-error-outcome with targeted tests passing.

Highest-priority next actions

  1. Treat the generic reliability fixes as live-verified on this branch:
    • subagent persistence/announcement proof:
      • run id b50cb91f-6219-44f7-9d2f-a1264ac7ceaf
      • child transcript ~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl
      • runs.json stores outcome.status: "error", endedReason: "subagent-error", and a non-null frozenResultText
    • raw agent.wait live-fix proof:
      • gateway launch: OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured
      • run id: gwc-live-agent-wait-gpt53-source-fixed2-1773429512008
      • final agent response: finalStatus:"error"
      • agent.wait: {"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}
  2. Treat the ACP follow-up as partially closed, not fully done:
    • live direct bundled-acpx Codex repro now works and returns OK
    • live direct bundled-acpx Claude repro returns JSON-RPC auth errors with process exit 0
    • focused upstream fix now maps top-level JSON-RPC prompt errors into ACP runtime type:"error" events instead of silently dropping them
    • targeted validation passed:
      • cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts
      • result: 22 tests passed across 3 files
  3. Next, do end-to-end OpenClaw ACP validation if/when ACP is explicitly enabled here:
    • confirm or add the needed acp / acpx config in ~/.openclaw/openclaw.json (or equivalent current config path)
    • run the smallest real OpenClaw ACP turn/session and confirm Claude auth failures now surface as terminal errors instead of false success
    • only reopen the old acpx exited with code 1/5 thread if a fresh repro appears
  4. Commit/push/PR the focused upstream reliability branch when ready.
  5. Leave the dirty /subagents log UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed src/auto-reply/reply/commands.test.ts but still lacks dedicated feature coverage.

Success criteria

  • Real-run verification of the new error/outcome fix. done for subagent persistence/announcement handling.
  • Clear separation between resolved reporting bug(s) and any still-open ACP/runtime failures.
  • Explicit decision on whether raw agent.wait behavior is acceptable or requires a follow-up fix.
  • State files updated with paths, commands, and outcomes.