docs(reliability): record acpx follow-up evidence

This commit is contained in:
zap
2026-03-13 20:10:40 +00:00
parent 3cfa7a158c
commit d08d8fe661
4 changed files with 104 additions and 13 deletions

@@ -22,7 +22,7 @@ Investigate and improve subagent / ACP delegation reliability, including timeout
## Why this file is still open
- The broader delegation reliability task is not fully done yet.
- Remaining follow-up work is now narrower:
1. ACP-specific Claude/Codex runtime failures
1. ACP-specific Claude/Codex runtime failures / final live OpenClaw ACP validation
2. optional separate `/subagents log` UX cleanup
3. push/PR the focused upstream reliability branch when desired
@@ -38,7 +38,8 @@ Investigate and improve subagent / ACP delegation reliability, including timeout
## Immediate baton
- Do **not** reopen the solved `agent.wait` investigation unless a fresh repro appears.
- If this project is resumed next, start with ACP-specific Claude/Codex runtime failures.
- If this project is resumed next, start with **real OpenClaw ACP-path validation** of the new acpx JSON-RPC error handling (or capture a fresh Claude/Codex end-to-end repro if ACP is still not configured here).
- Treat the historical `acpx exited with code 1/5` note as unresolved-but-unreproduced; do not spend more time on it without fresh evidence.
- Treat `/subagents log` UX edits as a separate branch/pass so they do not muddy the reliability fix branch.
## Evidence gathered so far
@@ -141,6 +142,39 @@ Investigate and improve subagent / ACP delegation reliability, including timeout
- subagent persistence/announcement fix: live-verified ✅
- raw `agent.wait` semantics fix: live-verified ✅
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this focused reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
- ACP follow-up pass on 2026-03-13 found a **new live-reproducible runtime bug** in the bundled `extensions/acpx` layer:
  - current host state does **not** expose a global `acpx` binary on PATH, but the bundled plugin-local runtime exists and works at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` does not contain an explicit `acp` block or enabled `acpx` plugin entry, so this pass used the smallest direct runtime repro path instead of a full `sessions_spawn(runtime:"acp")` OpenClaw run
  - live direct Codex repro now succeeds:
    - command: bundled `acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ending with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - live direct Claude repro does **not** crash, but returns top-level JSON-RPC auth errors and still exits 0:
    - command: bundled `acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process `exit=0`
  - source inspection showed `extensions/acpx/src/runtime-internals/events.ts` ignored that top-level JSON-RPC error shape during prompt streaming, so `runtime.runTurn()` could silently treat Claude auth failure as success (`done`) when no typed `error` event or non-zero exit was emitted
- Implemented the smallest focused upstream runtime fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now maps those JSON-RPC errors into ACP runtime `type:"error"` events instead of dropping them
  - regression coverage added:
    - `extensions/acpx/src/runtime-internals/events.test.ts`: top-level JSON-RPC prompt error parsing
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`: mock prompt path for clean-exit JSON-RPC auth error
    - `extensions/acpx/src/runtime.test.ts`: `runTurn()` emits error and does **not** emit `done` for the Claude-style auth failure shape
- Targeted validation for the ACP follow-up fix passed:
  - `cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `3` files passed, `22` tests passed
- Current interpretation of the old Claude/Codex ACP bug after this pass:
  - historical notes still say `Claude: acpx exited with code 1`, `Codex: acpx exited with code 5`
  - those exact exit-code crashes were **not** reproduced today
  - current live state is narrower and better understood:
    - Codex ACP path works directly
    - Claude ACP path currently fails for auth, and OpenClaw previously mishandled that failure shape in the acpx runtime layer
- Remaining open ACP follow-up after this fix:
  - validate the patched runtime through the real OpenClaw ACP path (`sessions_spawn(runtime:"acp")`) once ACP is explicitly enabled/configured here, or whenever a fresh end-to-end repro is available
  - only reopen the historical `acpx exited with code 1/5` line if a fresh repro appears
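
For reference, the failure shape recorded above (a top-level JSON-RPC `error` response on stdout combined with `exit=0`) can be illustrated with a minimal TypeScript sketch. This is not the code in `extensions/acpx/src/runtime-internals/events.ts`; the names `parseTopLevelJsonRpcError`, `summarizeTurn`, and `RuntimeErrorEvent` are hypothetical and exist only for illustration. It assumes one JSON object per stdout line, as in the captured Claude output.

```typescript
// Illustrative sketch only: hypothetical names, not the upstream acpx implementation.

interface RuntimeErrorEvent {
  type: "error";
  code: number;
  message: string;
}

type TurnOutcome = RuntimeErrorEvent | { type: "done" };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns an error event when the line is a top-level JSON-RPC error response
// ({"jsonrpc":"2.0","id":...,"error":{code,message}}); otherwise returns null.
function parseTopLevelJsonRpcError(line: string): RuntimeErrorEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not a JSON line at all
  }
  if (!isRecord(parsed) || parsed["jsonrpc"] !== "2.0") return null;
  const error = parsed["error"];
  if (!isRecord(error)) return null;
  const code = error["code"];
  const message = error["message"];
  if (typeof code !== "number" || typeof message !== "string") return null;
  return { type: "error", code, message };
}

// Decides a turn outcome from captured stdout lines and the process exit code.
// A JSON-RPC error wins even when the process exits 0, which is the case that
// was previously reported as success.
function summarizeTurn(stdoutLines: string[], exitCode: number): TurnOutcome {
  for (const line of stdoutLines) {
    const errorEvent = parseTopLevelJsonRpcError(line.trim());
    if (errorEvent !== null) return errorEvent;
  }
  if (exitCode !== 0) {
    return { type: "error", code: exitCode, message: `acpx exited with code ${exitCode}` };
  }
  return { type: "done" };
}

// Usage with the auth-failure line captured in the Claude repro:
const claudeLine =
  '{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}';
console.log(summarizeTurn([claudeLine], 0));
```

The sketch checks the stream before the exit code because the observed Claude failure carried no other signal; whether the real runtime orders these checks the same way is not confirmed here.
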
## Constraints
- Prefer evidence over theory.