docs: document native audio support across README, CHANGELOG, config, and planning docs

- README: add audio.transcribe to tool list, update media pipeline description,
  add Native Audio Support and Audio Transcription config sections, add
  supports_audio per-tier override example
- SOUL.md: add audio.transcribe to available tools list
- CHANGELOG: add native audio support and audio.transcribe tool entries
- config/default.yaml: add commented audio config section, supports_audio hint
- INTEGRATIONS.md: expand audio section with native passthrough, capabilities,
  smart routing, AudioSource type, token estimation, audio.transcribe tool
- STRUCTURE.md: add capabilities.ts and audio-transcribe.ts to key file listings
- ARCHITECTURE.md: update data flow step 5 to describe smart audio routing
Author: William Valentin
Date: 2026-02-11 18:41:53 -08:00
Parent: 819ac26b3b
Commit: 5c531a760d
7 changed files with 87 additions and 8 deletions
+1 -1
@@ -161,7 +161,7 @@
 2. Adapter calls `onMessage()` callback → `ChannelRegistry.handleInbound()` routes to `MessageHandler`
 3. `createMessageRouter()` resolves agent config via `AgentRouter.resolve(channel, senderId)`
 4. `getOrCreateAgent()` creates/retrieves `AgentOrchestrator` for the session (cached by `channel:sender:agentConfig`)
-5. Audio attachments transcribed if present
+5. Audio routing: `supportsAudioInput()` checks provider capability — native audio passed through for Gemini/OpenAI/GitHub, transcribed via Whisper for others
 6. `orchestrator.process()` → injects memory context → checks compaction → delegates to `NativeAgent.process()`
 7. `NativeAgent.toolLoop()` → sends to `ModelRouter.chat()` → model returns response or tool calls
 8. If tool calls: `ToolExecutor.execute()` → policy check → hook check → tool execution → loop back to model
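Step 5 now picks one of two paths per audio attachment instead of always transcribing. The sketch below isolates that decision. It assumes the `AudioSource` shape documented for `src/models/types.ts`; `TextPart`, `ContentPart`, `Transcriber`, and `routeAudioAttachment` are hypothetical names, and the boolean parameter stands in for the result of `supportsAudioInput()`. It is an illustration, not the code inside `createMessageRouter()`.

```typescript
// AudioSource mirrors the shape documented for src/models/types.ts.
// TextPart, ContentPart and Transcriber are hypothetical names for this sketch.
type AudioSource = { type: 'audio'; data: string; mimeType: string };
type TextPart = { type: 'text'; text: string };
type ContentPart = AudioSource | TextPart;

// Assumed signature for a Whisper-compatible transcription call.
type Transcriber = (audio: AudioSource) => Promise<string>;

// Hypothetical helper: choose native passthrough or Whisper fallback for one
// audio attachment, given the result of the capability check.
async function routeAudioAttachment(
  audio: AudioSource,
  modelSupportsAudio: boolean,
  transcribe: Transcriber,
): Promise<ContentPart> {
  if (modelSupportsAudio) {
    // Native path: hand the raw audio to the provider client unchanged;
    // each client is expected to serialize it in its own wire format.
    return audio;
  }
  // Fallback path: replace the audio with its transcript as plain text.
  const text = await transcribe(audio);
  return { type: 'text', text };
}
```

Passing the transcriber in as a parameter keeps the routing decision testable without a live Whisper endpoint.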
+16
@@ -234,6 +234,22 @@ All adapters implement `ChannelAdapter` interface (`src/channels/types.ts`): `co
 - Supported formats: OGG, MP3, WAV, WebM, MP4, M4A
 - Integration: Auto-transcribes audio attachments from channels before model processing
+**Native Audio Passthrough:**
+- Implementation: `src/models/capabilities.ts`, `src/daemon/routing.ts`
+- Capability check: `supportsAudioInput(provider, model, override?)` determines if a model can process raw audio
+- Audio-capable providers: Gemini (`inlineData`), OpenAI (`input_audio`), GitHub (`input_audio`)
+- Non-audio providers: Anthropic, Bedrock, Ollama, llama.cpp (fall back to Whisper transcription)
+- Config override: `supports_audio: true/false` per model tier overrides auto-detection
+- Smart routing: `createMessageRouter()` checks capability, passes raw `AudioSource` for capable models or transcribes via Whisper for others
+- Audio content types: `AudioSource` (`{ type: 'audio', data: string, mimeType: string }`) in `src/models/types.ts`
+- Token estimation: `estimateAudioTokens()` in `src/context/tokens.ts` (base64 length -> bytes -> duration at 16kbps -> tokens at 32/sec)
+**Agent Tool: audio.transcribe:**
+- Implementation: `src/tools/builtin/audio-transcribe.ts`
+- Transcribes audio files on-demand via the configured Whisper-compatible endpoint
+- Input: file path or base64 data with MIME type
+- Output: transcribed text
 ## MCP (Model Context Protocol)
 **MCP Client:**
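The token estimate documented for `estimateAudioTokens()` in `src/context/tokens.ts` is a short chain of unit conversions (base64 length, bytes, seconds, tokens). A minimal sketch of that chain, assuming "16kbps" means 16,000 bits per second and that the result is rounded up; `estimateAudioTokensSketch` is a hypothetical name, and the real function's rounding and padding handling may differ.

```typescript
// Assumed constants: "16kbps" read as 16,000 bits per second, and
// 32 tokens per second of audio, as stated in the INTEGRATIONS.md entry.
const ASSUMED_AUDIO_BITS_PER_SECOND = 16000;
const AUDIO_TOKENS_PER_SECOND = 32;

// Hypothetical re-derivation of the estimate; not the code in src/context/tokens.ts.
function estimateAudioTokensSketch(base64: string): number {
  // Every 4 base64 characters encode 3 bytes; trailing '=' is padding only.
  const unpaddedLength = base64.replace(/=+$/, '').length;
  const bytes = Math.floor((unpaddedLength * 3) / 4);
  // Bytes -> seconds at the assumed bitrate (bits / 8 = bytes per second).
  const seconds = bytes / (ASSUMED_AUDIO_BITS_PER_SECOND / 8);
  // Seconds -> tokens, rounding up so any non-empty clip costs at least one token.
  return Math.ceil(seconds * AUDIO_TOKENS_PER_SECOND);
}
```

Under these assumptions a one-minute clip is 60 × 2,000 = 120,000 bytes (160,000 base64 characters) and estimates to 60 × 32 = 1,920 tokens.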
+2 -2
@@ -150,7 +150,7 @@ flynn/
 **`src/models/`:**
 - Purpose: LLM provider client implementations and tier-based routing
 - Contains: Provider clients, `ModelRouter`, retry logic, cost estimation, media helpers
-- Key files: `src/models/types.ts` (core interfaces), `src/models/router.ts`, `src/models/anthropic.ts`, `src/models/openai.ts`, `src/models/gemini.ts`, `src/models/bedrock.ts`, `src/models/github.ts`, `src/models/retry.ts`, `src/models/costs.ts`, `src/models/media.ts`
+- Key files: `src/models/types.ts` (core interfaces), `src/models/router.ts`, `src/models/anthropic.ts`, `src/models/openai.ts`, `src/models/gemini.ts`, `src/models/bedrock.ts`, `src/models/github.ts`, `src/models/retry.ts`, `src/models/costs.ts`, `src/models/media.ts`, `src/models/capabilities.ts`
 **`src/models/local/`:**
 - Purpose: Local model provider clients
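`src/models/capabilities.ts`, added to the `src/models/` key files, holds the `supportsAudioInput(provider, model, override?)` check. A minimal sketch of the precedence it is documented to follow (an explicit per-tier `supports_audio` override first, then provider-level detection), assuming hypothetical provider ids; `supportsAudioInputSketch` is an illustrative name, and the real function may also inspect the model name.

```typescript
// Hypothetical provider ids; the identifiers used in Flynn's config may differ.
type Provider = 'gemini' | 'openai' | 'github' | 'anthropic' | 'bedrock' | 'ollama' | 'llamacpp';

// Providers documented as accepting raw audio:
// Gemini (inlineData), OpenAI and GitHub (input_audio).
const AUDIO_CAPABLE_PROVIDERS: ReadonlySet<Provider> = new Set<Provider>(['gemini', 'openai', 'github']);

function supportsAudioInputSketch(provider: Provider, _model: string, override?: boolean): boolean {
  // A per-tier `supports_audio: true/false` setting always takes precedence.
  if (override !== undefined) return override;
  // Otherwise fall back to provider-level detection. This sketch ignores the
  // model name; a real check could also distinguish models within a provider.
  return AUDIO_CAPABLE_PROVIDERS.has(provider);
}
```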
@@ -185,7 +185,7 @@ flynn/
 **`src/tools/builtin/`:**
 - Purpose: Built-in tool implementations shipped with Flynn
 - Contains: Shell exec, file operations, web fetch, memory ops, web search, media send, image analysis, session management, agent listing, cross-channel messaging, cron management
-- Key files: `src/tools/builtin/shell.ts`, `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-patch.ts`, `src/tools/builtin/file-list.ts`, `src/tools/builtin/web-fetch.ts`, `src/tools/builtin/web-search.ts`, `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`, `src/tools/builtin/media-send.ts`, `src/tools/builtin/image-analyze.ts`, `src/tools/builtin/system-info.ts`, `src/tools/builtin/sessions.ts`, `src/tools/builtin/agents-list.ts`, `src/tools/builtin/message-send.ts`, `src/tools/builtin/cron.ts`
+- Key files: `src/tools/builtin/shell.ts`, `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-patch.ts`, `src/tools/builtin/file-list.ts`, `src/tools/builtin/web-fetch.ts`, `src/tools/builtin/web-search.ts`, `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`, `src/tools/builtin/media-send.ts`, `src/tools/builtin/image-analyze.ts`, `src/tools/builtin/audio-transcribe.ts`, `src/tools/builtin/system-info.ts`, `src/tools/builtin/sessions.ts`, `src/tools/builtin/agents-list.ts`, `src/tools/builtin/message-send.ts`, `src/tools/builtin/cron.ts`
 **`src/tools/builtin/browser/`:**
 - Purpose: Puppeteer-based browser automation tools
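The `audio.transcribe` tool (`src/tools/builtin/audio-transcribe.ts`) is documented as taking either a file path or base64 data with a MIME type. A minimal sketch of how such input could be normalized before calling the transcription endpoint, assuming a hypothetical `TranscribeInput` shape and an extension-to-MIME table built from the documented formats (OGG, MP3, WAV, WebM, MP4, M4A); the MIME strings are common choices rather than values taken from Flynn, and the real tool schema may differ.

```typescript
// Hypothetical input shape for the tool; the real schema may use other names.
interface TranscribeInput {
  path?: string;
  data?: string; // base64-encoded audio
  mimeType?: string;
}

// Assumed MIME mapping for the documented formats.
const MIME_BY_EXT: Record<string, string | undefined> = {
  ogg: 'audio/ogg',
  mp3: 'audio/mpeg',
  wav: 'audio/wav',
  webm: 'audio/webm',
  mp4: 'audio/mp4',
  m4a: 'audio/mp4',
};

type ResolvedAudio =
  | { kind: 'file'; path: string; mimeType: string }
  | { kind: 'inline'; data: string; mimeType: string };

function resolveTranscribeInput(input: TranscribeInput): ResolvedAudio {
  if (input.path) {
    // Take the trailing extension, case-insensitively, e.g. "voice.OGG" -> "ogg".
    const match = /\.([a-z0-9]+)$/i.exec(input.path);
    const ext = match?.[1]?.toLowerCase() ?? '';
    // An explicit mimeType wins over extension lookup.
    const mimeType = input.mimeType ?? MIME_BY_EXT[ext];
    if (!mimeType) throw new Error(`Unsupported or unknown audio format: ${input.path}`);
    return { kind: 'file', path: input.path, mimeType };
  }
  if (input.data) {
    // Raw base64 carries no filename, so the caller must state the format.
    if (!input.mimeType) throw new Error('Base64 audio input requires mimeType');
    return { kind: 'inline', data: input.data, mimeType: input.mimeType };
  }
  throw new Error('Provide either path or data');
}
```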