docs: document native audio support across README, CHANGELOG, config, and planning docs

- README: add audio.transcribe to tool list, update media pipeline description,
  add Native Audio Support and Audio Transcription config sections, add
  supports_audio per-tier override example
- SOUL.md: add audio.transcribe to available tools list
- CHANGELOG: add native audio support and audio.transcribe tool entries
- config/default.yaml: add commented audio config section, supports_audio hint
- INTEGRATIONS.md: expand audio section with native passthrough, capabilities,
  smart routing, AudioSource type, token estimation, audio.transcribe tool
- STRUCTURE.md: add capabilities.ts and audio-transcribe.ts to key file listings
- ARCHITECTURE.md: update data flow step 5 to describe smart audio routing
William Valentin
2026-02-11 18:41:53 -08:00
parent 819ac26b3b
commit 5c531a760d
7 changed files with 87 additions and 8 deletions
+1 -1
@@ -161,7 +161,7 @@
2. Adapter calls `onMessage()` callback → `ChannelRegistry.handleInbound()` routes to `MessageHandler`
3. `createMessageRouter()` resolves agent config via `AgentRouter.resolve(channel, senderId)`
4. `getOrCreateAgent()` creates/retrieves `AgentOrchestrator` for the session (cached by `channel:sender:agentConfig`)
-5. Audio attachments transcribed if present
+5. Audio routing: `supportsAudioInput()` checks provider capability; native audio passed through for Gemini/OpenAI/GitHub, transcribed via Whisper for others
6. `orchestrator.process()` → injects memory context → checks compaction → delegates to `NativeAgent.process()`
7. `NativeAgent.toolLoop()` → sends to `ModelRouter.chat()` → model returns response or tool calls
8. If tool calls: `ToolExecutor.execute()` → policy check → hook check → tool execution → loop back to model
+16
@@ -234,6 +234,22 @@ All adapters implement `ChannelAdapter` interface (`src/channels/types.ts`): `co
- Supported formats: OGG, MP3, WAV, WebM, MP4, M4A
- Integration: Auto-transcribes audio attachments from channels before model processing
**Native Audio Passthrough:**
- Implementation: `src/models/capabilities.ts`, `src/daemon/routing.ts`
- Capability check: `supportsAudioInput(provider, model, override?)` determines if a model can process raw audio
- Audio-capable providers: Gemini (`inlineData`), OpenAI (`input_audio`), GitHub (`input_audio`)
- Non-audio providers: Anthropic, Bedrock, Ollama, llama.cpp (fall back to Whisper transcription)
- Config override: `supports_audio: true/false` per model tier overrides auto-detection
- Smart routing: `createMessageRouter()` checks capability, passes raw `AudioSource` for capable models or transcribes via Whisper for others
- Audio content types: `AudioSource` (`{ type: 'audio', data: string, mimeType: string }`) in `src/models/types.ts`
- Token estimation: `estimateAudioTokens()` in `src/context/tokens.ts` (base64 length -> bytes -> duration at 16kbps -> tokens at 32/sec)
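The token estimation chain in the last bullet (base64 length, then bytes, then duration at 16 kbps, then tokens at 32 per second) can be sketched as below. This is a minimal illustration and not the code in `src/context/tokens.ts`: the function name `estimateAudioTokensSketch`, the padding handling, and the rounding up are assumptions.

```typescript
// Constants taken from the description above; the real values may differ.
const ASSUMED_BITRATE_BPS = 16000; // 16 kbps
const ASSUMED_TOKENS_PER_SECOND = 32;

// Hypothetical re-implementation of the estimate, for illustration only.
function estimateAudioTokensSketch(base64Data: string): number {
  const len = base64Data.length;

  // Count trailing "=" padding characters (0, 1, or 2).
  let padding = 0;
  if (len > 0 && base64Data.charAt(len - 1) === "=") padding = 1;
  if (len > 1 && base64Data.charAt(len - 2) === "=") padding = 2;

  // Every 4 base64 characters decode to 3 bytes, minus the padding.
  const bytes = Math.max(0, Math.floor((len * 3) / 4) - padding);

  // Bytes -> bits -> seconds at the assumed bitrate.
  const seconds = (bytes * 8) / ASSUMED_BITRATE_BPS;

  // Seconds -> tokens; rounding up is an assumption of this sketch.
  return Math.ceil(seconds * ASSUMED_TOKENS_PER_SECOND);
}
```

Under these assumptions, 40,000 base64 characters decode to 30,000 bytes, which is 15 seconds at 16 kbps and therefore 480 tokens.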
**Agent Tool: audio.transcribe:**
- Implementation: `src/tools/builtin/audio-transcribe.ts`
- Transcribes audio files on-demand via the configured Whisper-compatible endpoint
- Input: file path or base64 data with MIME type
- Output: transcribed text
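Because the tool accepts base64 data with a MIME type, an implementation will likely need to map that MIME type onto one of the supported formats before uploading, since multipart uploads generally carry a filename. The sketch below shows one way to do that. The function name, the MIME-type table, and the `audio.<ext>` naming are all assumptions for illustration, not taken from `src/tools/builtin/audio-transcribe.ts`.

```typescript
// Assumed MIME-type to extension table covering the six supported formats.
const ASSUMED_EXTENSIONS: { [mimeType: string]: string } = {
  "audio/ogg": "ogg",
  "audio/mpeg": "mp3",
  "audio/wav": "wav",
  "audio/webm": "webm",
  "audio/mp4": "mp4",
  "audio/x-m4a": "m4a",
};

// Hypothetical helper: returns an upload filename, or null if unsupported.
function uploadFilenameSketch(mimeType: string): string | null {
  // Drop parameters such as "; codecs=opus" and normalise case.
  const semi = mimeType.indexOf(";");
  const base = (semi === -1 ? mimeType : mimeType.slice(0, semi)).trim().toLowerCase();

  // Own-property check avoids matching inherited keys like "constructor".
  if (!Object.prototype.hasOwnProperty.call(ASSUMED_EXTENSIONS, base)) return null;
  return "audio." + ASSUMED_EXTENSIONS[base];
}
```

Returning `null` for unknown types lets the caller reject the input before any network request is made.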
## MCP (Model Context Protocol)
**MCP Client:**
+2 -2
@@ -150,7 +150,7 @@ flynn/
**`src/models/`:**
- Purpose: LLM provider client implementations and tier-based routing
- Contains: Provider clients, `ModelRouter`, retry logic, cost estimation, media helpers
-- Key files: `src/models/types.ts` (core interfaces), `src/models/router.ts`, `src/models/anthropic.ts`, `src/models/openai.ts`, `src/models/gemini.ts`, `src/models/bedrock.ts`, `src/models/github.ts`, `src/models/retry.ts`, `src/models/costs.ts`, `src/models/media.ts`
+- Key files: `src/models/types.ts` (core interfaces), `src/models/router.ts`, `src/models/anthropic.ts`, `src/models/openai.ts`, `src/models/gemini.ts`, `src/models/bedrock.ts`, `src/models/github.ts`, `src/models/retry.ts`, `src/models/costs.ts`, `src/models/media.ts`, `src/models/capabilities.ts`
**`src/models/local/`:**
- Purpose: Local model provider clients
@@ -185,7 +185,7 @@ flynn/
**`src/tools/builtin/`:**
- Purpose: Built-in tool implementations shipped with Flynn
- Contains: Shell exec, file operations, web fetch, memory ops, web search, media send, image analysis, session management, agent listing, cross-channel messaging, cron management
-- Key files: `src/tools/builtin/shell.ts`, `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-patch.ts`, `src/tools/builtin/file-list.ts`, `src/tools/builtin/web-fetch.ts`, `src/tools/builtin/web-search.ts`, `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`, `src/tools/builtin/media-send.ts`, `src/tools/builtin/image-analyze.ts`, `src/tools/builtin/system-info.ts`, `src/tools/builtin/sessions.ts`, `src/tools/builtin/agents-list.ts`, `src/tools/builtin/message-send.ts`, `src/tools/builtin/cron.ts`
+- Key files: `src/tools/builtin/shell.ts`, `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-patch.ts`, `src/tools/builtin/file-list.ts`, `src/tools/builtin/web-fetch.ts`, `src/tools/builtin/web-search.ts`, `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`, `src/tools/builtin/media-send.ts`, `src/tools/builtin/image-analyze.ts`, `src/tools/builtin/audio-transcribe.ts`, `src/tools/builtin/system-info.ts`, `src/tools/builtin/sessions.ts`, `src/tools/builtin/agents-list.ts`, `src/tools/builtin/message-send.ts`, `src/tools/builtin/cron.ts`
**`src/tools/builtin/browser/`:**
- Purpose: Puppeteer-based browser automation tools
+9
@@ -6,6 +6,15 @@ All notable changes to Flynn are documented in this file.
### Added
- **Native Audio Support** -- Smart routing for voice messages: audio-capable models
(Gemini, OpenAI, GitHub) receive raw audio directly via `AudioSource` content parts;
non-audio models (Anthropic, Bedrock, Ollama, llama.cpp) get Whisper transcription
fallback. `supportsAudioInput()` capability check with per-model `supports_audio`
config override. Audio token estimation (base64 -> bytes -> duration -> tokens at
32 tokens/sec). 38 new tests (18 capabilities + 15 media + 5 token estimation).
- **Agent Tool: audio.transcribe** -- Transcribe audio files via a Whisper-compatible
API endpoint. Configurable via `audio.transcription_endpoint`, supports OGG, MP3,
WAV, WebM, MP4, M4A formats.
- **xAI (Grok) Provider** -- xAI as OpenAI-compatible model provider with
`provider: xai` config. Supports grok-3, grok-3-mini, grok-2, grok-2-mini,
grok-3-fast. Uses `XAI_API_KEY` env var or config `api_key`.
+46 -4
@@ -12,10 +12,10 @@ Self-hosted personal AI assistant with Telegram and Terminal interfaces.
- **Session Persistence**: SQLite-backed conversation history
- **Fallback Chains**: Automatic failover when primary model fails
- **Hook Engine**: Confirmation system for sensitive operations
-- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, system info
+- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, audio transcribe, system info
- **Docker Sandboxing**: Per-session container isolation for tool execution
- **Multi-Agent Routing**: Config-driven agent selection per sender/channel with tool profiles
-- **Media Pipeline**: Image analysis, outbound attachments, audio transcription across all channels
+- **Media Pipeline**: Image analysis, outbound attachments, audio transcription and native audio passthrough across all channels
- **Session Transfer**: Move conversations between frontends
- **CLI**: Full command-line interface (`flynn start`, `send`, `doctor`, `completion`, etc.)
- **Shell Completion**: Auto-generated completions for bash, zsh, and fish with `--install` flag
@@ -143,6 +143,48 @@ models:
local: { provider: ollama, model: qwen2.5:14b }
```
### Native Audio Support
Voice messages from channels can be handled in two ways:
1. **Native passthrough** -- Audio sent directly to models that support audio input (Gemini, OpenAI, GitHub). No transcription step needed.
2. **Whisper transcription** -- Audio transcribed to text via a Whisper-compatible API, then sent as text to models that don't support audio input (Anthropic, Bedrock, Ollama, llama.cpp).
Flynn automatically routes based on the model's capabilities. You can override this per-tier:
```yaml
models:
  default:
    provider: gemini
    model: gemini-2.0-flash
    supports_audio: true   # Force native audio (auto-detected for known providers)
  fast:
    provider: anthropic
    model: claude-sonnet-4
    supports_audio: false  # Force transcription (default for Anthropic)
```
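How a per-tier `supports_audio` value might take precedence over provider auto-detection can be sketched as follows. The signature mirrors the documented `supportsAudioInput(provider, model, override?)`, but the body is an assumption: the real check in `src/models/capabilities.ts` may also inspect the model name, which this sketch ignores.

```typescript
// Providers listed as audio-capable in the docs; treated as a static list here.
const ASSUMED_AUDIO_PROVIDERS = ["gemini", "openai", "github"];

// Hypothetical stand-in for supportsAudioInput(); not the project's code.
function supportsAudioInputSketch(
  provider: string,
  _model: string, // unused in this sketch; the real check may be model-aware
  override?: boolean
): boolean {
  // An explicit supports_audio value in config wins over auto-detection.
  if (override !== undefined) return override;

  // Otherwise fall back to the provider list.
  return ASSUMED_AUDIO_PROVIDERS.indexOf(provider.toLowerCase()) !== -1;
}
```

With this ordering, `supports_audio: false` on a Gemini tier forces transcription, and `supports_audio: true` on an otherwise unknown provider forces native passthrough.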
### Audio Transcription
Configure a Whisper-compatible endpoint for models that don't support native audio:
```yaml
audio:
  transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
  transcription_api_key: "${WHISPER_API_KEY}"   # Optional Bearer token
  transcription_model: "whisper-1"              # Model name (default: whisper-1)
  transcription_provider: "openai"              # Provider format: openai (default)
```
| Field | Required | Description |
|-------|----------|-------------|
| `transcription_endpoint` | yes | Whisper-compatible API endpoint |
| `transcription_api_key` | no | Bearer token for authentication |
| `transcription_model` | no | Model name sent in the request (default: `whisper-1`) |
| `transcription_provider` | no | API format: `openai` (default) |
Without an `audio` config, voice messages routed to models without native audio support are silently skipped.
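The three outcomes described in this section (native passthrough, Whisper transcription, or skipping the message) can be summarised as a small decision function. This is a sketch under stated assumptions: `routeAudioSketch`, `AudioRouteSketch`, and `AudioSourceLike` are hypothetical names, and the real logic in `createMessageRouter()` may be structured differently.

```typescript
// Mirrors the documented AudioSource shape; renamed to mark it as a sketch.
type AudioSourceLike = { type: "audio"; data: string; mimeType: string };

// Hypothetical result type covering the three documented outcomes.
type AudioRouteSketch =
  | { kind: "native"; audio: AudioSourceLike }
  | { kind: "transcribe"; endpoint: string }
  | { kind: "skip" };

function routeAudioSketch(
  audio: AudioSourceLike,
  modelSupportsAudio: boolean,
  transcriptionEndpoint?: string
): AudioRouteSketch {
  // Audio-capable model: pass the raw audio through untouched.
  if (modelSupportsAudio) return { kind: "native", audio: audio };

  // Otherwise transcribe, but only if an endpoint is configured.
  if (transcriptionEndpoint) return { kind: "transcribe", endpoint: transcriptionEndpoint };

  // No native support and no audio config: the voice message is dropped.
  return { kind: "skip" };
}
```

Modelling the result as a tagged union keeps the skip case explicit, so a caller cannot forget to handle it.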
## Telegram Commands
| Command | Description |
@@ -726,12 +768,12 @@ src/
├── hooks/ # Confirmation engine
├── mcp/ # MCP tool server integration
├── memory/ # Persistent memory store + vector search
-├── models/ # Model providers + router + media pipeline
+├── models/ # Model providers + router + media pipeline + audio capabilities
├── prompt/ # System prompt templating (auto-injects current date/time)
├── sandbox/ # Docker sandboxing
├── session/ # SQLite persistence
├── skills/ # Skill packages
-├── tools/ # Builtin tools (shell, file, web, browser, process, media, system.info)
+├── tools/ # Builtin tools (shell, file, web, browser, process, media, audio, system.info)
└── automation/ # Cron scheduler, webhooks, heartbeat monitor, Gmail watcher
```
+1 -1
@@ -55,7 +55,7 @@ You have tools for interacting with your operator's system:
- **process.start / process.status / process.output / process.kill / process.list** -- Manage background processes.
- **message.send** -- Send messages to other channels (Telegram, Discord, etc.).
-Additional tools (image.analyze, media.send, browser.*, gmail.*, calendar.*, sessions.*, agents.list) may be available depending on configuration. Check your tool definitions if unsure.
+Additional tools (image.analyze, media.send, audio.transcribe, browser.*, gmail.*, calendar.*, sessions.*, agents.list) may be available depending on configuration. Check your tool definitions if unsure.
## Tool Usage Rules
+12
@@ -39,6 +39,7 @@ models:
  default:
    provider: anthropic
    model: claude-sonnet-4-20250514
    # supports_audio: false # Override native audio detection per tier
  local:
    provider: ollama
    model: glm-4.7-flash
@@ -117,3 +118,14 @@ hooks:
# peer: "123456789"
# failure_threshold: 2
# disk_threshold_mb: 100
# ── Audio ────────────────────────────────────────────────────────────
# Configure a Whisper-compatible endpoint for audio transcription.
# Models that support native audio input (Gemini, OpenAI, GitHub) will
# receive raw audio directly; others fall back to this endpoint.
# audio:
#   transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
#   transcription_api_key: "${WHISPER_API_KEY}"
#   transcription_model: "whisper-1"
#   transcription_provider: "openai"