docs: document native audio support across README, CHANGELOG, config, and planning docs

- README: add audio.transcribe to tool list, update media pipeline description, add Native Audio Support and Audio Transcription config sections, add supports_audio per-tier override example
- SOUL.md: add audio.transcribe to available tools list
- CHANGELOG: add native audio support and audio.transcribe tool entries
- config/default.yaml: add commented audio config section, supports_audio hint
- INTEGRATIONS.md: expand audio section with native passthrough, capabilities, smart routing, AudioSource type, token estimation, audio.transcribe tool
- STRUCTURE.md: add capabilities.ts and audio-transcribe.ts to key file listings
- ARCHITECTURE.md: update data flow step 5 to describe smart audio routing
2026-02-11 18:41:53 -08:00
parent 819ac26b3b
commit 5c531a760d
7 changed files with 87 additions and 8 deletions
@@ -12,10 +12,10 @@ Self-hosted personal AI assistant with Telegram and Terminal interfaces.
 - **Session Persistence**: SQLite-backed conversation history
 - **Fallback Chains**: Automatic failover when primary model fails
 - **Hook Engine**: Confirmation system for sensitive operations
- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, system info
+- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, audio transcribe, system info
 - **Docker Sandboxing**: Per-session container isolation for tool execution
 - **Multi-Agent Routing**: Config-driven agent selection per sender/channel with tool profiles
- **Media Pipeline**: Image analysis, outbound attachments, audio transcription across all channels
+- **Media Pipeline**: Image analysis, outbound attachments, audio transcription and native audio passthrough across all channels
 - **Session Transfer**: Move conversations between frontends
 - **CLI**: Full command-line interface (`flynn start`, `send`, `doctor`, `completion`, etc.)
 - **Shell Completion**: Auto-generated completions for bash, zsh, and fish with `--install` flag
@@ -143,6 +143,48 @@ models:
  local: { provider: ollama, model: qwen2.5:14b }
 ```

+### Native Audio Support
+
+Voice messages from channels can be handled in two ways:
+
+1. **Native passthrough** -- Audio is sent directly to models that support audio input (Gemini, OpenAI, GitHub). No transcription step is needed.
+2. **Whisper transcription** -- Audio is transcribed to text via a Whisper-compatible API, then sent as text to models that don't support audio input (Anthropic, Bedrock, Ollama, llama.cpp).
+
+Flynn automatically routes audio based on the model's capabilities. You can override this per tier:
+
+```yaml
+models:
+  default:
+    provider: gemini
+    model: gemini-2.0-flash
+    supports_audio: true     # Force native audio (auto-detected for known providers)
+  fast:
+    provider: anthropic
+    model: claude-sonnet-4
+    supports_audio: false    # Force transcription (default for Anthropic)
+```
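The routing rule described above can be sketched as follows. This is a minimal illustration and not the code in `capabilities.ts`: the names `TierConfig`, `NATIVE_AUDIO_PROVIDERS`, `routeAudio`, and the lowercase provider identifiers are assumptions made for the example. Only the provider grouping, the `supports_audio` override, and the skip-without-config behavior come from the README text.

```typescript
// Hypothetical types and names; only the behavior mirrors the README text.
interface TierConfig {
  provider: string;
  model: string;
  supports_audio?: boolean; // explicit per-tier override, if present
}

// Providers the README lists as accepting audio input natively.
// The identifier spellings here are assumed.
const NATIVE_AUDIO_PROVIDERS: ReadonlySet<string> = new Set([
  "gemini",
  "openai",
  "github",
]);

type AudioRoute = "native" | "transcribe" | "skip";

function routeAudio(tier: TierConfig, hasTranscriptionConfig: boolean): AudioRoute {
  // An explicit supports_audio value wins; otherwise fall back to the provider default.
  const native = tier.supports_audio ?? NATIVE_AUDIO_PROVIDERS.has(tier.provider);
  if (native) return "native";
  // Non-native models need a Whisper-compatible endpoint; without one the message is skipped.
  return hasTranscriptionConfig ? "transcribe" : "skip";
}
```

Using `??` rather than `||` matters here: an explicit `supports_audio: false` must override a provider that would otherwise be detected as audio-capable.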
+
+### Audio Transcription
+
+Configure a Whisper-compatible endpoint for models that don't support native audio:
+
+```yaml
+audio:
+  transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
+  transcription_api_key: "${WHISPER_API_KEY}"    # Optional Bearer token
+  transcription_model: "whisper-1"               # Model name (default: whisper-1)
+  transcription_provider: "openai"               # Provider format: openai (default)
+```
+
+| Field | Required | Description |
+|-------|----------|-------------|
+| `transcription_endpoint` | yes | Whisper-compatible API endpoint |
+| `transcription_api_key` | no | Bearer token for authentication |
+| `transcription_model` | no | Model name sent in the request (default: `whisper-1`) |
+| `transcription_provider` | no | API format: `openai` (default) |
+
+Without an `audio` config, voice messages routed to models that lack native audio support are silently skipped.
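As a rough sketch of how the defaults in the table might be applied, the helper below resolves an optional `audio` config into concrete values and builds the Bearer header. The names `AudioConfig`, `ResolvedAudioConfig`, and `resolveAudioConfig` are hypothetical and not taken from the Flynn source. The field names and defaults follow the table above.

```typescript
// Hypothetical helper; field names and defaults follow the README table.
interface AudioConfig {
  transcription_endpoint: string;
  transcription_api_key?: string;
  transcription_model?: string;
  transcription_provider?: string;
}

interface ResolvedAudioConfig {
  endpoint: string;
  model: string;
  provider: string;
  headers: Record<string, string>;
}

function resolveAudioConfig(cfg?: AudioConfig): ResolvedAudioConfig | null {
  // No audio section (or no endpoint) means transcription is unavailable.
  if (!cfg || !cfg.transcription_endpoint) return null;

  const headers: Record<string, string> = {};
  if (cfg.transcription_api_key) {
    // The API key is optional; when set it is sent as a Bearer token.
    headers["Authorization"] = `Bearer ${cfg.transcription_api_key}`;
  }

  return {
    endpoint: cfg.transcription_endpoint,
    model: cfg.transcription_model ?? "whisper-1",
    provider: cfg.transcription_provider ?? "openai",
    headers,
  };
}
```

A `null` result corresponds to the case where voice messages for non-audio models are skipped.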
+
 ## Telegram Commands

 | Command | Description |
@@ -726,12 +768,12 @@ src/
 ├── hooks/                # Confirmation engine
 ├── mcp/                  # MCP tool server integration
 ├── memory/               # Persistent memory store + vector search
-├── models/               # Model providers + router + media pipeline
+├── models/               # Model providers + router + media pipeline + audio capabilities
 ├── prompt/               # System prompt templating (auto-injects current date/time)
 ├── sandbox/              # Docker sandboxing
 ├── session/              # SQLite persistence
 ├── skills/               # Skill packages
-├── tools/                # Builtin tools (shell, file, web, browser, process, media, system.info)
+├── tools/                # Builtin tools (shell, file, web, browser, process, media, audio, system.info)
 └── automation/           # Cron scheduler, webhooks, heartbeat monitor, Gmail watcher
 ```