docs: document native audio support across README, CHANGELOG, config, and planning docs

- README: add audio.transcribe to tool list, update media pipeline description,
  add Native Audio Support and Audio Transcription config sections, add
  supports_audio per-tier override example
- SOUL.md: add audio.transcribe to available tools list
- CHANGELOG: add native audio support and audio.transcribe tool entries
- config/default.yaml: add commented audio config section, supports_audio hint
- INTEGRATIONS.md: expand audio section with native passthrough, capabilities,
  smart routing, AudioSource type, token estimation, audio.transcribe tool
- STRUCTURE.md: add capabilities.ts and audio-transcribe.ts to key file listings
- ARCHITECTURE.md: update data flow step 5 to describe smart audio routing
William Valentin
2026-02-11 18:41:53 -08:00
parent 819ac26b3b
commit 5c531a760d
7 changed files with 87 additions and 8 deletions
+46 -4
@@ -12,10 +12,10 @@ Self-hosted personal AI assistant with Telegram and Terminal interfaces.
- **Session Persistence**: SQLite-backed conversation history
- **Fallback Chains**: Automatic failover when primary model fails
- **Hook Engine**: Confirmation system for sensitive operations
- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, system info
- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, audio transcribe, system info
- **Docker Sandboxing**: Per-session container isolation for tool execution
- **Multi-Agent Routing**: Config-driven agent selection per sender/channel with tool profiles
- **Media Pipeline**: Image analysis, outbound attachments, audio transcription across all channels
- **Media Pipeline**: Image analysis, outbound attachments, audio transcription and native audio passthrough across all channels
- **Session Transfer**: Move conversations between frontends
- **CLI**: Full command-line interface (`flynn start`, `send`, `doctor`, `completion`, etc.)
- **Shell Completion**: Auto-generated completions for bash, zsh, and fish with `--install` flag
@@ -143,6 +143,48 @@ models:
local: { provider: ollama, model: qwen2.5:14b }
```
### Native Audio Support
Voice messages from channels can be handled in two ways:
1. **Native passthrough** -- Audio sent directly to models that support audio input (Gemini, OpenAI, GitHub). No transcription step needed.
2. **Whisper transcription** -- Audio transcribed to text via a Whisper-compatible API, then sent as text to models that don't support audio input (Anthropic, Bedrock, Ollama, llama.cpp).
Flynn automatically routes based on the model's capabilities. You can override this per-tier:
```yaml
models:
default:
provider: gemini
model: gemini-2.0-flash
supports_audio: true # Force native audio (auto-detected for known providers)
fast:
provider: anthropic
model: claude-sonnet-4
supports_audio: false # Force transcription (default for Anthropic)
```
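The routing rule described above can be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not Flynn's actual implementation: `TierConfig`, `NATIVE_AUDIO_PROVIDERS`, `supportsAudio`, and `routeAudio` are hypothetical names, and the provider identifiers `openai` and `github` are assumed spellings (only `gemini`, `anthropic`, and `ollama` appear in the config examples).

```typescript
// Hypothetical sketch of capability-based audio routing.
// None of these names are taken from Flynn's source code.

type AudioRoute = "native" | "transcribe" | "skip";

interface TierConfig {
  provider: string;
  model: string;
  // Optional per-tier override; when omitted, the provider default applies.
  supports_audio?: boolean;
}

// Providers the README lists as accepting audio input directly.
// The exact identifier strings are an assumption.
const NATIVE_AUDIO_PROVIDERS = new Set(["gemini", "openai", "github"]);

function supportsAudio(tier: TierConfig): boolean {
  // An explicit override wins; otherwise fall back to the provider default.
  return tier.supports_audio ?? NATIVE_AUDIO_PROVIDERS.has(tier.provider);
}

function routeAudio(tier: TierConfig, hasTranscriptionEndpoint: boolean): AudioRoute {
  if (supportsAudio(tier)) return "native";
  // Without a transcription endpoint the voice message cannot be processed.
  return hasTranscriptionEndpoint ? "transcribe" : "skip";
}
```

Under these assumptions, a `gemini` tier resolves to native passthrough, while an `anthropic` tier resolves to transcription unless `supports_audio: true` is set explicitly.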
### Audio Transcription
Configure a Whisper-compatible endpoint for models that don't support native audio:
```yaml
audio:
transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
transcription_api_key: "${WHISPER_API_KEY}" # Optional Bearer token
transcription_model: "whisper-1" # Model name (default: whisper-1)
transcription_provider: "openai" # Provider format: openai (default)
```
| Field | Required | Description |
|-------|----------|-------------|
| `transcription_endpoint` | yes | Whisper-compatible API endpoint |
| `transcription_api_key` | no | Bearer token for authentication |
| `transcription_model` | no | Model name sent in the request (default: `whisper-1`) |
| `transcription_provider` | no | API format: `openai` (default) |
Without an `audio` config, voice messages routed to models that lack native audio support are silently skipped.
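As a rough illustration of how the fields above could map onto a request, the sketch below builds (but does not send) a call in the OpenAI transcription format: a multipart POST with `file` and `model` fields and an optional Bearer token. The `AudioConfig` interface and `buildTranscriptionRequest` function are hypothetical names for this example, and it assumes a runtime with global `FormData` and `Blob` (such as Node 18+); Flynn's real client may differ.

```typescript
// Hypothetical sketch: maps the `audio` config fields to an
// OpenAI-style transcription request. Not taken from Flynn's source.

interface AudioConfig {
  transcription_endpoint: string;
  transcription_api_key?: string;
  transcription_model?: string;
}

interface TranscriptionRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: FormData };
}

function buildTranscriptionRequest(
  config: AudioConfig,
  audio: Blob,
  filename: string,
): TranscriptionRequest {
  const form = new FormData();
  form.append("file", audio, filename);
  // Falls back to the documented default model name.
  form.append("model", config.transcription_model ?? "whisper-1");

  const headers: Record<string, string> = {};
  if (config.transcription_api_key) {
    // The API key is optional and is sent as a Bearer token when present.
    headers["Authorization"] = `Bearer ${config.transcription_api_key}`;
  }

  // Content-Type is left unset so the runtime can add the multipart boundary.
  return { url: config.transcription_endpoint, init: { method: "POST", headers, body: form } };
}
```

The result could then be passed to `fetch(req.url, req.init)`; OpenAI-compatible servers typically reply with a JSON body containing a `text` field.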
## Telegram Commands
| Command | Description |
@@ -726,12 +768,12 @@ src/
├── hooks/ # Confirmation engine
├── mcp/ # MCP tool server integration
├── memory/ # Persistent memory store + vector search
├── models/ # Model providers + router + media pipeline
├── models/ # Model providers + router + media pipeline + audio capabilities
├── prompt/ # System prompt templating (auto-injects current date/time)
├── sandbox/ # Docker sandboxing
├── session/ # SQLite persistence
├── skills/ # Skill packages
├── tools/ # Builtin tools (shell, file, web, browser, process, media, system.info)
├── tools/ # Builtin tools (shell, file, web, browser, process, media, audio, system.info)
└── automation/ # Cron scheduler, webhooks, heartbeat monitor, Gmail watcher
```