docs: document native audio support across README, CHANGELOG, config, and planning docs
- README: add audio.transcribe to tool list, update media pipeline description, add Native Audio Support and Audio Transcription config sections, add supports_audio per-tier override example
- SOUL.md: add audio.transcribe to available tools list
- CHANGELOG: add native audio support and audio.transcribe tool entries
- config/default.yaml: add commented audio config section, supports_audio hint
- INTEGRATIONS.md: expand audio section with native passthrough, capabilities, smart routing, AudioSource type, token estimation, audio.transcribe tool
- STRUCTURE.md: add capabilities.ts and audio-transcribe.ts to key file listings
- ARCHITECTURE.md: update data flow step 5 to describe smart audio routing
@@ -12,10 +12,10 @@ Self-hosted personal AI assistant with Telegram and Terminal interfaces.
 - **Session Persistence**: SQLite-backed conversation history
 - **Fallback Chains**: Automatic failover when primary model fails
 - **Hook Engine**: Confirmation system for sensitive operations
-- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, system info
+- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, audio transcribe, system info
 - **Docker Sandboxing**: Per-session container isolation for tool execution
 - **Multi-Agent Routing**: Config-driven agent selection per sender/channel with tool profiles
-- **Media Pipeline**: Image analysis, outbound attachments, audio transcription across all channels
+- **Media Pipeline**: Image analysis, outbound attachments, audio transcription and native audio passthrough across all channels
 - **Session Transfer**: Move conversations between frontends
 - **CLI**: Full command-line interface (`flynn start`, `send`, `doctor`, `completion`, etc.)
 - **Shell Completion**: Auto-generated completions for bash, zsh, and fish with `--install` flag
@@ -143,6 +143,48 @@ models:
   local: { provider: ollama, model: qwen2.5:14b }
 ```

+### Native Audio Support
+
+Voice messages from channels can be handled in two ways:
+
+1. **Native passthrough** -- Audio is sent directly to models that support audio input (Gemini, OpenAI, GitHub). No transcription step is needed.
+2. **Whisper transcription** -- Audio is transcribed to text via a Whisper-compatible API, then sent as text to models that don't support audio input (Anthropic, Bedrock, Ollama, llama.cpp).
+
+Flynn routes automatically based on the model's capabilities. You can override this per tier:
+
+```yaml
+models:
+  default:
+    provider: gemini
+    model: gemini-2.0-flash
+    supports_audio: true   # Force native audio (auto-detected for known providers)
+  fast:
+    provider: anthropic
+    model: claude-sonnet-4
+    supports_audio: false  # Force transcription (default for Anthropic)
+```
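The routing rule described in this section can be sketched roughly as follows. This is a minimal illustration, not Flynn's actual `capabilities.ts`: the names `TierConfig`, `supportsAudio`, and `routeAudio` are hypothetical, and the provider set is taken only from the list in the README text. The `"skip"` outcome reflects the README's note that voice messages are skipped when no `audio` config exists.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not Flynn's real API.
type AudioRoute = "native" | "transcribe" | "skip";

interface TierConfig {
  provider: string;
  model: string;
  supports_audio?: boolean; // explicit per-tier override from the YAML config
}

// Providers the README lists as accepting audio input natively.
const NATIVE_AUDIO_PROVIDERS = new Set(["gemini", "openai", "github"]);

function supportsAudio(tier: TierConfig): boolean {
  // An explicit override wins; otherwise fall back to the provider default.
  return tier.supports_audio ?? NATIVE_AUDIO_PROVIDERS.has(tier.provider);
}

function routeAudio(tier: TierConfig, hasTranscriptionConfig: boolean): AudioRoute {
  if (supportsAudio(tier)) return "native";
  // Without a transcription endpoint there is nothing to fall back to.
  return hasTranscriptionConfig ? "transcribe" : "skip";
}
```

Using `??` rather than `||` matters here: an explicit `supports_audio: false` must override a provider that would otherwise be detected as audio-capable.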
+
+### Audio Transcription
+
+Configure a Whisper-compatible endpoint for models that don't support native audio:
+
+```yaml
+audio:
+  transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
+  transcription_api_key: "${WHISPER_API_KEY}"  # Optional Bearer token
+  transcription_model: "whisper-1"             # Model name (default: whisper-1)
+  transcription_provider: "openai"             # Provider format: openai (default)
+```
+
+| Field | Required | Description |
+|-------|----------|-------------|
+| `transcription_endpoint` | yes | Whisper-compatible API endpoint |
+| `transcription_api_key` | no | Bearer token for authentication |
+| `transcription_model` | no | Model name sent in the request (default: `whisper-1`) |
+| `transcription_provider` | no | API format: `openai` (default) |
+
+Without an `audio` config, voice messages bound for models without native audio support are silently skipped.
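As a rough illustration of how these fields might map onto a request in the `openai` format, the sketch below builds a multipart upload with `file` and `model` fields and an optional Bearer header, which is the shape OpenAI's transcription endpoint accepts. The `AudioConfig` interface and both function names are hypothetical and are not taken from Flynn's `audio-transcribe.ts`. It assumes a runtime with global `fetch`, `FormData`, and `Blob` (for example Node 18 or later).

```typescript
// Hypothetical sketch: AudioConfig mirrors the YAML fields; function names are illustrative.
interface AudioConfig {
  transcription_endpoint: string;
  transcription_api_key?: string;
  transcription_model?: string;
  transcription_provider?: "openai";
}

interface TranscriptionRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: FormData;
}

function buildTranscriptionRequest(
  cfg: AudioConfig,
  audio: Blob,
  filename: string,
): TranscriptionRequest {
  const body = new FormData();
  body.append("file", audio, filename);
  body.append("model", cfg.transcription_model ?? "whisper-1");

  // Content-Type is deliberately not set: fetch adds the multipart boundary itself.
  const headers: Record<string, string> = {};
  if (cfg.transcription_api_key) {
    headers["Authorization"] = `Bearer ${cfg.transcription_api_key}`;
  }
  return { url: cfg.transcription_endpoint, method: "POST", headers, body };
}

async function transcribe(cfg: AudioConfig, audio: Blob, filename: string): Promise<string> {
  const { url, method, headers, body } = buildTranscriptionRequest(cfg, audio, filename);
  const res = await fetch(url, { method, headers, body });
  if (!res.ok) throw new Error(`transcription failed: HTTP ${res.status}`);
  // The OpenAI-style JSON response carries the transcript in a `text` field.
  const data = (await res.json()) as { text?: string };
  return data.text ?? "";
}
```

Keeping request construction separate from the network call makes the field mapping easy to check without a running Whisper server.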
+
 ## Telegram Commands

 | Command | Description |
@@ -726,12 +768,12 @@ src/
 ├── hooks/ # Confirmation engine
 ├── mcp/ # MCP tool server integration
 ├── memory/ # Persistent memory store + vector search
-├── models/ # Model providers + router + media pipeline
+├── models/ # Model providers + router + media pipeline + audio capabilities
 ├── prompt/ # System prompt templating (auto-injects current date/time)
 ├── sandbox/ # Docker sandboxing
 ├── session/ # SQLite persistence
 ├── skills/ # Skill packages
-├── tools/ # Builtin tools (shell, file, web, browser, process, media, system.info)
+├── tools/ # Builtin tools (shell, file, web, browser, process, media, audio, system.info)
 └── automation/ # Cron scheduler, webhooks, heartbeat monitor, Gmail watcher
 ```