From 5c531a760d8880c42381e75d8ca92eae3a8da3db Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Wed, 11 Feb 2026 18:41:53 -0800
Subject: [PATCH] docs: document native audio support across README, CHANGELOG, config, and planning docs

- README: add audio.transcribe to tool list, update media pipeline description, add Native Audio Support and Audio Transcription config sections, add supports_audio per-tier override example
- SOUL.md: add audio.transcribe to available tools list
- CHANGELOG: add native audio support and audio.transcribe tool entries
- config/default.yaml: add commented audio config section, supports_audio hint
- INTEGRATIONS.md: expand audio section with native passthrough, capabilities, smart routing, AudioSource type, token estimation, audio.transcribe tool
- STRUCTURE.md: add capabilities.ts and audio-transcribe.ts to key file listings
- ARCHITECTURE.md: update data flow step 5 to describe smart audio routing
---
 .planning/codebase/ARCHITECTURE.md |  2 +-
 .planning/codebase/INTEGRATIONS.md | 16 ++++++++++
 .planning/codebase/STRUCTURE.md    |  4 +--
 CHANGELOG.md                       |  9 ++++++
 README.md                          | 50 +++++++++++++++++++++++++++---
 SOUL.md                            |  2 +-
 config/default.yaml                | 12 +++++++
 7 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md
index c656ec5..50959f0 100644
--- a/.planning/codebase/ARCHITECTURE.md
+++ b/.planning/codebase/ARCHITECTURE.md
@@ -161,7 +161,7 @@
 2. Adapter calls `onMessage()` callback → `ChannelRegistry.handleInbound()` routes to `MessageHandler`
 3. `createMessageRouter()` resolves agent config via `AgentRouter.resolve(channel, senderId)`
 4. `getOrCreateAgent()` creates/retrieves `AgentOrchestrator` for the session (cached by `channel:sender:agentConfig`)
-5. Audio attachments transcribed if present
+5. Audio routing: `supportsAudioInput()` checks provider capability: native audio passed through for Gemini/OpenAI/GitHub, transcribed via Whisper for others
 6. `orchestrator.process()` → injects memory context → checks compaction → delegates to `NativeAgent.process()`
 7. `NativeAgent.toolLoop()` → sends to `ModelRouter.chat()` → model returns response or tool calls
 8. If tool calls: `ToolExecutor.execute()` → policy check → hook check → tool execution → loop back to model
diff --git a/.planning/codebase/INTEGRATIONS.md b/.planning/codebase/INTEGRATIONS.md
index 8ab9479..e255090 100644
--- a/.planning/codebase/INTEGRATIONS.md
+++ b/.planning/codebase/INTEGRATIONS.md
@@ -234,6 +234,22 @@ All adapters implement `ChannelAdapter` interface (`src/channels/types.ts`): `co
 - Supported formats: OGG, MP3, WAV, WebM, MP4, M4A
 - Integration: Auto-transcribes audio attachments from channels before model processing

+**Native Audio Passthrough:**
+- Implementation: `src/models/capabilities.ts`, `src/daemon/routing.ts`
+- Capability check: `supportsAudioInput(provider, model, override?)` determines if a model can process raw audio
+- Audio-capable providers: Gemini (`inlineData`), OpenAI (`input_audio`), GitHub (`input_audio`)
+- Non-audio providers: Anthropic, Bedrock, Ollama, llama.cpp (fall back to Whisper transcription)
+- Config override: `supports_audio: true/false` per model tier overrides auto-detection
+- Smart routing: `createMessageRouter()` checks capability, passes raw `AudioSource` for capable models or transcribes via Whisper for others
+- Audio content types: `AudioSource` (`{ type: 'audio', data: string, mimeType: string }`) in `src/models/types.ts`
+- Token estimation: `estimateAudioTokens()` in `src/context/tokens.ts` (base64 length -> bytes -> duration at 16kbps -> tokens at 32/sec)
+
+**Agent Tool: audio.transcribe:**
+- Implementation: `src/tools/builtin/audio-transcribe.ts`
+- Transcribes audio files on-demand via the configured Whisper-compatible endpoint
+- Input: file path or base64 data with MIME type
+- Output: transcribed text
+
 ## MCP (Model Context Protocol)

 **MCP Client:**
diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md
index 87ae83e..6a65dff 100644
--- a/.planning/codebase/STRUCTURE.md
+++ b/.planning/codebase/STRUCTURE.md
@@ -150,7 +150,7 @@ flynn/
 **`src/models/`:**
 - Purpose: LLM provider client implementations and tier-based routing
 - Contains: Provider clients, `ModelRouter`, retry logic, cost estimation, media helpers
-- Key files: `src/models/types.ts` (core interfaces), `src/models/router.ts`, `src/models/anthropic.ts`, `src/models/openai.ts`, `src/models/gemini.ts`, `src/models/bedrock.ts`, `src/models/github.ts`, `src/models/retry.ts`, `src/models/costs.ts`, `src/models/media.ts`
+- Key files: `src/models/types.ts` (core interfaces), `src/models/router.ts`, `src/models/anthropic.ts`, `src/models/openai.ts`, `src/models/gemini.ts`, `src/models/bedrock.ts`, `src/models/github.ts`, `src/models/retry.ts`, `src/models/costs.ts`, `src/models/media.ts`, `src/models/capabilities.ts`

 **`src/models/local/`:**
 - Purpose: Local model provider clients
@@ -185,7 +185,7 @@ flynn/
 **`src/tools/builtin/`:**
 - Purpose: Built-in tool implementations shipped with Flynn
 - Contains: Shell exec, file operations, web fetch, memory ops, web search, media send, image analysis, session management, agent listing, cross-channel messaging, cron management
-- Key files: `src/tools/builtin/shell.ts`, `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-patch.ts`, `src/tools/builtin/file-list.ts`, `src/tools/builtin/web-fetch.ts`, `src/tools/builtin/web-search.ts`, `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`, `src/tools/builtin/media-send.ts`, `src/tools/builtin/image-analyze.ts`, `src/tools/builtin/system-info.ts`, `src/tools/builtin/sessions.ts`, `src/tools/builtin/agents-list.ts`, `src/tools/builtin/message-send.ts`, `src/tools/builtin/cron.ts`
+- Key files: `src/tools/builtin/shell.ts`, `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-patch.ts`, `src/tools/builtin/file-list.ts`, `src/tools/builtin/web-fetch.ts`, `src/tools/builtin/web-search.ts`, `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`, `src/tools/builtin/media-send.ts`, `src/tools/builtin/image-analyze.ts`, `src/tools/builtin/audio-transcribe.ts`, `src/tools/builtin/system-info.ts`, `src/tools/builtin/sessions.ts`, `src/tools/builtin/agents-list.ts`, `src/tools/builtin/message-send.ts`, `src/tools/builtin/cron.ts`

 **`src/tools/builtin/browser/`:**
 - Purpose: Puppeteer-based browser automation tools
diff --git a/CHANGELOG.md b/CHANGELOG.md
index ced0f91..59832a7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,15 @@ All notable changes to Flynn are documented in this file.

 ### Added

+- **Native Audio Support** -- Smart routing for voice messages: audio-capable models
+  (Gemini, OpenAI, GitHub) receive raw audio directly via `AudioSource` content parts;
+  non-audio models (Anthropic, Bedrock, Ollama, llama.cpp) get Whisper transcription
+  fallback. `supportsAudioInput()` capability check with per-model `supports_audio`
+  config override. Audio token estimation (base64 -> bytes -> duration -> tokens at
+  32 tokens/sec). 38 new tests (18 capabilities + 15 media + 5 token estimation).
+- **Agent Tool: audio.transcribe** -- Transcribe audio files via a Whisper-compatible
+  API endpoint. Configurable via `audio.transcription_endpoint`, supports OGG, MP3,
+  WAV, WebM, MP4, M4A formats.
 - **xAI (Grok) Provider** -- xAI as OpenAI-compatible model provider with `provider: xai`
   config. Supports grok-3, grok-3-mini, grok-2, grok-2-mini, grok-3-fast. Uses `XAI_API_KEY`
   env var or config `api_key`.
diff --git a/README.md b/README.md
index 49f2682..b9e4a5c 100644
--- a/README.md
+++ b/README.md
@@ -12,10 +12,10 @@ Self-hosted personal AI assistant with Telegram and Terminal interfaces.
 - **Session Persistence**: SQLite-backed conversation history
 - **Fallback Chains**: Automatic failover when primary model fails
 - **Hook Engine**: Confirmation system for sensitive operations
-- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, system info
+- **Tool Framework**: Shell, file, file patch, web-fetch, web-search, browser control, image analysis, media send, audio transcribe, system info
 - **Docker Sandboxing**: Per-session container isolation for tool execution
 - **Multi-Agent Routing**: Config-driven agent selection per sender/channel with tool profiles
-- **Media Pipeline**: Image analysis, outbound attachments, audio transcription across all channels
+- **Media Pipeline**: Image analysis, outbound attachments, audio transcription and native audio passthrough across all channels
 - **Session Transfer**: Move conversations between frontends
 - **CLI**: Full command-line interface (`flynn start`, `send`, `doctor`, `completion`, etc.)
 - **Shell Completion**: Auto-generated completions for bash, zsh, and fish with `--install` flag
@@ -143,6 +143,48 @@ models:
   local: { provider: ollama, model: qwen2.5:14b }
 ```

+### Native Audio Support
+
+Voice messages from channels can be handled in two ways:
+
+1. **Native passthrough** -- Audio sent directly to models that support audio input (Gemini, OpenAI, GitHub). No transcription step needed.
+2. **Whisper transcription** -- Audio transcribed to text via a Whisper-compatible API, then sent as text to models that don't support audio input (Anthropic, Bedrock, Ollama, llama.cpp).
+
+Flynn automatically routes based on the model's capabilities. You can override this per-tier:
+
+```yaml
+models:
+  default:
+    provider: gemini
+    model: gemini-2.0-flash
+    supports_audio: true   # Force native audio (auto-detected for known providers)
+  fast:
+    provider: anthropic
+    model: claude-sonnet-4
+    supports_audio: false  # Force transcription (default for Anthropic)
+```
+
+### Audio Transcription
+
+Configure a Whisper-compatible endpoint for models that don't support native audio:
+
+```yaml
+audio:
+  transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
+  transcription_api_key: "${WHISPER_API_KEY}"  # Optional Bearer token
+  transcription_model: "whisper-1"             # Model name (default: whisper-1)
+  transcription_provider: "openai"             # Provider format: openai (default)
+```
+
+| Field | Required | Description |
+|-------|----------|-------------|
+| `transcription_endpoint` | yes | Whisper-compatible API endpoint |
+| `transcription_api_key` | no | Bearer token for authentication |
+| `transcription_model` | no | Model name sent in the request (default: `whisper-1`) |
+| `transcription_provider` | no | API format: `openai` (default) |
+
+Without an `audio` config, voice messages for non-audio-capable models are silently skipped.
+
 ## Telegram Commands

 | Command | Description |
@@ -726,12 +768,12 @@ src/
 ├── hooks/          # Confirmation engine
 ├── mcp/            # MCP tool server integration
 ├── memory/         # Persistent memory store + vector search
-├── models/         # Model providers + router + media pipeline
+├── models/         # Model providers + router + media pipeline + audio capabilities
 ├── prompt/         # System prompt templating (auto-injects current date/time)
 ├── sandbox/        # Docker sandboxing
 ├── session/        # SQLite persistence
 ├── skills/         # Skill packages
-├── tools/          # Builtin tools (shell, file, web, browser, process, media, system.info)
+├── tools/          # Builtin tools (shell, file, web, browser, process, media, audio, system.info)
 └── automation/     # Cron scheduler, webhooks, heartbeat monitor, Gmail watcher
 ```

diff --git a/SOUL.md b/SOUL.md
index dbac901..7f9118f 100644
--- a/SOUL.md
+++ b/SOUL.md
@@ -55,7 +55,7 @@ You have tools for interacting with your operator's system:
 - **process.start / process.status / process.output / process.kill / process.list** -- Manage background processes.
 - **message.send** -- Send messages to other channels (Telegram, Discord, etc.).

-Additional tools (image.analyze, media.send, browser.*, gmail.*, calendar.*, sessions.*, agents.list) may be available depending on configuration. Check your tool definitions if unsure.
+Additional tools (image.analyze, media.send, audio.transcribe, browser.*, gmail.*, calendar.*, sessions.*, agents.list) may be available depending on configuration. Check your tool definitions if unsure.

 ## Tool Usage Rules

diff --git a/config/default.yaml b/config/default.yaml
index 71e9f37..56f1849 100644
--- a/config/default.yaml
+++ b/config/default.yaml
@@ -39,6 +39,7 @@ models:
   default:
     provider: anthropic
     model: claude-sonnet-4-20250514
+    # supports_audio: false  # Override native audio detection per tier
   local:
     provider: ollama
     model: glm-4.7-flash
@@ -117,3 +118,14 @@ hooks:
 #   peer: "123456789"
 #   failure_threshold: 2
 #   disk_threshold_mb: 100
+
+# ── Audio ────────────────────────────────────────────────────────────
+# Configure a Whisper-compatible endpoint for audio transcription.
+# Models that support native audio input (Gemini, OpenAI, GitHub) will
+# receive raw audio directly; others fall back to this endpoint.
+
+# audio:
+#   transcription_endpoint: "http://localhost:8080/v1/audio/transcriptions"
+#   transcription_api_key: "${WHISPER_API_KEY}"
+#   transcription_model: "whisper-1"
+#   transcription_provider: "openai"
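
The capability check and token estimate that the INTEGRATIONS.md hunk documents can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `src/models/capabilities.ts` or `src/context/tokens.ts`: the provider set is taken from the documentation in this patch, "16kbps" is read as 16,000 bits per second, rounding up is a guess, and the `model` argument is accepted but ignored, whereas the real check may well inspect it.

```typescript
// Hypothetical sketch: names follow the patch text, bodies are assumptions.

// Providers the patch lists as accepting raw audio input.
const AUDIO_CAPABLE_PROVIDERS: ReadonlySet<string> = new Set([
  "gemini",
  "openai",
  "github",
]);

// An explicit `supports_audio` override wins; otherwise fall back to
// provider-level auto-detection. `_model` is unused in this sketch.
function supportsAudioInput(
  provider: string,
  _model: string,
  override?: boolean,
): boolean {
  if (override !== undefined) return override;
  return AUDIO_CAPABLE_PROVIDERS.has(provider);
}

const ASSUMED_BITRATE_BPS = 16_000; // "16kbps" read as 16,000 bits per second
const TOKENS_PER_SECOND = 32; // rate stated in the patch

// base64 length -> bytes -> duration at the assumed bitrate -> tokens.
function estimateAudioTokens(base64: string): number {
  // Each '=' of padding stands for one byte that is not actually present.
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  const bytes = Math.floor((base64.length * 3) / 4) - padding;
  const seconds = (bytes * 8) / ASSUMED_BITRATE_BPS;
  return Math.ceil(seconds * TOKENS_PER_SECOND); // rounding up is an assumption
}
```

Under these assumptions, 8,000 base64 characters decode to 6,000 bytes, which is 3 seconds of audio at 16,000 bits per second and therefore 96 tokens. Putting the override first in `supportsAudioInput` matches the documented behaviour that `supports_audio` overrides auto-detection in either direction.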