API Reference
Voice Agent API
Beta
An end-to-end Swahili voice agent that combines speech recognition, an LLM dialogue layer, and natural Swahili speech synthesis behind a single endpoint.
Overview
The Voice Agent composes the SAUTI ASR, dialogue, and TTS stacks into a single conversational turn. You can drive it with text (skip the ASR step) or with audio (full pipeline). Each call returns both the generated reply and synthesized Swahili audio.
Three preconfigured scenarios are available: general, banking, and health.
Text mode
POST /v1/voice-agent/converse
The request body is JSON with the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The user's message in Swahili. |
| scenario | string | No | One of general, banking, health. Defaults to general. |
| voice_id | string | No | Voice to use for the synthesized reply. |
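The fields in the table above map directly onto a small client helper. The following is a minimal sketch using only the Python standard library, not an official SDK: the names `build_converse_request`, `converse`, `BASE_URL`, and `SCENARIOS` are hypothetical, and the 30-second timeout is an assumption. The scenario check mirrors the three documented values; the server may validate differently.

```python
import json
import urllib.request

BASE_URL = "https://sauti.finiflowlabs.com"
SCENARIOS = ("general", "banking", "health")  # documented scenario names


def build_converse_request(text, api_key, scenario="general", voice_id=None):
    """Build (but do not send) a POST request for /v1/voice-agent/converse."""
    if not text:
        raise ValueError("text is required")
    if scenario not in SCENARIOS:
        raise ValueError(f"scenario must be one of {SCENARIOS}")
    body = {"text": text, "scenario": scenario}
    if voice_id is not None:
        # voice_id is optional; omit it to let the service pick its default
        body["voice_id"] = voice_id
    return urllib.request.Request(
        BASE_URL + "/v1/voice-agent/converse",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def converse(text, api_key, **kwargs):
    """Send one text turn and return the parsed JSON response."""
    request = build_converse_request(text, api_key, **kwargs)
    # Timeout value is an assumption; tune it to your latency budget.
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload logic testable without network access; `converse("Habari, naomba salio langu.", "YOUR_KEY", scenario="banking")` would perform the actual call.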
```bash
curl -X POST https://sauti.finiflowlabs.com/v1/voice-agent/converse \
  -H "xi-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Habari, naomba salio langu.",
    "scenario": "banking",
    "voice_id": "sauti-swahili-v1"
  }'
```

Audio mode
POST /v1/voice-agent/converse/audio
Send the request as a multipart/form-data upload. This endpoint runs the full ASR → LLM → TTS pipeline. Each audio turn is limited to 25 MB.
| Field | Type | Required | Description |
|---|---|---|---|
| audio | file | Yes | User audio (WAV, MP3, WebM). Max 25 MB. |
| scenario | string | No | Defaults to general. |
| voice_id | string | No | Defaults to mms-swahili-v1. |
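The format and size constraints in the table above can be checked client-side before uploading. The sketch below builds a multipart/form-data body by hand with the standard library. It is an illustration under stated assumptions: `build_audio_form`, `MAX_AUDIO_BYTES`, and `AUDIO_TYPES` are hypothetical names, the MIME types for MP3 and WebM are the conventional ones rather than values confirmed by this reference, and the limit is interpreted as 25 decimal megabytes because the reference does not say whether it means decimal or binary units.

```python
import uuid
from pathlib import Path

# Assumption: "25 MB" is read as decimal megabytes, the stricter interpretation.
MAX_AUDIO_BYTES = 25 * 1000 * 1000
# Conventional MIME types for the three documented formats.
AUDIO_TYPES = {".wav": "audio/wav", ".mp3": "audio/mpeg", ".webm": "audio/webm"}


def build_audio_form(filename, audio_bytes, scenario="general", voice_id=None):
    """Return (body, content_type) for a multipart/form-data audio turn."""
    suffix = Path(filename).suffix.lower()
    if suffix not in AUDIO_TYPES:
        raise ValueError(f"unsupported audio format: {suffix or filename}")
    if len(audio_bytes) > MAX_AUDIO_BYTES:
        raise ValueError("audio exceeds the 25 MB per-turn limit")

    boundary = uuid.uuid4().hex
    safe_name = Path(filename).name
    fields = {"scenario": scenario}
    if voice_id is not None:
        fields["voice_id"] = voice_id

    parts = []
    # Plain text fields first, one part each.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    # The audio file part carries its own Content-Type header.
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="audio"; '
        f'filename="{safe_name}"\r\nContent-Type: {AUDIO_TYPES[suffix]}\r\n\r\n'
    )
    parts.append(header.encode("utf-8") + audio_bytes + b"\r\n")
    # Closing boundary terminates the form.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body and content type would be sent as the request data and `Content-Type` header, together with the `xi-api-key` header, in a POST to /v1/voice-agent/converse/audio.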
```bash
curl -X POST https://sauti.finiflowlabs.com/v1/voice-agent/converse/audio \
  -H "xi-api-key: YOUR_KEY" \
  -F "audio=@user_turn.wav;type=audio/wav" \
  -F "scenario=general" \
  -F "voice_id=sauti-swahili-v1"
```

Response
```json
{
  "user_text": "Habari, naomba salio langu.",
  "agent_text": "Habari! Salio lako ni shilingi elfu kumi na tano.",
  "audio_base64": "UklGRi4A...",
  "content_type": "audio/wav",
  "scenario": "banking",
  "voice_id": "sauti-swahili-v1"
}
```

Notes
- The dialogue layer is LLM-agnostic. Available backends include Claude, OpenAI, and Groq; the backend is selected server-side via configuration.
- Streaming and webhook delivery are on the roadmap. Today, the API returns one fully rendered turn per request.
- Try it interactively in the Voice Agent playground.
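
Because each request returns one complete turn, a client only needs to decode the `audio_base64` field of the response to obtain playable audio. The following is a minimal sketch of that step, assuming the response has the JSON shape shown under Response; the function name `parse_turn` is hypothetical.

```python
import base64


def parse_turn(payload):
    """Split a converse response into (agent_text, audio_bytes, content_type)."""
    # validate=True rejects characters outside the base64 alphabet instead of
    # silently skipping them, which surfaces truncated or corrupted payloads.
    audio = base64.b64decode(payload["audio_base64"], validate=True)
    return payload["agent_text"], audio, payload["content_type"]
```

A caller could write the returned bytes to a file (for example `reply.wav` when `content_type` is `audio/wav`) or hand them to an audio player.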