API Reference

Voice Agent API

Beta

An end-to-end Swahili voice agent that combines speech recognition, an LLM dialogue layer, and natural Swahili speech synthesis behind a single endpoint.

Overview

The Voice Agent composes the SAUTI ASR, dialogue, and TTS stacks into a single conversational turn. You can drive it with text (skip the ASR step) or with audio (full pipeline). Each call returns both the generated reply and synthesized Swahili audio.

Three preconfigured scenarios are available: general, banking, and health.

Text mode

POST /v1/voice-agent/converse

JSON body with the following fields:

Field      Type    Required  Description
text       string  Yes       The user's message in Swahili.
scenario   string  No        One of general, banking, health. Defaults to general.
voice_id   string  No        Voice to use for the synthesized reply.

bash
curl -X POST https://sauti.finiflowlabs.com/v1/voice-agent/converse \
  -H "xi-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Habari, naomba salio langu.",
    "scenario": "banking",
    "voice_id": "sauti-swahili-v1"
  }'
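
The same text-mode call can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names build_text_request and converse_text are hypothetical, and the client-side scenario check simply mirrors the three values listed in the field table above.

```python
import json
import urllib.request

BASE_URL = "https://sauti.finiflowlabs.com"
SCENARIOS = {"general", "banking", "health"}  # values listed in the field table


def build_text_request(api_key, text, scenario="general", voice_id=None):
    """Build (but do not send) a POST request for /v1/voice-agent/converse."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    payload = {"text": text, "scenario": scenario}
    if voice_id is not None:
        payload["voice_id"] = voice_id  # optional field; left out when unset
    return urllib.request.Request(
        f"{BASE_URL}/v1/voice-agent/converse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def converse_text(api_key, text, **kwargs):
    """Send the request and return the parsed JSON reply (performs a network call)."""
    request = build_text_request(api_key, text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without touching the network.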

Audio mode

POST /v1/voice-agent/converse/audio

Send the request as a multipart/form-data upload. The endpoint runs the full ASR → LLM → TTS pipeline. Each audio turn is limited to 25 MB.

Field      Type    Required  Description
audio      file    Yes       User audio (WAV, MP3, WebM). Max 25 MB.
scenario   string  No        Defaults to general.
voice_id   string  No        Defaults to mms-swahili-v1.

bash
curl -X POST https://sauti.finiflowlabs.com/v1/voice-agent/converse/audio \
  -H "xi-api-key: YOUR_KEY" \
  -F "audio=@user_turn.wav;type=audio/wav" \
  -F "scenario=general" \
  -F "voice_id=sauti-swahili-v1"
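
For clients without a multipart helper, the upload body can be assembled by hand. The sketch below uses only the Python standard library; encode_multipart and build_audio_request are hypothetical helper names, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption (the exact byte limit is not stated here).

```python
import uuid
import urllib.request

BASE_URL = "https://sauti.finiflowlabs.com"
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB


def encode_multipart(fields, audio_bytes, filename, audio_type="audio/wav"):
    """Return (body, content_type) for a multipart/form-data upload."""
    if len(audio_bytes) > MAX_AUDIO_BYTES:
        raise ValueError("audio exceeds the 25 MB per-turn limit")
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields such as scenario and voice_id.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    # The file part, sent under the field name "audio".
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="audio"; '
        f'filename="{filename}"\r\nContent-Type: {audio_type}\r\n\r\n'
    )
    parts.append(header.encode("utf-8") + audio_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing delimiter
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_audio_request(api_key, audio_bytes, filename, scenario="general", voice_id=None):
    """Build (but do not send) a POST request for /v1/voice-agent/converse/audio."""
    fields = {"scenario": scenario}
    if voice_id is not None:
        fields["voice_id"] = voice_id
    body, content_type = encode_multipart(fields, audio_bytes, filename)
    return urllib.request.Request(
        f"{BASE_URL}/v1/voice-agent/converse/audio",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": content_type},
        method="POST",
    )
```

Checking the size before building the body rejects an oversized recording locally instead of after a long upload. Passing the returned request to urllib.request.urlopen would send it.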

Response

json
{
  "user_text": "Habari, naomba salio langu.",
  "agent_text": "Habari! Salio lako ni shilingi elfu kumi na tano.",
  "audio_base64": "UklGRi4A...",
  "content_type": "audio/wav",
  "scenario": "banking",
  "voice_id": "sauti-swahili-v1"
}
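
Because the synthesized audio arrives base64-encoded inside the JSON body, a client has to decode it before playback. A minimal Python sketch follows; save_reply_audio is a hypothetical helper name, and the reply argument is assumed to be the parsed JSON object shown above.

```python
import base64
from pathlib import Path


def save_reply_audio(reply, path):
    """Decode reply["audio_base64"] and write the raw audio bytes to path.

    Returns the number of bytes written.
    """
    # validate=True rejects stray non-base64 characters instead of ignoring them.
    audio = base64.b64decode(reply["audio_base64"], validate=True)
    Path(path).write_bytes(audio)
    return len(audio)
```

With a content_type of audio/wav, as in the sample response, a .wav file extension is the natural choice for the output path.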

Notes

  • The dialogue layer is LLM-agnostic. Available backends include Claude, OpenAI, and Groq; the backend is selected server-side via configuration.
  • Streaming and webhook delivery are on the roadmap. Today the API returns one fully rendered turn per request.
  • Try it interactively in the Voice Agent playground.