
Voice Agent API

Beta

An end-to-end Swahili voice agent that combines speech recognition, an LLM dialogue layer, and natural Swahili speech synthesis behind a single API call.

Overview

The Voice Agent composes the SAUTI ASR, dialogue, and TTS stacks into a single conversational turn. You can drive it with text (skip the ASR step) or with audio (full pipeline). Each call returns both the generated reply and synthesized Swahili audio.

Three preconfigured scenarios are available: general, banking, and health.

Text mode

POST /v1/voice-agent/converse

JSON body with the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `text` | string | Yes | The user's message in Swahili. |
| `scenario` | string | No | One of `general`, `banking`, `health`. Defaults to `general`. |
| `voice_id` | string | No | Voice to use for the synthesized reply. |

```bash
curl -X POST https://sauti.finiflowlabs.com/v1/voice-agent/converse \
  -H "xi-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Habari, naomba salio langu.",
    "scenario": "banking",
    "voice_id": "sauti-swahili-v1"
  }'
```
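The same text-mode request can be sketched in Python using only the standard library. The helper names `build_text_request` and `converse_text` are illustrative and not part of the API; the client-side scenario check and the 30-second timeout are assumptions made here for convenience.

```python
import json
import urllib.request

BASE_URL = "https://sauti.finiflowlabs.com"
SCENARIOS = {"general", "banking", "health"}  # the three documented scenarios


def build_text_request(api_key, text, scenario="general", voice_id=None):
    """Build (but do not send) a POST request for /v1/voice-agent/converse."""
    if scenario not in SCENARIOS:
        # Client-side check (an assumption); the server may validate differently.
        raise ValueError(f"unknown scenario: {scenario!r}")
    payload = {"text": text, "scenario": scenario}
    if voice_id is not None:
        # voice_id is optional; omit it to let the server choose the voice.
        payload["voice_id"] = voice_id
    return urllib.request.Request(
        f"{BASE_URL}/v1/voice-agent/converse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def converse_text(api_key, text, **kwargs):
    """Send one text turn and return the parsed JSON reply."""
    request = build_text_request(api_key, text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `converse_text("YOUR_KEY", "Habari, naomba salio langu.", scenario="banking")` would send the same body as the curl example, minus the optional `voice_id`.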

Audio mode

POST /v1/voice-agent/converse/audio

Accepts a multipart/form-data upload and runs the full ASR → LLM → TTS pipeline. Each audio turn is limited to 25 MB.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `audio` | file | Yes | User audio (WAV, MP3, WebM). Max 25 MB. |
| `scenario` | string | No | Defaults to `general`. |
| `voice_id` | string | No | Defaults to `mms-swahili-v1`. |

```bash
curl -X POST https://sauti.finiflowlabs.com/v1/voice-agent/converse/audio \
  -H "xi-api-key: YOUR_KEY" \
  -F "audio=@user_turn.wav;type=audio/wav" \
  -F "scenario=general" \
  -F "voice_id=sauti-swahili-v1"
```
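For clients without a multipart helper, the form body can be assembled by hand. The following is a minimal sketch using only the Python standard library; `encode_audio_form` is a hypothetical helper name, and treating "25 MB" as 25 × 1024 × 1024 bytes is an assumption, since the exact byte limit is not stated.

```python
import uuid

# Assumption: "25 MB" is read as 25 MiB; the documentation does not say which.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def encode_audio_form(audio_bytes, filename, content_type="audio/wav",
                      scenario="general", voice_id=None):
    """Return (body, content_type_header) for /v1/voice-agent/converse/audio."""
    if len(audio_bytes) > MAX_AUDIO_BYTES:
        raise ValueError("audio exceeds the 25 MB per-turn limit")
    boundary = uuid.uuid4().hex  # random separator between form parts
    fields = {"scenario": scenario}
    if voice_id is not None:
        fields["voice_id"] = voice_id

    parts = []
    # Plain text fields: one part per field.
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    # The audio file part carries its own Content-Type header.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
         f"Content-Type: {content_type}\r\n\r\n").encode("utf-8")
    )
    parts.append(audio_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body and header value can then be sent as a POST with `urllib.request.Request`, setting `Content-Type` to the returned header and `xi-api-key` to your key.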

Response

```json
{
  "user_text": "Habari, naomba salio langu.",
  "agent_text": "Habari! Salio lako ni shilingi elfu kumi na tano.",
  "audio_base64": "UklGRi4A...",
  "content_type": "audio/wav",
  "scenario": "banking",
  "voice_id": "sauti-swahili-v1"
}
```
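Because the synthesized audio arrives base64-encoded inside the JSON, a client has to decode it before playback. A minimal sketch, assuming the response has already been parsed into a dict; `save_reply_audio` is an illustrative name, and the `.bin` fallback for content types other than `audio/wav` is an assumption, since only `audio/wav` appears in the example above.

```python
import base64
from pathlib import Path

# Only audio/wav is shown in the example response; other types fall back to .bin.
EXTENSIONS = {"audio/wav": ".wav"}


def save_reply_audio(reply, stem):
    """Decode reply["audio_base64"] and write it next to `stem`; return the path."""
    audio = base64.b64decode(reply["audio_base64"])
    suffix = EXTENSIONS.get(reply.get("content_type"), ".bin")
    path = Path(stem).with_suffix(suffix)
    path.write_bytes(audio)
    return path
```

With the example response, `save_reply_audio(reply, "reply")` would write the decoded bytes to `reply.wav`, and `reply["agent_text"]` holds the matching transcript.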

Notes

The dialogue layer is LLM-agnostic. Available backends include Claude, OpenAI, and Groq, and the active one is selected server-side via configuration.
Streaming and webhook delivery are on the roadmap. Today the API returns one fully rendered turn per request.
Try it interactively in the Voice Agent playground.