API Reference

TTS — Text-to-Speech

Live

Convert Swahili text into natural-sounding audio. Returns base64-encoded WAV in a JSON response.

Endpoint

POST /v1/text-to-speech/{voice_id}

Accepts a JSON body and returns a JSON response containing base64-encoded WAV audio. Text longer than 2,000 characters is processed asynchronously: instead of audio, the endpoint returns a 202 response with a job_id. See the Async Jobs reference for polling.

Authentication

Pass your API key in the xi-api-key header. See Authentication.

Path parameters

Parameter	Type	Description
`voice_id`	string	The voice to use for synthesis. Currently `sauti-swahili-v1`. See Voices for available voices.

Request body

Field	Type	Required	Description
`text`	string	Yes	The Swahili text to synthesise. Maximum 5,000 characters. Texts over 2,000 characters are processed asynchronously.
`voice_settings`	object	No	Optional synthesis parameters. Contains `speaking_rate` (default 1.0), `noise_scale` (default 0.667), and `noise_scale_duration` (default 0.8).
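
The limits and defaults in the table above can be checked client-side before a request is sent. The following is a minimal sketch, not part of the API: the helper names `build_payload` and `will_run_async` are hypothetical, rejecting unknown `voice_settings` keys is an assumption about what a client might want, and Python's `len` may not count characters exactly the way the server does.

```python
from typing import Any, Optional

MAX_TEXT_CHARS = 5000    # documented maximum length of `text`
ASYNC_THRESHOLD = 2000   # texts longer than this are processed asynchronously

# Documented defaults for the optional voice_settings object.
DEFAULT_VOICE_SETTINGS = {
    "speaking_rate": 1.0,
    "noise_scale": 0.667,
    "noise_scale_duration": 0.8,
}


def build_payload(text: str, voice_settings: Optional[dict] = None) -> dict[str, Any]:
    """Build a request body, failing early on text over the documented limit."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"text has {len(text)} characters; maximum is {MAX_TEXT_CHARS}"
        )

    payload: dict[str, Any] = {"text": text}
    if voice_settings:
        # Assumption: treat unrecognised keys as a client-side mistake.
        unknown = set(voice_settings) - set(DEFAULT_VOICE_SETTINGS)
        if unknown:
            raise ValueError(f"unknown voice_settings keys: {sorted(unknown)}")
        # Overlay the caller's overrides on the documented defaults.
        payload["voice_settings"] = {**DEFAULT_VOICE_SETTINGS, **voice_settings}
    return payload


def will_run_async(text: str) -> bool:
    """True when the documented threshold suggests a 202 response."""
    return len(text) > ASYNC_THRESHOLD
```

Calling `build_payload("Karibu sana.", {"speaking_rate": 1.2})` yields a body with all three settings filled in, which can be passed directly as the `json=` argument of an HTTP client.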

Response

On success, texts of up to 2,000 characters return HTTP 200 with a JSON body containing the audio. Longer texts return HTTP 202 with a job_id for polling.
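
Because one endpoint can answer with either status, client code usually needs to branch on it. The sketch below shows one way to do that; `TtsResult` and `interpret_response` are hypothetical names, and it assumes the 202 body carries the identifier in a field literally named `job_id` (the full 202 schema is described in the Async Jobs reference, not here).

```python
import base64
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TtsResult:
    """Exactly one of the two fields is set."""
    audio: Optional[bytes] = None   # decoded WAV bytes (HTTP 200)
    job_id: Optional[str] = None    # identifier to poll (HTTP 202)


def interpret_response(status_code: int, body: dict[str, Any]) -> TtsResult:
    """Map a status code and parsed JSON body to a TtsResult."""
    if status_code == 200:
        # Synchronous result: the audio is inline, base64-encoded.
        return TtsResult(audio=base64.b64decode(body["audio_base64"]))
    if status_code == 202:
        # Asynchronous result: no audio yet, only a job to poll.
        return TtsResult(job_id=body["job_id"])
    raise RuntimeError(f"unexpected status {status_code}")
```

With the `requests` library this could be called as `interpret_response(response.status_code, response.json())`.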

200 Response schema

Field	Type	Description
`audio_base64`	string	Base64-encoded WAV audio data
`content_type`	string	Always `audio/wav`
`duration_seconds`	number	Duration of the generated audio in seconds
`characters_synthesized`	integer	Number of characters that were synthesised
`voice_id`	string	The voice used for synthesis

json

{
  "audio_base64": "UklGRi4A...",
  "content_type": "audio/wav",
  "duration_seconds": 1.84,
  "characters_synthesized": 42,
  "voice_id": "sauti-swahili-v1"
}
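
A quick sanity check on the `audio_base64` field is to decode it and confirm the result is a RIFF/WAVE container. This is a minimal sketch; `decode_wav` is a hypothetical helper name, and the check covers only the 12-byte container header, not the audio data itself.

```python
import base64


def decode_wav(audio_base64: str) -> bytes:
    """Decode the audio_base64 field and verify the RIFF/WAVE magic bytes."""
    raw = base64.b64decode(audio_base64, validate=True)
    # A WAV file starts with b"RIFF", a 4-byte chunk size, then b"WAVE".
    if len(raw) < 12 or raw[:4] != b"RIFF" or raw[8:12] != b"WAVE":
        raise ValueError("decoded data is not a RIFF/WAVE file")
    return raw
```

The `UklGR` prefix in the sample response is what the bytes `RIFF` look like after base64 encoding, so a payload that does not begin this way is unlikely to be valid WAV audio.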

Example request

bash

curl -X POST https://sauti.finiflowlabs.com/v1/text-to-speech/sauti-swahili-v1 \
  -H "xi-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Habari za asubuhi. Unafanya kazi vizuri.",
    "voice_settings": {
      "speaking_rate": 1.0,
      "noise_scale": 0.667,
      "noise_scale_duration": 0.8
    }
  }'

python

import base64

import requests

response = requests.post(
    "https://sauti.finiflowlabs.com/v1/text-to-speech/sauti-swahili-v1",
    headers={"xi-api-key": "YOUR_KEY"},
    json={
        "text": "Karibu sana.",
        "voice_settings": {
            "speaking_rate": 1.0,
            "noise_scale": 0.667,
            "noise_scale_duration": 0.8,
        },
    },
    timeout=30,  # fail instead of hanging indefinitely on network problems
)
# Raise an exception for 4xx/5xx responses such as a 422 validation failure.
response.raise_for_status()

# Short texts return the audio inline as base64; decode it to raw WAV bytes.
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print(f"Saved {data['duration_seconds']}s of audio")

Model details

Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) with full fine-tuning.
Training data: Google WAXAL swa_tts — 1,387 training utterances (1,778 total across splits), recorded by native Kiswahili speakers.
Output sample rate: 16 kHz mono WAV.
Latency: under 500 ms for inputs up to 200 characters on standard infrastructure. For longer texts, latency scales linearly with input length.
Naturalness benchmark: improved MOS (Mean Opinion Score) over multilingual baselines on Swahili test sentences.
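
The output format listed above can be confirmed on decoded audio with Python's standard `wave` module. The sketch below assumes the WAV is uncompressed PCM, which is what the `wave` module reads; `describe_wav` is a hypothetical helper name.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read sample rate, channel count and duration from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            # Duration is the number of frames divided by frames per second.
            "duration_seconds": frames / rate,
        }
```

For audio from this endpoint, the result would be expected to show a `sample_rate` of 16000 and a `channels` value of 1, with `duration_seconds` close to the value reported in the JSON response.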

Error responses

Validation failures return HTTP 422 with a JSON error body. See the full Error Reference.

json

{
  "type": "https://sauti.finiflowlabs.com/errors/unprocessable_content",
  "title": "Unprocessable Content",
  "status": 422,
  "detail": "Field 'text' exceeds the maximum length of 5000 characters."
}
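
The four fields in this error body can be turned into a typed exception so that callers see the `detail` message rather than a bare status code. This is a sketch under assumptions: `SautiApiError` and `error_from_body` are hypothetical names, and the fallbacks for missing fields are a defensive choice, not documented behaviour.

```python
from typing import Any


class SautiApiError(Exception):
    """Hypothetical exception wrapping an error body from the API."""

    def __init__(self, status: int, title: str, detail: str, type_uri: str) -> None:
        super().__init__(f"{status} {title}: {detail}")
        self.status = status
        self.title = title
        self.detail = detail
        self.type_uri = type_uri


def error_from_body(status_code: int, body: dict[str, Any]) -> SautiApiError:
    """Build an exception from a parsed error body, tolerating missing fields."""
    return SautiApiError(
        status=body.get("status", status_code),
        title=body.get("title", "Error"),
        detail=body.get("detail", ""),
        type_uri=body.get("type", ""),
    )
```

A caller could check `response.ok` and, when it is false, raise `error_from_body(response.status_code, response.json())`.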