API Reference

TTS — Text-to-Speech

Status: Live

Convert Swahili text into natural-sounding audio. Returns base64-encoded WAV in a JSON response.

Endpoint

POST /v1/text-to-speech/{voice_id}

Accepts a JSON body and returns a JSON response containing base64-encoded WAV audio. Text longer than 2,000 characters is processed asynchronously: the endpoint returns a 202 response with a job_id instead of audio. See the Async Jobs reference for how to poll for the result.
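The routing rule above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `expected_mode` is a hypothetical helper, the 5,000-character maximum comes from the request body description, and it assumes the server counts characters the same way Python's `len()` does, which this page does not state.

```python
SYNC_LIMIT = 2_000  # texts longer than this are processed asynchronously
MAX_LENGTH = 5_000  # documented maximum length of the `text` field


def expected_mode(text: str) -> str:
    """Predict how the endpoint should handle `text`, using the documented limits.

    Assumption: the server counts characters the way len() does (code points).
    """
    n = len(text)
    if n > MAX_LENGTH:
        return "rejected"  # expect a 422 validation error
    if n > SYNC_LIMIT:
        return "async"     # expect 202 with a job_id
    return "sync"          # expect 200 with audio_base64
```

A caller could use the result to decide up front whether to set up polling or to split the text into smaller requests.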

Authentication

Pass your API key in the xi-api-key header. See Authentication.
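One way to keep the key out of source code is to read it from the environment when building headers. This is a sketch under assumptions: the `SAUTI_API_KEY` variable name and the `auth_headers` helper are illustrative and not defined by the API. Only the `xi-api-key` header name comes from this page.

```python
import os


def auth_headers(api_key=None):
    """Build request headers carrying the API key.

    Falls back to the SAUTI_API_KEY environment variable (a hypothetical name).
    """
    key = api_key or os.environ.get("SAUTI_API_KEY")
    if not key:
        raise RuntimeError("No API key given and SAUTI_API_KEY is not set")
    return {
        "xi-api-key": key,
        "Content-Type": "application/json",
    }
```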

Path parameters

| Parameter | Type | Description |
|---|---|---|
| voice_id | string | The voice to use for synthesis. Currently sauti-swahili-v1. See Voices for available voices. |

Request body

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The Swahili text to synthesise. Maximum 5,000 characters. Texts over 2,000 characters are processed asynchronously. |
| voice_settings | object | No | Optional synthesis parameters. Contains speaking_rate (default 1.0), noise_scale (default 0.667), and noise_scale_duration (default 0.8). |
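The request body can be assembled with a small helper that fills in the documented defaults. This is a sketch only: `build_payload` is a hypothetical name, and because this page does not say whether a partial voice_settings object is accepted, the helper always sends all three settings when any one is overridden. Rejecting unknown setting names is a client-side convenience, not documented server behaviour.

```python
DEFAULT_VOICE_SETTINGS = {
    "speaking_rate": 1.0,
    "noise_scale": 0.667,
    "noise_scale_duration": 0.8,
}


def build_payload(text, **overrides):
    """Return a request body dict for the text-to-speech endpoint."""
    unknown = set(overrides) - set(DEFAULT_VOICE_SETTINGS)
    if unknown:
        raise ValueError(f"Unknown voice settings: {sorted(unknown)}")
    payload = {"text": text}
    if overrides:
        # Send the complete object: documented defaults plus the caller's overrides.
        payload["voice_settings"] = {**DEFAULT_VOICE_SETTINGS, **overrides}
    return payload
```

For example, `build_payload("Karibu sana.", speaking_rate=1.2)` keeps the default noise settings and changes only the speaking rate.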

Response

On success, short texts return HTTP 200 with a JSON body containing the audio. Long texts (>2,000 chars) return HTTP 202 with a job_id for polling.
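Client code therefore needs to branch on the status code before reading the body. A minimal sketch of that branch follows. `interpret_response` is a hypothetical helper, and it assumes the 202 body exposes the job identifier under a top-level `job_id` key, as the description above suggests.

```python
def interpret_response(status_code: int, body: dict) -> tuple:
    """Classify a successful response as inline audio or an async job.

    Returns ("audio", <base64 string>) for 200 and ("job", <job id>) for 202.
    """
    if status_code == 200:
        return ("audio", body["audio_base64"])
    if status_code == 202:
        # Assumption: the job identifier is a top-level "job_id" field.
        return ("job", body["job_id"])
    raise ValueError(f"Unexpected status code: {status_code}")
```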

200 Response schema

| Field | Type | Description |
|---|---|---|
| audio_base64 | string | Base64-encoded WAV audio data |
| content_type | string | Always audio/wav |
| duration_seconds | number | Duration of the generated audio in seconds |
| characters_synthesized | integer | Number of characters that were synthesised |
| voice_id | string | The voice used for synthesis |

json
{
  "audio_base64": "UklGRi4A...",
  "content_type": "audio/wav",
  "duration_seconds": 1.84,
  "characters_synthesized": 42,
  "voice_id": "sauti-swahili-v1"
}
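Before writing the audio to disk, it can be worth checking that the decoded bytes really are a WAV file. The sketch below uses only the fields shown in the schema above. `decode_audio` is a hypothetical helper, and the header check relies on the standard RIFF/WAVE container layout rather than on anything specific to this API.

```python
import base64


def decode_audio(data: dict) -> bytes:
    """Decode audio_base64 from a 200 response and sanity-check the WAV header."""
    if data.get("content_type") != "audio/wav":
        raise ValueError(f"Unexpected content_type: {data.get('content_type')!r}")
    # validate=True rejects characters outside the base64 alphabet.
    audio = base64.b64decode(data["audio_base64"], validate=True)
    # A WAV file is a RIFF container: bytes 0-3 are "RIFF", bytes 8-11 are "WAVE".
    if audio[:4] != b"RIFF" or audio[8:12] != b"WAVE":
        raise ValueError("Decoded data does not look like a WAV file")
    return audio
```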

Example request

bash
curl -X POST https://sauti.finiflowlabs.com/v1/text-to-speech/sauti-swahili-v1 \
  -H "xi-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Habari za asubuhi. Unafanya kazi vizuri.",
    "voice_settings": {
      "speaking_rate": 1.0,
      "noise_scale": 0.667,
      "noise_scale_duration": 0.8
    }
  }'

python
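import base64

import requests

response = requests.post(
    "https://sauti.finiflowlabs.com/v1/text-to-speech/sauti-swahili-v1",
    headers={"xi-api-key": "YOUR_KEY"},
    # json= serialises the body and sets Content-Type: application/json.
    json={
        "text": "Karibu sana.",
        "voice_settings": {
            "speaking_rate": 1.0,
            "noise_scale": 0.667,
            "noise_scale_duration": 0.8,
        },
    },
    timeout=30,  # seconds; without a timeout, requests can wait indefinitely
)
response.raise_for_status()  # raises on 4xx/5xx, e.g. a 422 validation error

# Short texts return 200 with audio_base64. Texts over 2,000 characters
# return 202 with a job_id instead, so this lookup only works for short texts.
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

print(f"Saved {data['duration_seconds']}s of audio")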

Model details

  • Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) with full fine-tuning.
  • Training data: Google WAXAL swa_tts, with 1,387 training utterances (1,778 total across splits), recorded by native Kiswahili speakers.
  • Output sample rate: 16 kHz mono WAV.
  • Latency: under 500 ms for inputs up to 200 characters on standard infrastructure. For longer texts, latency scales linearly with input length.
  • Naturalness benchmark: improved MOS (Mean Opinion Score) over multilingual baselines on Swahili test sentences.
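The output format listed above can be confirmed on a decoded response with Python's standard `wave` module. This is a sketch under assumptions: `wav_info` and `make_silence` are hypothetical helpers, the audio is assumed to be uncompressed PCM (which `wave` requires), and the 16-bit sample width used for the test signal is a guess, since this page states only the sample rate and channel count.

```python
import io
import wave


def wav_info(audio_bytes: bytes) -> dict:
    """Read sample rate, channel count, and duration from WAV bytes."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / rate,
        }


def make_silence(seconds: float, sample_rate: int = 16_000) -> bytes:
    """Build a silent mono WAV so wav_info can be tried without calling the API."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples: an assumption, not stated in the docs
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()
```

On a real response, the `duration_seconds` computed here can be compared with the `duration_seconds` field returned by the API.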

Error responses

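Validation failures, such as a text field longer than 5,000 characters, return HTTP 422 with a JSON error body. See the Error Reference for the full list of errors.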

json
{
  "type": "https://sauti.finiflowlabs.com/errors/unprocessable_content",
  "title": "Unprocessable Content",
  "status": 422,
  "detail": "Field 'text' exceeds the maximum length of 5000 characters."
}
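An error body like the one above can be turned into an exception that carries its fields. This is a minimal sketch: `SautiAPIError` and `error_from_body` are hypothetical names, and it assumes other error statuses use the same four fields as the 422 example, which only the Error Reference can confirm.

```python
class SautiAPIError(Exception):
    """Error built from the API's JSON error body (hypothetical client-side class)."""

    def __init__(self, status, title, detail, type_uri):
        super().__init__(f"{status} {title}: {detail}")
        self.status = status
        self.title = title
        self.detail = detail
        self.type_uri = type_uri


def error_from_body(status_code: int, body: dict) -> SautiAPIError:
    """Build an exception from an error body, tolerating missing fields."""
    return SautiAPIError(
        status=body.get("status", status_code),
        title=body.get("title", "Unknown error"),
        detail=body.get("detail", ""),
        type_uri=body.get("type", ""),
    )
```

A caller would typically raise the result when the response status is 400 or higher, instead of relying on a generic HTTP error that hides the detail message.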