API Reference
TTS — Text-to-Speech
Convert Swahili text into natural-sounding audio. Returns base64-encoded WAV in a JSON response.
Endpoint
POST /v1/text-to-speech/{voice_id}
Accepts a JSON body and returns a JSON response containing base64-encoded WAV audio. Text longer than 2,000 characters is processed asynchronously: instead of audio, you receive a 202 response with a job_id. See the Async Jobs reference for polling.
Authentication
Pass your API key in the xi-api-key header. See Authentication.
Path parameters
| Parameter | Type | Description |
|---|---|---|
| voice_id | string | The voice to use for synthesis. Currently sauti-swahili-v1. See Voices for available voices. |
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The Swahili text to synthesise. Maximum 5,000 characters. Texts over 2,000 characters are processed asynchronously. |
| voice_settings | object | No | Optional synthesis parameters. Contains speaking_rate (default 1.0), noise_scale (default 0.667), and noise_scale_duration (default 0.8). |
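The fields above can be assembled and checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the helper name `build_tts_payload` and its `will_be_async` return value are illustrative, the defaults mirror the table, and `len(text)` is assumed to approximate how the server counts characters.

```python
MAX_CHARS = 5000         # documented maximum length of `text`
ASYNC_THRESHOLD = 2000   # texts longer than this are processed asynchronously


def build_tts_payload(text, speaking_rate=1.0, noise_scale=0.667,
                      noise_scale_duration=0.8):
    """Return (payload, will_be_async) for a text-to-speech request.

    Hypothetical client-side helper; the server still validates the request.
    """
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    payload = {
        "text": text,
        "voice_settings": {
            "speaking_rate": speaking_rate,
            "noise_scale": noise_scale,
            "noise_scale_duration": noise_scale_duration,
        },
    }
    # Flag requests that should expect a 202 with a job_id rather than audio.
    return payload, len(text) > ASYNC_THRESHOLD
```

The returned payload can be passed as the JSON body of the POST request, and the flag tells the caller which response shape to expect.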
Response
On success, short texts return HTTP 200 with a JSON body containing the audio. Long texts (>2,000 chars) return HTTP 202 with a job_id for polling.
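Because the same endpoint can answer with either status, client code typically branches on the status code. A minimal sketch follows, assuming the 202 body carries the job_id at the top level; the function name `interpret_tts_response` is hypothetical, and polling itself is left to the Async Jobs reference.

```python
import base64


def interpret_tts_response(status_code, body):
    """Classify a parsed response as finished audio or a pending async job.

    Returns ("audio", wav_bytes) for HTTP 200 and ("job", job_id) for HTTP 202.
    """
    if status_code == 200:
        # Short texts: the audio is delivered inline as base64-encoded WAV.
        return "audio", base64.b64decode(body["audio_base64"])
    if status_code == 202:
        # Long texts: only a job identifier is returned (assumed top-level key).
        return "job", body["job_id"]
    raise RuntimeError(f"unexpected status {status_code}: {body}")
```

With the requests library, this could be called as `interpret_tts_response(response.status_code, response.json())`.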
200 Response schema
| Field | Type | Description |
|---|---|---|
| audio_base64 | string | Base64-encoded WAV audio data |
| content_type | string | Always audio/wav |
| duration_seconds | number | Duration of the generated audio in seconds |
| characters_synthesized | integer | Number of characters that were synthesised |
| voice_id | string | The voice used for synthesis |
```json
{
  "audio_base64": "UklGRi4A...",
  "content_type": "audio/wav",
  "duration_seconds": 1.84,
  "characters_synthesized": 42,
  "voice_id": "sauti-swahili-v1"
}
```
Example request
```bash
curl -X POST https://sauti.finiflowlabs.com/v1/text-to-speech/sauti-swahili-v1 \
  -H "xi-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Habari za asubuhi. Unafanya kazi vizuri.",
    "voice_settings": {
      "speaking_rate": 1.0,
      "noise_scale": 0.667,
      "noise_scale_duration": 0.8
    }
  }'
```
```python
import base64

import requests

# Request synthesis; replace YOUR_KEY with your API key.
response = requests.post(
    "https://sauti.finiflowlabs.com/v1/text-to-speech/sauti-swahili-v1",
    headers={"xi-api-key": "YOUR_KEY"},
    json={
        "text": "Karibu sana.",
        "voice_settings": {
            "speaking_rate": 1.0,
            "noise_scale": 0.667,
            "noise_scale_duration": 0.8,
        },
    },
)
response.raise_for_status()

# Decode the base64 payload and write it out as a WAV file.
data = response.json()
audio_bytes = base64.b64decode(data["audio_base64"])
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
print(f"Saved {data['duration_seconds']}s of audio")
```
Model details
- Architecture: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) with full fine-tuning.
- Training data: Google WAXAL swa_tts, 1,387 training utterances (1,778 total across splits), recorded by native Kiswahili speakers.
- Output sample rate: 16 kHz mono WAV.
- Latency: under 500ms for inputs up to 200 characters on standard infrastructure. Longer texts scale linearly.
- Naturalness benchmark: improved MOS (Mean Opinion Score) over multilingual baselines on Swahili test sentences.
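The stated output format (16 kHz, mono) can be checked on a decoded response with Python's standard wave module. This is a sketch under the assumption that the audio is uncompressed PCM WAV, which is what the wave module reads; the helper name `describe_wav` is illustrative.

```python
import base64
import io
import wave


def describe_wav(audio_base64):
    """Decode base64 WAV data and return (sample_rate, channels, seconds)."""
    wav_bytes = base64.b64decode(audio_base64)
    # wave.open accepts a file-like object, so no temporary file is needed.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    return rate, channels, seconds
```

Calling `describe_wav(data["audio_base64"])` on a 200 response would be expected to report a rate of 16000 and one channel, with a duration close to duration_seconds.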
Error responses
Validation failures return 422. See the full Error Reference.
```json
{
  "type": "https://sauti.finiflowlabs.com/errors/unprocessable_content",
  "title": "Unprocessable Content",
  "status": 422,
  "detail": "Field 'text' exceeds the maximum length of 5000 characters."
}
```
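An error body of this shape can be turned into a Python exception so that callers see the title and detail together. The sketch below assumes other errors use the same four fields as the 422 example; `SautiAPIError` and `error_from_body` are hypothetical names, not part of any published client.

```python
class SautiAPIError(Exception):
    """Hypothetical exception wrapping an error body returned by the API."""

    def __init__(self, status, title, detail, type_url):
        super().__init__(f"{status} {title}: {detail}")
        self.status = status
        self.title = title
        self.detail = detail
        self.type_url = type_url


def error_from_body(body):
    """Build a SautiAPIError from a parsed JSON error body."""
    # Missing fields fall back to empty values instead of raising KeyError.
    return SautiAPIError(
        status=body.get("status"),
        title=body.get("title", ""),
        detail=body.get("detail", ""),
        type_url=body.get("type", ""),
    )
```

A caller might check `response.ok` and, when it is false, raise `error_from_body(response.json())` in place of the generic HTTP error.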