SAUTI ASR v1: Fine-tuning Whisper for Swahili — from 27% to 13.5% WER

Off-the-shelf multilingual ASR gets roughly one Swahili word in four wrong. We fine-tuned Whisper-medium on local Swahili data and halved the word error rate, from 27.2% to 13.5%. Here is what we learned.

Overview

SAUTI ASR v1 is our Swahili speech recognition model, built by fine-tuning Whisper-medium. The zero-shot multilingual baseline had a word error rate (WER) of 27.2%, roughly one word in four. Targeted fine-tuning on Swahili speech data cut that to 13.5%.

Why it matters

Voice AI only works if it can understand what people say. For Swahili, spoken by over 200 million people, off-the-shelf models fall short. SAUTI ASR v1 brings Swahili speech recognition to a level where voice agents, transcription services, and real-time translation become viable.

Results

System	Word Error Rate
Multilingual baseline (zero-shot)	27.2%
SAUTI ASR v1 (fine-tuned)	13.5%

That is a relative reduction in word error rate of about 50% (27.2% down to 13.5%), which moves Swahili ASR from unusable to production-viable.
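For readers who want to check the arithmetic, here is a minimal sketch of how word error rate and the relative reduction are usually computed. It assumes the standard definition (word-level edit distance divided by the number of reference words) and plain whitespace tokenization. The text normalization behind the figures in the table is not described here, so treat this as an illustration and not as our evaluation script. The Swahili sentence is a made-up example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# One wrong word out of four gives a WER of 0.25 (illustrative sentence).
print(word_error_rate("habari ya asubuhi rafiki", "habari za asubuhi rafiki"))

# WER figures from the results table.
print(f"{relative_reduction(27.2, 13.5):.1%}")  # prints 50.4%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.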

Availability

SAUTI ASR v1 is published on [Hugging Face](https://huggingface.co/Finiflowlabs/sauti-asr-v1) under an open license. Try it in the [Speech to Text playground](/speech-to-text). The model powers the ASR stage of our voice agent and real-time translation pipelines.
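As a starting point for trying the model locally, the sketch below loads it through the `transformers` speech recognition pipeline. It assumes the repository is published as a standard Whisper checkpoint that the pipeline can load directly, and that `transformers`, a backend such as PyTorch, and `ffmpeg` (for decoding audio files) are installed. The repository name comes from the link above; the `transcribe` helper and the `sample.wav` path are hypothetical.

```python
MODEL_ID = "Finiflowlabs/sauti-asr-v1"  # repository name from the link above


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file (hypothetical helper, not an official API)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    # Assumption: the checkpoint loads as a regular Whisper-style ASR model.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)
    return result["text"]


# Example usage (downloads the model on first call):
# print(transcribe("sample.wav"))
```

Building the pipeline on every call keeps the sketch short; for batch work it would be cheaper to create it once and reuse it.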