As more people turn to self-hosted voice applications, many are looking for reliable local text-to-speech (TTS) systems that avoid relying on external APIs such as OpenAI, Google, or ElevenLabs. One popular choice is Kokoro TTS, a lightweight, high-performing model that has gained a lot of attention for its output quality and ease of use.
What is Kokoro TTS?
Kokoro TTS is a surprisingly powerful text-to-speech model that packs a punch despite its compact size. It is freely available on Hugging Face and GitHub, and it delivers impressive results, often taking the top spot on a Hugging Face TTS leaderboard. What sets Kokoro TTS apart from larger systems is its ability to run locally, even without a powerful graphics processing unit (GPU). This makes it accessible to a much wider range of users.
What is the Kokoro FastAPI Wrapper?
Kokoro FastAPI is a Dockerized FastAPI wrapper for the Kokoro-82M text-to-speech model. It provides CPU ONNX support as well as NVIDIA GPU acceleration through PyTorch. The wrapper also manages the model for you and automatically stitches the generated audio segments together.

Kokoro FastAPI Features
The Kokoro TTS model supports multiple languages, including English, Japanese, Korean, and Chinese. Vietnamese is currently in development. This multilingual capability makes the model accessible to users from diverse linguistic backgrounds.
For high-performance workloads, inference can be accelerated on NVIDIA GPUs or run on the CPU with PyTorch. The speech endpoint is OpenAI-compatible, so clients already written against the OpenAI text-to-speech API can be pointed at Kokoro FastAPI with minimal changes.
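To make the OpenAI compatibility concrete, here is a minimal sketch that uses only the Python standard library. It assumes the service listens on localhost:8880, that the speech route follows the OpenAI path `/v1/audio/speech`, and that `af_bella` and `tts-1-hd` are accepted voice and model names on your install. Treat all of these as assumptions to verify in the web UI rather than as the project's documented defaults.

```python
import json
import urllib.request

# Assumption: the service is reachable here; replace with your server's address.
BASE_URL = "http://localhost:8880/v1"


def build_speech_request(text, voice="af_bella", model="tts-1-hd",
                         response_format="mp3", base_url=BASE_URL):
    """Build (but do not send) a POST request for the speech endpoint."""
    # Field names follow the OpenAI speech API; the default voice and
    # model values are assumptions about this particular install.
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

With the container running, `synthesize_to_file("Hello from Kokoro", "hello.mp3")` should write the audio to disk. Building the request separately from sending it makes the payload easy to inspect before anything touches the network.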
ONNX support is available only in versions prior to v0.1.5; a future release is planned to add native ONNX support. Debug endpoints are available for tracking system statistics, and an integrated web UI at localhost:8880/web provides a convenient interface for accessing key metrics and adjusting settings.
The Kokoro TTS model offers phoneme-based audio generation and per-word timestamped caption generation, allowing developers to create custom sounds and voices with precision. Advanced voice mixing capabilities also enable users to combine different voices or weights to achieve a desired tone or timbre.
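As a rough illustration of voice mixing, the sketch below builds a combined voice string and shows how raw weights translate into shares. It assumes the API accepts voices joined with `+` and optional weights in parentheses, for example `af_bella(2)+af_sky(1)`. That syntax and the voice names are assumptions here, so check them against the project README for the version you run.

```python
def build_voice_mix(weights):
    """Return a combined voice string such as "af_bella(2)+af_sky(1)".

    `weights` maps voice names to positive numbers. The "+" and "(weight)"
    syntax is an assumption about how the API expects mixed voices.
    """
    if not weights:
        raise ValueError("at least one voice is required")
    parts = []
    for name, weight in weights.items():
        if weight <= 0:
            raise ValueError(f"weight for {name!r} must be positive")
        # The "g" format drops a trailing ".0", so 2.0 is rendered as "2".
        parts.append(f"{name}({weight:g})")
    return "+".join(parts)


def mix_shares(weights):
    """Normalize raw weights into fractions that sum to 1.0."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

For example, weights of 2 and 1 correspond to roughly a two-thirds and one-third blend of the two voices. The resulting string would be passed as the `voice` value of a speech request.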
These features make the Kokoro TTS model well suited to a wide range of applications that need high-quality text-to-speech. Its ability to handle multiple languages and provide customizable voice options sets it apart from other models on the market.
Listen to the features generated by Kokoro-82M
Installing Kokoro FastAPI using Docker
If you don't have Docker installed and need help getting started, I recommend checking out our self-hosting guides for beginners. These guides cover the basics of setting up Docker on your server. They're designed to help you get up and running smoothly, even if you're new to self-hosting.
Use the following Docker Compose config to install Kokoro FastAPI:
services:
  kokoro-fastapi-cpu:
    ports:
      - 8880:8880
    image: ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.2
This will install the CPU version of the application, which is surprisingly fast!
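The container can take a little while to pull and start, so a small readiness check can save some guessing. The sketch below polls until the service answers. The helper names `http_probe` and `wait_until_ready` are made up for this example, and it assumes the web UI path `/web` returns a successful response once the service is up.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=3.0):
    """Return True if `url` answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error codes all land here.
        return False


def wait_until_ready(probe, attempts=30, delay=2.0):
    """Call `probe()` until it returns True or the attempts run out.

    Returns the attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:8880/web"))`, replacing localhost with your server IP. Passing the probe in as a function keeps the retry loop independent of the network, which also makes it easy to try out with a stand-in.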
Kokoro FastAPI Web Interface
Now navigate to http://<your-server-ip>:8880/web to view the Kokoro FastAPI web interface.

Place text in the box, choose a voice, then press "Generate Speech". That's really all there is to it! You can download the output as a WAV, MP3, or PCM file to save for later use.
Reportedly, audio played directly in the web interface uses a "stream" quality setting, so it may sound lower in quality than the downloaded version of the file.
Here is an example of the opening paragraph of this article:
You can also adjust the "Voice Weight" value (on the right side of the voice) to modify the tone of the voice. By adjusting this setting, you can blend two or more voices together to create an entirely unique voice profile.
For some reason, I was getting errors when trying to adjust voice weights. I am still trying to figure out why, and I have an open issue here regarding this problem.
Connect Kokoro FastAPI to Open WebUI
What good would this guide be without showing how the API can work with another platform? As a bonus, I'll help you get Kokoro FastAPI connected to Open WebUI so you can use it to read aloud your conversations.

- In the Open WebUI Admin panel, go to Settings
- Click Audio
- Change the TTS Settings to use OpenAI
- Add the IP and port that Kokoro FastAPI is running on with a /v1 so it looks like this: http://192.168.1.45:8880/v1 (use your server IP)
- Set the API key (password) to: not-needed
- Choose a voice from the Kokoro voices (you can see the names in the web UI)
- Choose the tts-1-hd TTS Model
- Click Save
Pretty simple! Now you can click the speaker icon on any output in Open WebUI to have the voices from Kokoro FastAPI read them aloud to you.
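If the speaker icon does nothing, it can help to confirm that the base URL entered in Open WebUI is reachable from that machine. The sketch below builds the same URL and prepares a request for the list of voices. The route `/v1/audio/voices` and the helper names are assumptions for illustration, and the response shape is returned as-is rather than guessed at.

```python
import json
import urllib.request


def openai_base_url(host, port=8880):
    """Return the base URL to paste into the Open WebUI TTS settings."""
    return f"http://{host}:{port}/v1"


def build_voices_request(host, port=8880, api_key="not-needed"):
    """Build (but do not send) a GET request for the voice list.

    The /audio/voices route is an assumption about Kokoro FastAPI.
    """
    return urllib.request.Request(
        url=f"{openai_base_url(host, port)}/audio/voices",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_voices(host, port=8880):
    """Send the request and return the parsed JSON body unchanged."""
    request = build_voices_request(host, port)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Running `fetch_voices("192.168.1.45")` from the machine that hosts Open WebUI (with your own server IP) is a quick way to tell a networking problem apart from a settings problem.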
Stats for Nerds
Because I could not get the GPU version of Kokoro FastAPI running on my system due to outdated CUDA, I was forced to use the CPU version. Much to my surprise, it is far faster than I anticipated.
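If you want to put a number on "fast" for your own hardware, one common measure is the real-time factor: generation time divided by the length of the audio produced. The sketch below is one way to compute it from a downloaded WAV file. It assumes the file has a correct frame count in its header, which may not hold for streamed responses.

```python
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the playback length of a WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by audio length.

    Values below 1.0 mean audio is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds
```

To use it, time a speech request with `time.perf_counter()`, read the resulting WAV file as bytes, and pass both numbers to `real_time_factor`.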

Final Notes and Thoughts
Another fantastic self-hosted TTS application. I had a lot of fun with this one and was pleasantly surprised by how well it ran on my CPU. Granted, I do have a pretty beefy CPU, but it was a fun experience nonetheless.
Kokoro FastAPI does not have as many options and customizations as Zonos, but it is still a great start for a TTS platform that sounds incredibly realistic!
Keep in mind that these projects are under active development, and you may run into issues like I did with the voice weights. If you do encounter issues, please use the Kokoro FastAPI GitHub issue tracker.
Someone created another project to use this with Home Assistant

