Kokoro FastAPI - Self Hosted Text to Speech Platform Installation Guide

As more people turn to self-hosted voice applications, many are looking for reliable local text-to-speech (TTS) systems to avoid relying on external APIs like OpenAI, Google, or ElevenLabs. One popular choice is Kokoro TTS, a lightweight, high-performing model that has gained a lot of attention for its impressive features and ease of use.

What is Kokoro TTS?

Kokoro TTS is a surprisingly powerful text-to-speech model that packs a punch despite its compact size (about 82 million parameters, hence the name Kokoro-82M). It is freely available on Hugging Face and GitHub, and it delivers impressive results, often taking the top spot on Hugging Face's TTS leaderboard. What sets Kokoro TTS apart from larger systems is that it can run locally even without a powerful graphics processing unit (GPU). This makes it accessible to a much wider range of users.

Kokoro-82M Output Example af_heart

What is the Kokoro FastAPI Wrapper?

Kokoro FastAPI is a Dockerized FastAPI wrapper for the Kokoro-82M text-to-speech model. It provides CPU ONNX support as well as NVIDIA GPU acceleration through PyTorch. The wrapper also manages the model for you and automatically stitches the generated audio segments together.

Kokoro FastAPI Features

The Kokoro TTS model supports multiple languages, including English, Japanese, Korean, and Chinese. Vietnamese support is currently in development. This multilingual capability makes the model accessible to users from diverse linguistic backgrounds.

For performance, inference can be accelerated on NVIDIA GPUs or run on the CPU, both with PyTorch support. The speech endpoint is also OpenAI-compatible, so clients and tools that already talk to OpenAI's speech API can be pointed at Kokoro FastAPI instead.
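
Because the endpoint follows the shape of OpenAI's speech API, a request can be sketched with nothing but the Python standard library. This is a minimal sketch rather than the project's official client. It assumes the wrapper is reachable at http://localhost:8880, that it serves the OpenAI-style /v1/audio/speech route, and that the model name tts-1 and the voice af_heart are accepted. Adjust all of these to match your install.

```python
import json
import urllib.request

# Assumed base URL; change it to match your own server.
BASE_URL = "http://localhost:8880/v1"


def build_speech_request(text, voice="af_heart", model="tts-1", fmt="mp3"):
    """Build an OpenAI-style speech request without sending it."""
    payload = {
        "model": model,          # assumed to be accepted by the wrapper
        "input": text,
        "voice": voice,          # assumed voice name
        "response_format": fmt,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The Open WebUI setup in this guide uses "not-needed" as the key.
            "Authorization": "Bearer not-needed",
        },
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        audio = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return len(audio)
```

If those assumptions hold, calling synthesize("Hello from Kokoro", "hello.mp3") should leave an MP3 file in the current directory.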

ONNX support is currently limited to versions prior to v0.1.5, and a future release is planned to bring native ONNX support. Debug endpoints are available for tracking system statistics, and an integrated web UI at localhost:8880/web provides a convenient interface for accessing key metrics and adjusting settings.

The Kokoro TTS model offers phoneme-based audio generation and per-word timestamped caption generation, giving developers precise control over the output. Advanced voice mixing also lets users combine several voices with different weights to achieve a desired tone or timbre.
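
To make the idea of weighted mixing concrete, here is a small, self-contained sketch of how relative weights turn into blend proportions. It illustrates the general concept only. It is not the wrapper's own code, and the function names and the toy two-value "voice" vectors are assumptions made purely for illustration.

```python
def normalize_weights(weights):
    """Scale voice weights so that they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one voice needs a positive weight")
    return {voice: weight / total for voice, weight in weights.items()}


def blend(vectors, weights):
    """Weighted average of equally sized vectors, keyed by voice name."""
    shares = normalize_weights(weights)
    length = len(next(iter(vectors.values())))
    return [
        sum(shares[voice] * vectors[voice][i] for voice in shares)
        for i in range(length)
    ]


# Toy vectors standing in for real voice data (hypothetical values).
toy_voices = {"af_heart": [1.0, 0.0], "am_echo": [0.0, 1.0]}
mix = blend(toy_voices, {"af_heart": 3, "am_echo": 1})
print(mix)  # [0.75, 0.25]
```

A weight of 3 against 1 means the first voice contributes three quarters of the result, which is why raising one voice's weight pulls the blend toward it.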

These features make the Kokoro TTS model well suited to a wide range of applications that need high-quality text-to-speech. Its multilingual support and customizable voice options set it apart from many other models.

Listen to the feature list above, read aloud by Kokoro-82M:

Kokoro FastAPI Features am_echo

Installing Kokoro FastAPI using Docker

If you don't have Docker installed and need help getting started, I recommend checking out our self-hosting guides for beginners. These guides cover the basics of setting up Docker on your server. They're designed to help you get up and running smoothly, even if you're new to self-hosting.

Save the following Docker Compose config as docker-compose.yml, then run docker compose up -d in the same directory to install and start Kokoro FastAPI:

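services:
    kokoro-fastapi-cpu:
        # CPU build of the wrapper, pinned to a specific release
        image: ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.2
        ports:
            # host:container; the API and web UI are served on port 8880
            - "8880:8880"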

This installs the CPU version of the application, which is surprisingly fast!

Kokoro FastAPI Web Interface

Now navigate to http://<your-server-IP>:8880/web to view the Kokoro FastAPI web interface.
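
If the page does not load right away, the container may still be starting. The following is a rough sketch of a readiness check that polls the web UI until it answers. The function name, retry counts, and timeout are arbitrary choices for illustration, and the URL in the usage note is an example address.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, attempts=10, delay=2.0, opener=urllib.request.urlopen):
    """Return True once url answers with HTTP 200, False if every attempt fails."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # the container may still be starting; retry after a pause
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, wait_for_http("http://192.168.1.45:8880/web") returns True as soon as the interface responds, using your own server IP in place of the example address.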

Place text in the box, choose a voice, then press "Generate Speech". That's really all there is to it! You can download the output as a WAV, MP3, or PCM file to save for later use.

Note that audio played directly in the web interface is reportedly streamed, so its quality may be lower than that of the downloaded file.

💡 A PCM file is a raw digital representation of sound produced by pulse-code modulation (PCM). It holds only the audio samples, with no compression. PCM is the standard form of digital audio in computing, music storage, and telecommunications, and is commonly found on computers, compact discs, and digital devices.
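
A raw PCM download carries no header, so most players need to be told its sample rate, bit depth, and channel count. Below is a minimal sketch of wrapping raw PCM bytes in a WAV container with Python's standard wave module. The 24 kHz, 16-bit, mono settings are assumptions. Check them against what your Kokoro FastAPI version actually emits.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=24000, sample_width=2, channels=1):
    """Wrap raw little-endian PCM samples in a WAV header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample, 2 = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of 16-bit silence as stand-in PCM data.
silence = b"\x00\x00" * 24000
wav_bytes = pcm_to_wav_bytes(silence)
```

With a real download, read the PCM file's bytes, pass them to pcm_to_wav_bytes, and write the result to a .wav file. If playback sounds too fast or too slow, the assumed sample rate is the first thing to revisit.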

Here is the opening paragraph of this article, read by Kokoro-82M:

Kokoro-82M Output Example 2 af_sky

You can also adjust the "Voice Weight" value (on the right side of the voice) to modify the tone of the voice. By adjusting this setting, you can blend two or more voices together to create an entirely unique voice profile.

I ran into errors when adjusting voice weights. I am still trying to figure out why, and I have opened an issue with the project about it.

Connect Kokoro FastAPI to Open WebUI

What good would this guide be without showing how the API can work with another platform? As a bonus, I'll help you get Kokoro FastAPI connected to Open WebUI so you can use it to read aloud your conversations.

In the Open WebUI Admin panel, go to Settings
Click Audio
Change the TTS engine setting to OpenAI
Enter the IP and port that Kokoro FastAPI is running on, followed by /v1, so it looks like this: http://192.168.1.45:8880/v1 (use your server IP)
Enter not-needed as the API key (password)
Choose a voice from the Kokoro voices (you can see the names in the web UI)
Choose the tts-1-hd TTS model
Click Save

Pretty simple! Now you can click the speaker icon on any output in Open WebUI to have the voices from Kokoro FastAPI read them aloud to you.

Stats for Nerds

Because I could not get the GPU version of Kokoro FastAPI running on my system (my CUDA install is outdated), I had to use the CPU version. To my surprise, it is much faster than I anticipated.

Final Notes and Thoughts

This is another fantastic self-hosted TTS application. I had a lot of fun with this one and was pleasantly surprised by how well it ran on my CPU. Granted, I do have a pretty beefy CPU, but it was a fun experience nonetheless.

Kokoro FastAPI does not have as many options and customizations as Zonos, but it is still a great start for a TTS platform that sounds incredibly realistic!

Keep in mind that these projects are under active development, and you may run into issues like I did with the voice weights. If you do, please report them on the Kokoro FastAPI GitHub issue tracker.

Someone has also created a separate project for using this with Home Assistant.
