Adding Voice to Your Self-Hosted AI Stack
I’ve recently found AI, and small language models in particular, really useful for boring jobs like transcribing handwritten notes and converting speech to text. OpenAI is probably best known for its GPT series of large language models, but one of its biggest contributions, which has consistently flown under the radar for most people, is the Whisper speech-to-text model. Whisper is a genuinely great model: it is open, free to use, and runs with a relatively small memory footprint. I’ve found it incredibly useful for simply dictating my notes, thoughts and feelings.
If you’re so inclined, you can also use technologies like Whisper to converse verbally with a large language model. Some people use these tools for brainstorming, talking to the model while they’re out and about. ChatGPT allows you to do this, but of course everything you say is shared with OpenAI.
In this post, I’m going to show you how to set up a speech-to-text and text-to-speech pipeline as part of your self-hosted AI infrastructure, building on my previous article, which you can find here.
## Prerequisites
This post assumes that you already have OpenWebUI, LiteLLM, and Ollama set up, just like the stack I described in my earlier blog post on the subject. I’m also going to assume that you have a GPU with enough VRAM to run these additional models as well as the large language model that you want to talk to. The payoff is that you’ll be able to have an audio conversation with a model in a completely self-hosted setup, without ever sending any data to OpenAI or other companies.
To give you an idea of what’s possible, my full stack, with speech-to-text, text-to-speech and a Llama 3.1 8-billion-parameter model, all runs on a single NVIDIA 3060 graphics card with 12GB of VRAM. If you’re looking to talk to larger, more capable models, Gemma 27B for example, you might need a bigger graphics card or a separate machine to run the language model.
## Updated Stack Architecture
In this post, we’re going to introduce a new component into the existing stack: Speaches (formerly faster-whisper-server). It provides speech-to-text via Whisper models and text-to-speech via Kokoro-82M and Piper.
Since LiteLLM also supports audio models, we are going to hook Speaches up to LiteLLM so that we can serve both STT and TTS capabilities through our Caddy reverse proxy to users on the internet. We can also optionally hook these capabilities up to OpenWebUI, which will allow us to talk to locally hosted language models using our voice.
```mermaid
graph TD
    subgraph "Server"
        subgraph "Docker Containers"
            OW[OpenWebUI]
            SP[Speaches]
            OL[Ollama]
            LL[LiteLLM]
        end
    end

    subgraph "Internet"
        USER[Internet Users]
    end

    %% External connections
    USER -->|HTTPS| CADDY[Caddy]
    CADDY -->|Reverse Proxy| OW
    CADDY -->|Reverse Proxy| LL

    %% Internal connections
    OW -->|API Calls| LL
    SP -->|API| LL
    OL -->|API| LL

    %% Connection styling
    classDef docker fill:#1D63ED,color:white;
    classDef internet fill:#27AE60,color:white;
    classDef proxy fill:#F39C12,color:white;

    class OW,SP,OL,LL docker;
    class USER internet;
    class CADDY proxy;
```
## Adding Speaches to Docker Compose
If you’ve already followed my previous post, you should have a Docker Compose YAML file with all the services on your system already set up and defined.
We are going to add a new service definition for `speaches` to this file:
```yaml
services:
  # ...
  # your other services like ollama...
  # ...
  speaches:
    container_name: speaches
    restart: unless-stopped
    ports:
      - 8014:8000
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://0.0.0.0:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s
    # NOTE: a slightly older CUDA version is available under the 'latest-cuda-12.4.1' tag
    image: ghcr.io/speaches-ai/speaches:latest-cuda
    environment:
      - WHISPER__COMPUTE_TYPE=int8
      - WHISPER__TTL=-1
      - LOOPBACK_HOST_URL=http://192.168.1.123:8014
    volumes:
      - ./hf-cache:/home/ubuntu/.cache/huggingface/hub
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]
```
**Key details from this step:**
* We are pulling the `latest-cuda` build. If you run into problems, check that your CUDA runtime is not out of date; you can do this by running `nvidia-smi -q | grep 'CUDA'`. Builds for older runtimes are also available.
* The `LOOPBACK_HOST_URL` variable tells the app running inside the container what the host machine’s IP address is.
* I’m passing `WHISPER__COMPUTE_TYPE=int8` to quantize the model to 8-bit. You can try other options, but you may find they use more memory and make inference slower.
* `WHISPER__TTL=-1` forces the server to keep the model loaded in memory at all times. This is usually desirable if you have enough VRAM, since loading the model can take a few seconds. With the model already in VRAM I usually get real-time transcription. It’s lightning fast.
* I mapped port `8014` on my host machine to port `8000`, which the app listens on inside the container. You can use any free TCP port; it doesn’t have to be 8014.
* We persist the Hugging Face cache directory to disk so that we don’t have to re-download the models every time the container restarts.
## First Run
Once you’ve added the service, simply run `docker compose up -d speaches` to bring it up for the first time.
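If you want to script the wait for the container to come up rather than eyeballing `docker compose ps`, here is a tiny sketch that polls the same `/health` endpoint the compose healthcheck uses. The hostname and port are placeholders for your own server:

```python
# A tiny sketch, not production code: poll the Speaches /health endpoint until
# it responds. Replace myserver.local:8014 with your own host and mapped port.
import time
import urllib.request

URL = "http://myserver.local:8014/health"

for _ in range(30):
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("Speaches is up")
                break
    except OSError:
        pass  # container not ready yet (connection refused / timeout)
    time.sleep(2)
else:
    print("Speaches did not come up within a minute")
```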
We can test the transcription service by uploading an audio file. Try recording a short voice clip or converting a short video from YouTube using a service like this one:
```bash
curl http://<server_ip>:8014/v1/audio/transcriptions -F "file=@/path/to/file/audio.wav"
```
The first time you do this it could take a little while, since Speaches has to download the models from Hugging Face. Then you’ll get some JSON output containing the transcript. Here’s an example from when I ran this command against a Simpsons audio clip I found on YouTube:
```
> curl http://myserver.local:8014/v1/audio/transcriptions -F "file=@/home/james/Downloads/Bart Simpson Ay Caramba.wav"
{"text":"Barg, you really shouldn't be looking through other people's things. Find anything good? I said it before and I'll say it again. Ay, carumba! Elise, bang bang! Aw, Barg, that's a blackhead gun! Eww!"}
```
If you plan to use TTS, you will next need to follow some extra steps documented on the Speaches website:
```bash
export KOKORO_REVISION=c97b7bbc3e60f447383c79b2f94fee861ff156ac

# Download the ONNX model (~346 MBs)
docker exec -it speaches huggingface-cli download hexgrad/Kokoro-82M --include 'kokoro-v0_19.onnx' --revision $KOKORO_REVISION

# Download the voices.bin (~5.5 MBs) file
docker exec -it speaches curl --location --output /home/ubuntu/.cache/huggingface/hub/models--hexgrad--Kokoro-82M/snapshots/$KOKORO_REVISION/voices.bin https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/voices.bin
```
If you would prefer to use the Piper series of models, you can run **one** of the following commands to download voice models for it:
```bash
# Download all voices (~15 minutes / 7.7 GBs)
docker exec -it speaches huggingface-cli download rhasspy/piper-voices

# Download all English voices (~4.5 minutes)
docker exec -it speaches huggingface-cli download rhasspy/piper-voices --include 'en/**/*' 'voices.json'

# Download all qualities of a specific voice (~4 seconds)
docker exec -it speaches huggingface-cli download rhasspy/piper-voices --include 'en/en_US/amy/**/*' 'voices.json'

# Download specific quality of a specific voice (~2 seconds)
docker exec -it speaches huggingface-cli download rhasspy/piper-voices --include 'en/en_US/amy/medium/*' 'voices.json'
```
We can test that it worked by running a request against the speech endpoint:

```bash
curl http://myserver.local:8014/v1/audio/speech \
  --header "Content-Type: application/json" \
  --data '{"input": "Hello World!"}' \
  --output audio.mp3
```
## Adding Models to LiteLLM
Next we need to add the audio models to LiteLLM. We’re going to edit the existing `config.yaml` file and add the new model entries:
```yaml
model_list:
  # ...
  # your other models go here...
  # ...
  - model_name: whisper
    litellm_params:
      model: openai/Systran/faster-whisper-large-v3
      api_base: http://192.168.1.123:8014/v1
    model_info:
      mode: audio_transcription
  - model_name: Kokoro-82M
    litellm_params:
      model: openai/hexgrad/Kokoro-82M
      api_base: http://192.168.1.123:8014/v1
  - model_name: piper
    litellm_params:
      model: openai/hexgrad/Kokoro-82M
      api_base: http://192.168.1.123:8014/v1
```
Once you restart LiteLLM, you should be able to run the same tests from the previous section, but using your LiteLLM endpoint and credentials instead.
Testing Speech to Text with LiteLLM:
```bash
curl https://litellm.yoursite.example/v1/audio/transcriptions \
  -H "Authorization: Bearer sk-your-token" \
  -F model=whisper \
  -F "file=@/home/james/Downloads/Bart Simpson Ay Caramba.wav"
```
Testing Text to Speech with LiteLLM:
```bash
curl https://litellm.yoursite.example/v1/audio/speech \
  -H "Authorization: Bearer sk-your-token" \
  -H "Content-Type: application/json" \
  --data '{"model":"Kokoro-82M", "input": "Hello World! ROFLMAO", "voice":"bf_isabella", "language":"en_gb"}' \
  -o test.wav
```
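The same request also works from the OpenAI Python client pointed at LiteLLM. This is a minimal sketch: the URL, API key, model and voice are the example values used above, and because `language` is not part of the standard OpenAI spec it is passed through via `extra_body`:

```python
# A minimal sketch of the TTS request via the OpenAI Python SDK through LiteLLM.
# The base_url, API key, model and voice are the placeholder values from this post.
from openai import OpenAI

client = OpenAI(
    base_url="https://litellm.yoursite.example/v1",
    api_key="sk-your-token",
)

response = client.audio.speech.create(
    model="Kokoro-82M",
    voice="bf_isabella",
    input="Hello World! ROFLMAO",
    extra_body={"language": "en_gb"},  # non-standard field, forwarded to Speaches
)

# The response is binary audio; write it straight to disk.
with open("test.wav", "wb") as f:
    f.write(response.content)
```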
## Connecting OpenWebUI to LiteLLM
To add voice capability to OpenWebUI, log in as the admin user (by default, the first account you created when you installed OpenWebUI), click on your username in the bottom left-hand corner of the screen and go to Admin Panel.
Navigate to the Settings tab and then to Audio. Here you can populate the STT and TTS settings:
* You can use the same endpoint for both: it is your LiteLLM instance URL with `/v1` appended, e.g. `https://litellm.mydomain.example/v1`.
* You can use the same LiteLLM API key for both. I like having different keys for different apps so that I can see usage across my software stack, but you can also just use LiteLLM’s admin user password as a key if you prefer.
* For the STT model, enter the corresponding name from the YAML: `whisper` in the example above.
* For the TTS model, enter the model name you used in the LiteLLM config: either `piper` or `Kokoro-82M` in the example above.
* In the TTS settings you also need to specify a voice. A full list of available voices can be found in the Speaches demo Gradio app (likely running on `http://yourserver.local:8014`) under the Text-to-Speech tab.
My personal preference is `bf_isabella` with `Kokoro-82M`, or the `en_GB-alba-medium` voice with the `piper` model. I tend to stick with Piper at the moment due to a weird quirk/bug I found with voice prosody and pronunciation (see below).
## Testing Calls
To talk to a model, go to OpenWebUI, select the model you want to interact with and click the call icon.
In this mode, OpenWebUI will “listen” to your microphone and pass the audio to the Whisper endpoint. When you stop talking, Whisper detects the break and everything you have said so far is processed by the language model. A response is generated and passed to the TTS endpoint before being played back to you.
I noticed that this doesn’t always work perfectly in Firefox Mobile; it seems to get stuck and not play back the response. Chromium-based browsers, however, seem to get this right.
## Prosody and Language Quirk/Bug
As of writing, there is a weird quirk in Speaches where the `Kokoro-82M` UK English voices default to American prosody and pronunciation if you do not explicitly set the language as part of your request. For example, try running the following command against your own server and you’ll hear the model pronounce “common” as “carmen” and “problem” as “pra-blem”, which sounds odd in an English accent.
```bash
curl https://litellm.yoursite.example/v1/audio/speech \
  -H "Authorization: Bearer sk-your-token" \
  -H "Content-Type: application/json" \
  --data '{"model":"Kokoro-82M", "input": "This is quite a common problem", "voice":"bf_isabella"}' \
  -o test.wav
```
Unfortunately, OpenWebUI does not currently have an option for passing the user’s language preference to the model, which means the pronunciation is always off for me.
There are a couple of possible solutions I can think of:
1. Have OpenWebUI pass a `language` param when it makes TTS API calls to LiteLLM.
2. Have Speaches map the language of the voice automatically. Voices in Kokoro follow a naming convention that encodes the country of origin and gender (e.g. American female voices are prefixed `af`, British male voices `bm`, and so on), so Speaches could infer the language from the prefix, as sketched below. Alternatively, the language could be stored in the voice metadata somewhere.
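To illustrate option 2, here is a purely hypothetical sketch of what that voice-prefix-to-language mapping could look like. None of these names exist in Speaches today; they are just here to show the idea:

```python
# A hypothetical sketch of inferring the language from a Kokoro voice name prefix.
# The mapping and function names are illustrative, not part of the Speaches codebase.
KOKORO_PREFIX_TO_LANGUAGE = {
    "af": "en_us",  # American female
    "am": "en_us",  # American male
    "bf": "en_gb",  # British female
    "bm": "en_gb",  # British male
}

def infer_language(voice: str, default: str = "en_us") -> str:
    """Guess a language code from a voice name like 'bf_isabella'."""
    prefix = voice.split("_", 1)[0]
    return KOKORO_PREFIX_TO_LANGUAGE.get(prefix, default)

print(infer_language("bf_isabella"))  # -> en_gb
```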
I might make some pull requests if I’m feeling cute later.
## Conclusion
I’m pretty blown away by how accurate and realistic these small models, running on a single consumer GPU, can be. It’s really useful to have a full speech-to-text and text-to-speech stack running locally without having to worry about privacy. If you wanted to, you could swap out the language model in this stack for an externally hosted one like Claude Sonnet or a Groq-hosted open-ish model like Llama 3.3 70B. You could even make ironic use of GPT-4o.
In future articles I’ll cover some other use cases for these tools, including external voice transcription apps and tools that can plug into your Whisper API, and my self-hosted Home Assistant stack, which I am using to replace Alexa with fully self-hosted home automation tooling.