flux9665.bsky.social
Reposted
KyutaiTTS solved streaming text-to-speech with a state machine that generates audio word-by-word as text arrives.

220 ms latency, 10-second voice cloning, 32 concurrent users on a single GPU.

No more waiting for complete sentences.

Full analysis: erogol.substack.com/p/model-chec...
Model check - KyutaiTTS: Streaming Text-to-Speech with Delayed Streams Modeling
Going over Kyutai's new TTS model and its delayed streams modeling.
erogol.substack.com
August 2, 2025 at 7:46 PM