Project Page: kyutai.org/next/stt
OpenASR Leaderboard: huggingface.co/spaces/hf-au...
🤗 Get the models on HuggingFace: huggingface.co/kyutai/heliu...
📚 Try our pretraining data pipeline on GitHub: github.com/kyutai-labs/...
🔎 github.com/kyutai-labs/...
Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences — and the open-source community.
We’re releasing a preprint, model weights and a benchmark dataset for spoken visual question answering:
📄 Preprint arxiv.org/abs/2503.15633
🧠 Dataset huggingface.co/datasets/kyu...
🧾 Model weights huggingface.co/kyutai/moshi...
🧪 Inference code github.com/kyutai-labs/...
MoshiVis builds on Moshi, our speech-to-speech LLM — now enhanced with vision.
A lightweight set of 206M parameters on top of a frozen Moshi lets it discuss images while still running in real time on consumer-grade hardware.
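For intuition, the "small trainable module on top of a frozen backbone" pattern looks roughly like the sketch below. This is an illustrative assumption, not the actual MoshiVis code: the cross-attention adapter, gating, module names, and sizes are placeholders; see the preprint and inference code linked above for the real architecture.

```python
import torch
import torch.nn as nn

class FrozenBackboneWithAdapter(nn.Module):
    """Small trainable adapter over a frozen backbone (illustrative sketch only)."""

    def __init__(self, backbone: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # the base speech model stays frozen
        # Small cross-attention block letting the speech stream attend to image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gated so the adapter starts as a no-op

    def forward(self, speech_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        h = self.backbone(speech_hidden)
        attended, _ = self.cross_attn(h, image_features, image_features)
        return h + torch.tanh(self.gate) * attended

# Tiny usage example with stand-in shapes (all values hypothetical):
backbone = nn.Identity()  # placeholder for the frozen Moshi trunk
model = FrozenBackboneWithAdapter(backbone, dim=512)
out = model(torch.randn(1, 10, 512), torch.randn(1, 196, 512))
```

Because only the adapter parameters receive gradients, the added compute and memory footprint stays small, which is what keeps the combined model usable in real time.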
Here is an example of a live conference interpretation.