Project Page: kyutai.org/next/stt
OpenASR Leaderboard: huggingface.co/spaces/hf-au...
🤗 Get the models on HuggingFace: huggingface.co/kyutai/heliu...
📚 Try our pretraining data pipeline on GitHub: github.com/kyutai-labs/...
🔎 github.com/kyutai-labs/...
Thanks to Iliad Group, CMA-CGM Group, Schmidt Sciences — and the open-source community.
We’re releasing a preprint, model weights and a benchmark dataset for spoken visual question answering:
📄 Preprint arxiv.org/abs/2503.15633
🧠 Dataset huggingface.co/datasets/kyu...
🧾 Model weights huggingface.co/kyutai/moshi...
🧪 Inference code github.com/kyutai-labs/...
MoshiVis builds on Moshi, our speech-to-speech LLM — now enhanced with vision.
A lightweight set of 206M parameters on top of a frozen Moshi lets it discuss images while still running in real time on consumer-grade hardware.
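For intuition, the "small trainable module on top of a frozen backbone" pattern looks roughly like the sketch below. This is an illustrative assumption, not the actual MoshiVis code: the cross-attention adapter, gating, module names, and sizes are placeholders; see the preprint and inference code linked above for the real architecture.

```python
import torch
import torch.nn as nn

class FrozenBackboneWithAdapter(nn.Module):
    """Small trainable adapter over a frozen backbone (illustrative sketch only)."""

    def __init__(self, backbone: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # the base speech model stays frozen
        # Small cross-attention block letting the speech stream attend to image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gated so the adapter starts as a no-op

    def forward(self, speech_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        h = self.backbone(speech_hidden)
        attended, _ = self.cross_attn(h, image_features, image_features)
        return h + torch.tanh(self.gate) * attended

# Tiny usage example with stand-in shapes (all values hypothetical):
backbone = nn.Identity()  # placeholder for the frozen Moshi trunk
model = FrozenBackboneWithAdapter(backbone, dim=512)
out = model(torch.randn(1, 10, 512), torch.randn(1, 196, 512))
```

Because only the adapter parameters receive gradients, the added compute and memory footprint stays small, which is what keeps the combined model usable in real time.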
Here is an example of a live conference interpretation.