We’re releasing a preprint, model weights, inference code, and a benchmark dataset for spoken visual question answering:
📄 Preprint arxiv.org/abs/2503.15633
🧠 Dataset huggingface.co/datasets/kyu...
🧾 Model weights huggingface.co/kyutai/moshi...
🧪 Inference code github.com/kyutai-labs/...
MoshiVis builds on Moshi, our speech-to-speech LLM — now enhanced with vision.
Just 206M additional lightweight parameters on top of a frozen Moshi give it the ability to discuss images while still running in real time on consumer-grade hardware.
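The preprint has the details; the general idea is a small set of trainable modules that let the frozen speech backbone attend to features from an image encoder. Below is a minimal, illustrative sketch of that pattern (a gated cross-attention adapter over frozen hidden states) with hypothetical names and dimensions, not the actual MoshiVis code:

```python
# Illustrative sketch (hypothetical names): a small trainable adapter that lets a
# frozen speech LLM attend to image features, in the spirit of adding a modest
# number of parameters on top of a frozen backbone.
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Trainable cross-attention block inserted alongside a frozen transformer layer."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate initialized at zero so the frozen model's behavior is unchanged
        # at the start of training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, speech_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # speech_hidden: (batch, seq, d_model) hidden states from the frozen LLM
        # image_feats:   (batch, n_patches, d_model) from a frozen vision encoder
        q = self.norm(speech_hidden)
        attended, _ = self.attn(q, image_feats, image_feats)
        return speech_hidden + torch.tanh(self.gate) * attended

# Only the adapter parameters are trained; the backbone and vision encoder stay frozen.
adapter = GatedCrossAttentionAdapter(d_model=1024)
hidden = torch.randn(1, 50, 1024)    # placeholder hidden states
patches = torch.randn(1, 196, 1024)  # placeholder image features
out = adapter(hidden, patches)       # same shape as `hidden`
```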
It sees, understands, and talks about images — naturally, and out loud.
This opens up new applications, from audio description for the visually impaired to spoken access to visual information.
Here is an example of live conference interpretation.
Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker’s voice and optimally adapting its pace based on the semantic content of the source speech. 🧵