It is a text-LLM wrapper, built on in-house streaming ASR, TTS, and semantic VAD to reduce latency. ⏱️
Unlike Moshi 🟢, Unmute 🔊 is turn-based, but allows customization in two clicks 🖱️: voice and prompt!
Paper and open source coming soon.
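Unmute's code isn't out yet, so here is only a minimal sketch of how such a turn-based cascaded loop could be wired up. Every class and method name below is a hypothetical stand-in, not Unmute's actual API:

```python
# Hypothetical sketch of a turn-based cascaded voice pipeline:
# streaming ASR -> semantic VAD end-of-turn detection -> text LLM -> streaming TTS.
# All components are stand-ins; Unmute's real interfaces are not public yet.

class CascadedVoiceAgent:
    def __init__(self, asr, vad, llm, tts, system_prompt, voice):
        self.asr = asr                      # streaming speech-to-text
        self.vad = vad                      # semantic VAD: decides when the user is done
        self.llm = llm                      # any text-only LLM
        self.tts = tts                      # streaming text-to-speech
        self.system_prompt = system_prompt  # customizable prompt (one of the "two clicks")
        self.voice = voice                  # customizable voice (the other click)

    def run_turn(self, mic_frames):
        transcript = ""
        for frame in mic_frames:
            transcript += self.asr.feed(frame)  # partial hypotheses stream in
            if self.vad.end_of_turn(frame, transcript):
                break  # semantic VAD cuts latency vs. a fixed silence timeout
        # Stream LLM tokens straight into TTS so audio starts
        # before the full reply has been generated.
        reply_tokens = self.llm.stream(self.system_prompt, transcript)
        return self.tts.stream(reply_tokens, voice=self.voice)
```

The latency win of the cascade comes from streaming every stage and from the semantic VAD ending the turn as soon as the user sounds done, rather than waiting out a silence threshold.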
Only 200M weights were added to plug in a ViT through cross-attention with gating 🖼️🔀🎤
Training relies on a mix of text-only and text+audio synthetic data (~20k hours) 💽
It sees, understands, and talks about images — naturally, and out loud.
This opens up new applications, from audio description for the visually impaired to visual access to information.
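A rough PyTorch sketch of the gated cross-attention idea, to make the "200M weights plugged in" concrete. Shapes and module names are my assumptions, not MoshiVis's actual code; the key trick is a gate initialized at zero, so the frozen backbone is untouched at the start of training:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch of a gated cross-attention adapter (hypothetical, not MoshiVis's code).

    Hidden states from the frozen speech-text backbone attend to ViT image
    embeddings; a tanh gate initialized at zero makes the block an identity
    at initialization, so only the small adapter needs to be learned.
    """

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed: output == input

    def forward(self, hidden, image_embeds):
        # hidden: (B, T, dim) backbone states; image_embeds: (B, N, dim) ViT tokens
        attended, _ = self.attn(self.norm(hidden), image_embeds, image_embeds)
        return hidden + torch.tanh(self.gate) * attended
```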
See you there!
📅 13th of March 🕰️ 11am ET, 4pm in Paris.
I'll discuss Mimi 🗜️ and multi-stream audio modeling 🔊.
Join on Zoom, replay on YT.
📅 Thursday, March 13 | 11 AM-12 PM EDT
🎙 Speaker: Alexandre Défossez
📖 Topic: "Moshi: a speech-text foundation model for real-time dialogue"
🔗 Details: (poonehmousavi.github.io/rg)
▶️ Missed a session? Watch on YouTube: (www.youtube.com/@CONVAI_RG) 🚀
Following right after @yann-lecun.bsky.social and so many humbling figures of AI:
www.france.tv/documentaire...
www.technologyreview.com/2025/02/07/1...
Hibiki produces spoken and text translations of the input speech in real time, while preserving the speaker's voice and adapting its pace based on the semantic content of the source speech. 🧵
Link: arxiv.org/abs/2502.02996
1/8
We leverage a large synthetic corpus generated with the text translation model MADLAD, plus our own TTS and a simple lag rule.
The model is decoder-only and runs at scale, even on-device 📲
github.com/kyutai-labs/hibiki
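The post doesn't spell out the "simple lag rule", so here is one toy interpretation, purely illustrative and not the paper's actual rule: delay each target word until a few source words past its aligned source word have been heard, so the synthetic target speech never runs ahead of what the source has revealed:

```python
# Toy illustration of a lag rule for building synthetic simultaneous-translation
# data (my guess at the idea; Hibiki's actual rule may differ).
# Each word is a (text, start_time_in_seconds) pair from forced alignment.

def apply_lag(source_words, target_words, alignment, min_lag=2):
    """Shift each target word so it starts only after the source word it is
    aligned to, plus a fixed lag of `min_lag` additional source words."""
    timed_target = []
    for tgt_idx, (word, _) in enumerate(target_words):
        src_idx = alignment[tgt_idx]  # index of the aligned source word
        gate_idx = min(src_idx + min_lag, len(source_words) - 1)
        start = source_words[gate_idx][1]  # wait until that source word is heard
        if timed_target:
            start = max(start, timed_target[-1][1])  # keep target times monotonic
        timed_target.append((word, start))
    return timed_target  # feed these timestamps to TTS to synthesize lagged target speech
```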
What: master's internship and/or PhD positions
Where: Rothschild Foundation Hospital (Paris, France)
Topic: AI and Neuroscience
Supervised by: Pierre Bourdillon and myself
Apply here: forms.gle/KKnea2QAjhAe...
Deadline: Feb 5th
On HF, under CC-BY licence: huggingface.co/kyutai/heliu...