Eugene Kharitonov
n0mad-0.bsky.social
Technical Staff at @kyutai.org. Previously Google DeepMind, Meta AI Research. CS PhD.
Reposted by Eugene Kharitonov
Our latest open-source speech-to-text model just claimed 1st place among streaming models and 5th place overall on the OpenASR leaderboard 🥇🎙️
While all other models need the whole audio, ours delivers top-tier accuracy on streaming content.
Open, fast, and ready for production!
June 27, 2025 at 10:31 AM
Reposted by Eugene Kharitonov
Have you enjoyed talking to 🟢Moshi and dreamt of making your own speech-to-speech chat experience🧑‍🔬🤖? It's now possible with the moshi-finetune codebase! Plug in your own dataset and change the voice/tone/personality of Moshi 💚🔌💿. Here is an example after finetuning with only 20 hours of the DailyTalk dataset. 🧵
April 1, 2025 at 3:47 PM
Reposted by Eugene Kharitonov
Just back from holidays, so a bit late, to announce MoshiVis, extending Moshi's multimodal capabilities to take in images 📷.
Only 200M weights were added to plug a ViT through cross attention with gating 🖼️🔀🎤
Training relies on a mix of text only and text+audio synthetic data (~20k hours) 💽
Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images!

It sees, understands, and talks about images — naturally, and out loud.

This opens up new applications, from audio description for the visually impaired to visual access to information.
March 31, 2025 at 10:06 AM
Hello 🌎!
March 15, 2025 at 7:13 AM