selim
@onder.ai
research engineer/audio
Results: SOTA across 10+ benchmarks. More importantly - it moved from basic recognition to musician-level understanding. The model connects surface attributes to mid-level structures (chord progressions, vocal phrasing) to higher-level meaning (emotional trajectory, lyrical context).
November 15, 2025 at 1:31 PM
Their approach: MF-Skills dataset with rich captions covering harmony, structure, timbre, lyrics, cultural context. Fine-tuned Audio Flamingo 3, then added MF-Think (chain-of-thought grounded in music theory) + GRPO reinforcement learning.
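A minimal sketch (not Music Flamingo's actual code) of GRPO's central trick: sample a group of responses per prompt, score them, and use group-relative reward z-scores as advantages - no learned value network required. The reward values here are made up for illustration.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: z-score rewards within one prompt's group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 4 sampled chain-of-thought answers to one music question,
# each graded by a reward model or rubric (hypothetical scores)
group_rewards = np.array([0.2, 0.9, 0.5, 0.1])
print(grpo_advantages(group_rewards))  # above-mean answers get positive advantage
```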
November 15, 2025 at 1:31 PM
The challenge: music is layered, information-dense, and prior models only captured superficial details (genre, tempo, instruments). Limited by data scarcity and weak annotations.
November 15, 2025 at 1:31 PM
this also depends on the tokenization, right? with a scheme where each character is its own token, tabs and spaces cost the same. but gpt-style tokenizers use merges, so tabs and spaces can take different amounts of space depending on the context.
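a quick way to see this - a minimal sketch assuming tiktoken's cl100k_base encoding (exact counts will differ for other tokenizers):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tab_indented = "\tif x:\n\t\treturn x"
space_indented = "    if x:\n        return x"

# merges mean indentation cost depends on the surrounding bytes,
# so the two snippets rarely tokenize to the same length
print(len(enc.encode(tab_indented)))
print(len(enc.encode(space_indented)))
```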
November 15, 2025 at 1:22 PM
hackernews frontpage!
October 25, 2025 at 12:27 AM
I guess it's just relatively cheap to do real-world ads
October 11, 2025 at 1:57 PM
We’ll synthesize the “avocado chairs” of speech - vocal expressions that have never existed, impossible speaker combinations, entirely new forms of human-like communication. Not just better TTS, but a new creative medium.
September 3, 2025 at 10:37 PM
Once audio language models can accurately describe all audio (not just speech), audio world models will emerge that outperform any specialized TTS system. The same way DALL-E/Midjourney made face-specific generators obsolete.
September 3, 2025 at 10:37 PM
Today’s TTS models are incredible, but they’re trained primarily on speech data. Here’s the thing: speech data alone can’t capture true human communication - the sighs, laughs, breathing patterns, all those nonverbal nuances that make us human.
September 3, 2025 at 10:37 PM
Think about image gen’s evolution: We started with constrained domains (faces with StyleGAN), then suddenly had models creating “avocado chairs” - concepts that never existed. This happened when we moved from specialized to general world models.
September 3, 2025 at 10:37 PM
Evaluating generative audio is hard. Metrics like CER and WER quantify transcription errors, but completely ignore whether the audio feels real or contextually appropriate. There's no good automated metric yet for audio realness. WavReward offers an alternative for eval!
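For concreteness, a minimal sketch with the jiwer library - it shows exactly what WER/CER do and don't capture; the strings are made up:

```python
from jiwer import cer, wer

reference = "the quick brown fox"
hypothesis = "the quick brown fax"

# transcription-level error: 1 of 4 words / 1 of 19 chars is wrong...
print(wer(reference, hypothesis))  # 0.25
print(cer(reference, hypothesis))  # ~0.05

# ...but neither score says anything about whether the audio that
# produced "fax" sounded robotic, unnatural, or contextually off
```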
May 17, 2025 at 3:42 PM
And I predict that in-context learning with audio language models will become a hot topic—especially in defining new evaluation methods to assess the quality of generated audio, improving the precision of dataset labeling, and much more.
May 10, 2025 at 6:20 AM
Code + weights are public, it runs on budget GPUs, and early demos already score on par with, or better than, ElevenLabs, the current paid heavyweight.

I think this is a great opportunity for audio/speech researchers who want to do some cool post-training stuff.
April 26, 2025 at 11:29 AM
Dia ditches the extra NAR stage used in VALL-E-style stacks. It instead models all quantizers together and predicts each residual in parallel.
April 26, 2025 at 11:29 AM
Audio is discretised with the high-fidelity Descript residual VQ tokenizer, pumping out 86 tokens/sec per residual. It is indeed one of the best-sounding codecs, but it does produce long token sequences as well as many residuals. Dia eases this load with delayed-pattern modeling.
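A hedged sketch of what MusicGen-style delayed-pattern modeling looks like (illustrative, not Dia's actual code): codebook k is shifted right by k frames, so the model can predict all residual levels in parallel while residual k still conditions on coarser levels from earlier frames.

```python
import numpy as np

def apply_delay_pattern(codes: np.ndarray, pad_id: int) -> np.ndarray:
    """codes: (n_codebooks, n_frames) grid of RVQ token ids."""
    n_q, n_frames = codes.shape
    out = np.full((n_q, n_frames + n_q - 1), pad_id, dtype=codes.dtype)
    for k in range(n_q):
        out[k, k:k + n_frames] = codes[k]  # delay codebook k by k steps
    return out

codes = np.arange(12).reshape(3, 4)           # 3 residual levels, 4 frames
print(apply_delay_pattern(codes, pad_id=-1))  # staircase layout, shape (3, 6)
```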
April 26, 2025 at 11:29 AM