selim
onder.ai
research engineer/audio
arxiv.org/abs/2511.10289

Music Flamingo from NVIDIA - scaling music understanding in audio-language models beyond surface-level captions.
Music Flamingo: Scaling Music Understanding in Audio Language Models
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapi...
arxiv.org
November 15, 2025 at 1:31 PM
Notion AI looks like Pringles after a rebrand
October 2, 2025 at 4:58 PM
Actual audio world models are getting closer and closer.
I believe speech synthesis is following the exact same path image generation took - and we’re about to see the same explosive leap.
September 3, 2025 at 10:37 PM
An amazing coincidence: just a few days ago I wondered whether Udio had an iOS app, and there was none. Today they released their first iOS app 😸
May 22, 2025 at 6:56 AM
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

arxiv.org/abs/2505.09558
End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance ...
arxiv.org
May 17, 2025 at 3:42 PM
I’ve been testing Gemini 2.5 for various audio understanding tasks. I must admit, I’m excited to see that what GPT-2 did for text is now happening in audio: fewer specialist models, more generalist ones.
May 10, 2025 at 6:18 AM
Brandon's Semiconductor Simulator

brandonli.net/semisim/

(from hn)
Semiconductor simulator
brandonli.net
May 10, 2025 at 4:21 AM
Reposted by selim
How do memories last a lifetime when the molecules that form them turn over within days, weeks or months? An interaction between two proteins points to a molecular basis for memory. @ajdinahalilovic.bsky.social reports: www.quantamagazine.org/the-molecula...
The Molecular Bond That Helps Secure Your Memories | Quanta Magazine
www.quantamagazine.org
May 7, 2025 at 1:51 PM
Say hello to Dia—the newest codec-LM on the block. Fully open-sourced and already turning heads by nailing speech and the tricky stuff: laughs, coughs, hesitations, breath-intakes.

github.com/nari-labs/dia 🧵
GitHub - nari-labs/dia: A TTS model capable of generating ultra-realistic dialogue in one pass.
github.com
April 26, 2025 at 11:29 AM
Udio's allegro 1.5 model is impressively fast while generating decent musical outputs.

A very good release by Udio, since one thing it lacked against Suno's competition was the ability to generate quickly.

I'm still unsure about the trade-off between the models, but I believe 1.5 is unmatched at "Ultra" quality.
April 20, 2025 at 7:10 PM
Reposted by selim
Generate videos in Gemini and Whisk with Veo 2

🔺 334
💬 23
🔗 HN Post | Article
You can now generate videos in Gemini, powered by Veo 2.
blog.google
April 16, 2025 at 3:43 PM
Udio Music leads generative music despite Suno's competition. Suno is fast and produces high-quality audio, but its output lacks musical merit. Udio delivers both quality audio and good music. If Udio shortened its experimental cycle without sacrificing musical merit, it could take the field to a new level.
March 15, 2025 at 7:37 PM
TAAE - A new audio tokenizer drop

paper: arxiv.org/abs/2411.198...

They built a pure transformer codec (1B params) with an FSQ bottleneck instead of the usual RVQ approach. At 400-700 bits per second, it produces extremely high-quality speech that gets close to the original audio.
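
For anyone unfamiliar with FSQ (finite scalar quantization), here is a rough sketch of the idea, not TAAE's actual code. Each latent dimension is squashed into a bounded range and rounded to one of a few integer levels, so the codebook is implicit instead of learned as in RVQ. The function names, the odd level counts, the `[5, 5, 5, 5]` layout, and the 25 tokens per second are all illustrative assumptions, not values from the paper.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each latent dimension independently to an integer code.

    Simplifying assumption: every entry of `levels` is odd, so the grid
    is symmetric around zero and no half-step offset is needed.
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) // 2
        bounded = math.tanh(value) * half         # squash into [-half, half]
        codes.append(int(round(bounded)) + half)  # shift to 0 .. num_levels - 1
    return codes


def codes_to_index(codes, levels):
    """Pack the per-dimension codes into one token id (mixed-radix number)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


def bits_per_second(levels, tokens_per_second):
    """Bitrate = tokens per second * bits needed to name one token."""
    return tokens_per_second * sum(math.log2(n) for n in levels)


# Hypothetical configuration, chosen only for illustration.
levels = [5, 5, 5, 5]                 # 5**4 = 625 implicit codebook entries
latent = [0.0, 10.0, -10.0, 0.3]      # one made-up latent frame

codes = fsq_quantize(latent, levels)
token_id = codes_to_index(codes, levels)
rate = bits_per_second(levels, tokens_per_second=25)

print(codes, token_id, round(rate, 1))
```

The point of the bitrate helper: with a fixed token rate, the bits per second follow directly from how many levels each dimension has, which is how a codec design can be tuned toward a target such as a few hundred bits per second.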
December 3, 2024 at 1:30 AM
1/13
Two years ago, Microsoft Speech published VALL-E, one of the most influential works that helped shift speech synthesis to a whole new scale. Last year, the same group published VALL-E 2, which is, in my opinion, a perfect "lessons learned" paper.
December 1, 2024 at 4:22 PM

Vision Language Models Are Few-Shot Audio Spectrogram Classifiers

paper: openreview.net/forum?id=RnB...
We demonstrate that vision language models (VLMs) are capable of recognizing the content in audio recordings when given corresponding spectrogram images. Specifically, we instruct VLMs to perform...
openreview.net
November 23, 2024 at 8:06 PM