Lightnews — Scholar-powered news

Actual audio world models are getting closer and closer.
I believe speech synthesis is following the exact same path image generation took - and we’re about to see the same explosive leap.

September 3, 2025 at 10:37 PM

selim

@onder.ai

An amazing coincidence: just a few days ago I wondered whether Udio had an iOS app, and there was none. Today they released their first iOS app 😸

May 22, 2025 at 6:56 AM

selim

@onder.ai

WavReward: Spoken Dialogue Models With Generalist Reward Evaluators

arxiv.org/abs/2505.09558

End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance ...

arxiv.org

May 17, 2025 at 3:42 PM

selim

@onder.ai

I’ve been testing Gemini 2.5 for various audio understanding tasks. I must admit, I’m excited to see that what GPT-2 did for text is now happening in audio: fewer specialist models, more generalist ones.

May 10, 2025 at 6:18 AM

selim

@onder.ai

Brandon's Semiconductor Simulator

brandonli.net/semisim/

(via Hacker News)

Semiconductor simulator

brandonli.net

May 10, 2025 at 4:21 AM

Reposted by selim

Quanta Magazine

@quantamagazine.bsky.social

How do memories last a lifetime when the molecules that form them turn over within days, weeks or months? An interaction between two proteins points to a molecular basis for memory. @ajdinahalilovic.bsky.social reports: www.quantamagazine.org/the-molecula...

The Molecular Bond That Helps Secure Your Memories | Quanta Magazine

www.quantamagazine.org

May 7, 2025 at 1:51 PM

selim

@onder.ai

Say hello to Dia—the newest codec-LM on the block. Fully open-sourced and already turning heads by nailing speech and the tricky stuff: laughs, coughs, hesitations, breath-intakes.

github.com/nari-labs/dia 🧵

GitHub - nari-labs/dia: A TTS model capable of generating ultra-realistic dialogue in one pass.

A TTS model capable of generating ultra-realistic dialogue in one pass. - nari-labs/dia

github.com

April 26, 2025 at 11:29 AM

selim

@onder.ai

Udio's Allegro 1.5 model is impressively fast while still generating decent musical output.

A very good release by Udio, since fast generation was one of the features it lacked compared with Suno.

I'm still unsure about the trade-off between the models, but I believe 1.5 is unmatched at "Ultra" quality.

April 20, 2025 at 7:10 PM

Reposted by selim

Hacker News JP 🤖

@hacker-news-jp.bsky.social

Generate videos in Gemini and Whisk with Veo 2

🔺 334
💬 23
🔗 HN Post | Article

You can now generate videos in Gemini, powered by Veo 2.

blog.google

April 16, 2025 at 3:43 PM

selim

@onder.ai

Udio Music leads generative music despite Suno's competition. Suno is fast with high-quality audio but lacks musical merit. Udio delivers both quality audio and good music. If Udio shortened its experimental cycle without sacrificing musical merit, it could take the field to a new level.

March 15, 2025 at 7:37 PM

selim

@onder.ai

TAAE - A new audio tokenizer drop

paper: arxiv.org/abs/2411.198...

They built a pure transformer codec (1B params) with an FSQ bottleneck instead of the usual RVQ approach. At 400-700 bits per second, it produces extremely high quality speech - getting close to the original audio.
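
For readers unfamiliar with FSQ (finite scalar quantization), here is a minimal sketch of the general idea, not the paper's implementation. Each latent dimension is squashed into a bounded range and rounded to one of a few integer levels, so the codebook is implicit rather than learned as in RVQ. The level counts, the token rate, and the function names below are illustrative assumptions and are not taken from the TAAE paper. Only odd level counts are handled, to keep the rounding symmetric.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent value to one of levels[i] integer codes (odd counts assumed)."""
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2           # e.g. 5 levels -> codes in {-2, ..., 2}
        bounded = math.tanh(value) * half   # squash into [-half, half]
        codes.append(int(round(bounded)))   # snap to the nearest integer level
    return codes

def fsq_index(codes, levels):
    """Pack per-dimension codes into a single token id in [0, prod(levels))."""
    index, basis = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index += (code + half) * basis      # shift code to be non-negative
        basis *= n_levels
    return index

def bits_per_second(levels, tokens_per_second):
    """Bitrate of a token stream whose vocabulary size is prod(levels)."""
    return tokens_per_second * math.log2(math.prod(levels))

# Hypothetical configuration: 3 latent dimensions with 5 levels each.
levels = [5, 5, 5]
codes = fsq_quantize([10.0, -10.0, 0.0], levels)
token = fsq_index(codes, levels)
rate = bits_per_second(levels, tokens_per_second=10)
```

Under these assumptions the bitrate is simply the token rate times the log2 of the implicit codebook size, which is how a figure such as a few hundred bits per second can be reasoned about.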

December 3, 2024 at 1:30 AM

selim

@onder.ai

1/13
Two years ago, Microsoft Speech published VALL-E, one of the most influential works that helped shift speech synthesis to a whole new scale. Earlier this year, the same group published VALL-E 2, which is, in my opinion, a perfect "lessons learned" paper.

December 1, 2024 at 4:22 PM

selim

@onder.ai

Vision Language Models Are Few-Shot Audio Spectrogram Classifiers

paper: openreview.net/forum?id=RnB...

We demonstrate that vision language models (VLMs) are capable of recognizing the content in audio recordings when given corresponding spectrogram images. Specifically, we instruct VLMs to perform...

openreview.net

November 23, 2024 at 8:06 PM
