Music Flamingo from NVIDIA - scaling music understanding in audio-language models beyond surface-level captions.
I believe speech synthesis is following the exact same path image generation took - and we’re about to see the same explosive leap.
github.com/nari-labs/dia 🧵
Very good release by Udio: fast generation was one of the features it lacked compared with Suno.
I'm still unsure about the trade-off between models, but I believe 1.5 is unmatched at "Ultra" quality.
Generate videos in Gemini and Whisk with Veo 2
paper: arxiv.org/abs/2411.198...
They built a pure transformer codec (1B params) with an FSQ bottleneck instead of the usual RVQ approach. At 400-700 bits per second, it produces extremely high-quality speech that gets close to the original audio.
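
To make the FSQ idea and the bitrate arithmetic concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the level counts, and the 50 tokens-per-second frame rate are all illustrative assumptions, and the rounding scheme is a simplified variant of finite scalar quantization (squash each latent channel into a bounded range, then snap it to one of a few fixed levels, with no learned codebook as in RVQ).

```python
import math

def fsq_quantize(latent, levels):
    """Simplified FSQ sketch: bound each channel with tanh, then round it
    to one of `levels[i]` evenly spaced values. Returns integer indices."""
    indices = []
    for value, n_levels in zip(latent, levels):
        unit = (math.tanh(value) + 1.0) / 2.0        # squash into (0, 1)
        indices.append(round(unit * (n_levels - 1)))  # snap to 0..n_levels-1
    return indices

def fsq_dequantize(indices, levels):
    """Map integer indices back to values in [-1, 1]."""
    return [idx / (n - 1) * 2.0 - 1.0 for idx, n in zip(indices, levels)]

def bits_per_token(levels):
    """Each channel with L levels carries log2(L) bits."""
    return sum(math.log2(n) for n in levels)

def bitrate_bps(levels, tokens_per_second):
    """Bitrate = bits per token * tokens emitted per second."""
    return bits_per_token(levels) * tokens_per_second

# Hypothetical configuration (NOT taken from the paper):
# 6 channels with 4 levels each, 50 tokens per second.
example_levels = [4, 4, 4, 4, 4, 4]
example_rate = bitrate_bps(example_levels, tokens_per_second=50)
```

With these assumed numbers, each token carries 12 bits, so 50 tokens per second works out to 600 bits per second, which is the order of magnitude the post mentions. The actual level counts and frame rate used in the paper may differ.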
Two years ago, Microsoft Speech published VALL-E, one of the most influential works that helped shift speech synthesis to a whole new scale. Last year, the same group published VALL-E 2, which is, in my opinion, a perfect "lessons learned" paper.
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers
paper: openreview.net/forum?id=RnB...