Moayed Haji Ali
@moayedha.bsky.social
PhD @RiceUniversity | Research Intern @Snap
While current approaches use external pretrained features (e.g., MetaCLIP, BEATs), we found that diffusion activations hold rich, semantically and temporally aware features, making them ideal for cross-modal generation in a self-contained framework.

🔊➡️📽️ Example:
January 14, 2025 at 6:13 PM
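To make the idea in the post above concrete, here is a minimal sketch, not the official AV-Link code, of the general recipe: tap intermediate activations from a frozen video diffusion backbone with forward hooks and use them, in place of CLIP- or BEATs-style embeddings, to condition an audio diffusion model. Every module and signature here (`video_diffusion.blocks`, `cond_proj`, the `context=` argument) is an assumption for illustration.

```python
# Hypothetical sketch: harvest diffusion activations as conditioning
# features. `video_diffusion`, `audio_diffusion`, and `cond_proj` are
# assumed stand-ins, not AV-Link's actual modules.
import torch
import torch.nn as nn


class ActivationTap:
    """Collect intermediate activations from chosen blocks via forward hooks."""

    def __init__(self, blocks: nn.ModuleList):
        self.features: list[torch.Tensor] = []
        self.handles = [b.register_forward_hook(self._hook) for b in blocks]

    def _hook(self, module, inputs, output):
        self.features.append(output)  # (batch, tokens, dim) per block

    def pop(self) -> list[torch.Tensor]:
        feats, self.features = self.features, []
        return feats

    def remove(self):
        for h in self.handles:
            h.remove()


def video_to_audio_step(video_diffusion, audio_diffusion, cond_proj,
                        noisy_video, noisy_audio, t):
    """One conditioned denoising step: video activations -> audio model."""
    tap = ActivationTap(video_diffusion.blocks)
    with torch.no_grad():                    # video backbone stays frozen
        video_diffusion(noisy_video, t)      # run it once to fill the tap
    feats = torch.cat(tap.pop(), dim=-1)     # concat features across blocks
    tap.remove()
    cond = cond_proj(feats)                  # project to the audio model's width
    # The audio model attends to the diffusion features (e.g. cross-attention).
    return audio_diffusion(noisy_audio, t, context=cond)
```

Because the conditioning features come from a generator itself rather than an external encoder, the framework stays self-contained.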
Besides Video-to-Audio (📽️ ➡️🔊), we also support Audio-to-Video (🔊➡️📽️) generation under the same unified framework.
January 14, 2025 at 6:13 PM
Compared to Meta Movie Gen's video-to-audio model, we achieve significantly better temporal synchronization with a model that is 90% smaller.
January 14, 2025 at 6:13 PM
Precise temporal synchronization remains a significant challenge for current video-to-audio models. AV-Link addresses this by leveraging diffusion features to accurately capture both local and global temporal events, such as hand slides on a guitar and pitch changes along the fretboard.
January 14, 2025 at 6:13 PM
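As a hedged illustration of how per-frame diffusion features can keep the two timelines locked (shapes and names are assumptions, not AV-Link's implementation): resample the video features to the audio latent frame rate so each audio frame attends to the matching moment in the video.

```python
# Illustrative alignment step; tensor shapes are assumptions.
import torch
import torch.nn.functional as F


def align_video_feats_to_audio(video_feats: torch.Tensor,
                               num_audio_frames: int) -> torch.Tensor:
    """Resample (batch, video_frames, dim) features to the audio timeline,
    returning (batch, num_audio_frames, dim)."""
    x = video_feats.transpose(1, 2)              # (batch, dim, video_frames)
    x = F.interpolate(x, size=num_audio_frames,
                      mode="linear", align_corners=False)
    return x.transpose(1, 2)                     # (batch, num_audio_frames, dim)
```

With features aligned frame-for-frame, local events like a hand slide land at the right instant in the generated audio.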
Can pretrained diffusion models be connected for cross-modal generation?

📢 Introducing AV-Link ♾️

Bridging unimodal diffusion models in one self-contained framework to enable:
📽️ ➡️ 🔊 Video-to-Audio generation.
🔊 ➡️ 📽️ Audio-to-Video generation.

🌐: snap-research.github.io/AVLink/

⤵️ Results
January 14, 2025 at 6:13 PM