Edson Araujo
edsonroteia.bsky.social
PhD Student at Goethe University Frankfurt
edsonroteia.github.io
Thanks to all my co-authors: Andrew Rouditchenko, Yuan Gong, Saurabh Bhati, Samuel Thomas, Brian Kingsbury, @leokarlin.bsky.social, Rogerio Feris, James Glass, @hildekuehne.bsky.social!
Collaboration through MIT-IBM Watson AI Lab 🚀
And thank you Adam Zewe for covering our work on MIT News!
🧵(7/7)
https://cvprconference.bsky.social
May 22, 2025 at 1:46 PM
✨ Dive deeper into CAV-MAE Sync:
🔗 Paper: arxiv.org/abs/2505.01237
🔗 Project Page: edsonroteia.github.io/cav-mae-sync/
🔗 Code: github.com/edsonroteia/...
🔗 MIT News: news.mit.edu/2025/ai-lear...
🧵(6/7)
📊 We evaluated CAV-MAE Sync on AudioSet, VGGSound, & ADE20K Sound:
➡️ Achieved strong results in zero-shot retrieval, classification, & localization.
➡️ Outperforms more complex architectures, demonstrating the power of our approach.
🧵(5/7)
➡️ Improved Spatial Localization: Learnable "register tokens" are added to reduce the semantic load on patch tokens, helping the model focus on finer details for reconstruction.
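A minimal numpy sketch of the register-token idea: extra learnable tokens are appended to the patch sequence before the encoder and discarded afterwards, so patch tokens are freed from carrying global context. All shapes and the initialization here are illustrative assumptions, and the encoder is a stand-in, not the actual CAV-MAE Sync model:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, num_registers = 196, 768, 8   # hypothetical sizes

patch_tokens = rng.normal(size=(num_patches, dim))
# learnable register tokens (toy initialization)
register_tokens = rng.normal(size=(num_registers, dim)) * 0.02

# registers are appended to the sequence before the encoder...
tokens = np.concatenate([patch_tokens, register_tokens], axis=0)
assert tokens.shape == (num_patches + num_registers, dim)

# ...attend alongside the patches, and are dropped on the way out,
# so patch tokens can stay specialized for local reconstruction
encoded = tokens                     # stand-in for the transformer encoder
patch_out = encoded[:num_patches]    # only patch tokens feed the decoder
assert patch_out.shape == (num_patches, dim)
```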
🧵(4/7)
💡 Our approach:
➡️ Fine-Grained Alignment: We treat audio as a temporal sequence, aligning it with individual video frames rather than a single coarse clip-level representation.
➡️ Decoupled Objectives: "Global tokens" separate the contrastive learning objective from patch-level MAE reconstruction.
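A toy sketch of the fine-grained alignment idea, assuming an InfoNCE-style contrastive loss where each audio segment's positive is the video frame at the same time step; the function name, shapes, and temperature are illustrative, not the paper's exact implementation:

```python
import numpy as np

def frame_level_contrastive_loss(audio, video, temperature=0.07):
    """Toy InfoNCE-style loss: positives sit on the diagonal of the
    per-time-step audio/video similarity matrix."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    logits = a @ v.T / temperature                  # (T, T) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8))                    # 4 time steps, dim 8
loss_aligned = frame_level_contrastive_loss(frames, frames)
loss_shuffled = frame_level_contrastive_loss(frames, frames[::-1])
assert loss_aligned < loss_shuffled                 # temporal match is rewarded
```

The point of the diagonal-positive setup: temporally misaligned pairs are penalized, which a single clip-level embedding cannot express.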
🧵(3/7)
Problems with the original CAV-MAE [Gong et al. 2023]:
🔹 Global audio representations fail to capture fine-grained temporal correspondences with visual frames.
🔹 Jointly learning reconstruction & cross-modal alignment can lead to suboptimal performance.
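The first problem can be seen in a two-line toy example: a single global (mean-pooled) audio embedding is order-invariant, so it cannot tell apart two clips whose events happen in a different temporal order. This is an illustrative numpy sketch, not code from the paper:

```python
import numpy as np

# two toy "audio clips" with the same segments in reversed temporal order
clip_a = np.array([[1., 0.], [0., 1.], [1., 1.]])
clip_b = clip_a[::-1]

# mean pooling to one global vector discards the ordering entirely:
assert np.allclose(clip_a.mean(axis=0), clip_b.mean(axis=0))
# ...even though the frame-level sequences clearly differ:
assert not np.allclose(clip_a, clip_b)
```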
🧵 (2/7)