Mengye Ren
@mengyer.bsky.social
@agentic-ai-lab.bsky.social
mengyeren.com
10/ This @agentic-ai-lab.bsky.social project was led by Alex Wang @alexnwang.bsky.social and Chris Hoang @choang.bsky.social, together with Yuwen Xiong, @yann-lecun.bsky.social, and @mengyer.bsky.social.
April 20, 2025 at 8:31 PM
9/ For more details, please check out our paper and website, or stop by our poster (Fri 10 AM, Hall 3 + Hall 2B #336) at ICLR!
Paper: arxiv.org/abs/2408.11208
Website: agenticlearning.ai/poodle/
PooDLe: Pooled and dense self-supervised learning from naturalistic videos
8/ We also study how data augmentation choices such as crop scale, input resolution, and the time between sampled frames can have a large impact on video pretraining.
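For concreteness, here is a hypothetical configuration listing these knobs; the names and values below are placeholders for illustration, not the settings reported in the paper.

```python
# Hypothetical pretraining configuration illustrating the knobs discussed above.
# All names and values are placeholders, not the paper's reported settings.
video_pretrain_cfg = {
    "global_crop_scale": (0.5, 1.0),   # fraction of the frame covered by the global crop
    "subcrop_scale": (0.10, 0.30),     # smaller pseudo-iconic subcrops
    "input_resolution": 448,           # input size in pixels after resizing
    "frame_gap_seconds": 0.5,          # time between the two sampled video frames
}
```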
7/ These performance differences manifest visually too! IN1K pretraining produces noisy segmentations and FlowE misses small objects, while PooDLe avoids both problems.
6/ Interestingly, we find that dense SSL performance is driven by large classes, whereas ImageNet pretraining does well on small, foreground classes.
PooDLe performs well on both small and large classes!
5/ PooDLe, pretrained on BDD100K and Walking Tours, outperforms prior iconic and dense SSL methods on semantic segmentation and object detection!
We also release WT-Sem, an in-distribution semantic segmentation task for Walking Tours.
4/ We also propose a spatial decoder module that upsamples the top-level features to a higher resolution for the dense loss. These features act as an information bottleneck: they satisfy the high-level invariance loss while remaining compatible with upsampling for the dense objective.
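A minimal sketch of a spatial decoder in this spirit, assuming PyTorch; the layer choices, channel sizes, and upsampling factor are illustrative rather than the paper's exact architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class SpatialDecoder(nn.Module):
    """Upsample pooled top-level features into a higher-resolution map for a dense loss."""
    def __init__(self, in_dim=2048, out_dim=256, scale=4):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(in_dim, out_dim, kernel_size=1)      # reduce channels
        self.refine = nn.Sequential(                                # light spatial refinement
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
        )

    def forward(self, top_feats):        # top_feats: (B, in_dim, h, w)
        x = self.proj(top_feats)
        x = F.interpolate(x, scale_factor=self.scale, mode="bilinear", align_corners=False)
        return self.refine(x)            # (B, out_dim, h*scale, w*scale)
```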
3/ PooDLe addresses these challenges by unifying a dense flow-equivariance objective over global crops with a view-invariance objective over smaller subcrops that serve as pseudo-iconic views. Crops are sampled from pairs of video frames, with motion acting as a natural augmentation.
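A rough sketch of how the two objectives can be combined, assuming a precomputed optical-flow sampling grid, placeholder feature tensors, and a simple backward-warping helper; the released PooDLe implementation differs in details.

```python
import torch.nn.functional as F

def warp(feat, grid):
    """Backward-warp a feature map (B, C, H, W) with a normalized sampling grid (B, H, W, 2)."""
    return F.grid_sample(feat, grid, align_corners=False)

def combined_loss(dense_t0, dense_t1, flow_grid, pooled_t0, pooled_t1, lam=1.0):
    # Dense flow-equivariance term: frame-t0 features, warped by optical flow,
    # should match frame-t1 features at corresponding spatial locations.
    warped = warp(dense_t0, flow_grid)
    dense_term = F.mse_loss(F.normalize(warped, dim=1), F.normalize(dense_t1, dim=1))
    # Pooled view-invariance term: subcrops act as pseudo-iconic views whose
    # pooled embeddings are pulled together (negative cosine similarity).
    inv_term = -(F.normalize(pooled_t0, dim=-1) * F.normalize(pooled_t1, dim=-1)).sum(-1).mean()
    return dense_term + lam * inv_term
```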
2/ Dense SSL methods account for multiple subjects by computing losses over corresponding spatial regions. However, we identify a new problem: spatial imbalance! Larger background regions like the sky are prioritized over smaller foreground objects like pedestrians.
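A toy illustration of the imbalance (an assumed setup, not code from the paper): averaging a per-location dense loss uniformly over the spatial grid lets a large, easy background region swamp a small, hard foreground region.

```python
import torch

def dense_loss_uniform(per_location_loss):    # per_location_loss: (B, H, W)
    return per_location_loss.mean()           # every location counts equally

# 90% "sky" locations with small loss vs. 10% "pedestrian" locations with large loss.
loss_map = torch.zeros(1, 10, 10)
loss_map[:, :, :9] = 0.1   # 90 background locations
loss_map[:, :, 9:] = 1.0   # 10 foreground locations
print(dense_loss_uniform(loss_map))  # ~0.19: the average mostly reflects the background
```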
1/ Many SSL methods revolve around ImageNet (iconic images with single subjects and balanced classes) and rely on invariance losses between augmented views. These methods can struggle on naturalistic videos, which contain multiple subjects of varying size and imbalanced classes.
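For reference, a minimal sketch of the kind of view-invariance objective these iconic-image methods rely on (a generic BYOL/SimSiam-style loss, not PooDLe's formulation; encoder and projector are placeholder modules).

```python
import torch.nn.functional as F

def invariance_loss(encoder, projector, view_a, view_b):
    # Embed two augmented views of the same image and pull their embeddings together.
    z_a = F.normalize(projector(encoder(view_a)), dim=-1)   # (B, D)
    z_b = F.normalize(projector(encoder(view_b)), dim=-1)   # (B, D)
    # Negative cosine similarity, minimized when the two embeddings align.
    return -(z_a * z_b).sum(dim=-1).mean()
```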