🚫 Existing segmentation models chain together complex components:
ViT → Adapter → Pixel Decoder → Transformer Decoder…
✅ EoMT removes them all.
It keeps only the ViT and adds a small set of learnable query tokens that guide it to predict masks; no decoder needed.
(2/6)
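For intuition, here is a minimal PyTorch sketch of that query-token idea. It is not the authors' implementation (see github.com/tue-mps/eomt for that): the module names, the 100-query default, and joining the queries only in the last 4 blocks are illustrative assumptions based on the description above.

```python
import torch
import torch.nn as nn

class EncoderOnlyMaskSegmenter(nn.Module):
    """Sketch: a plain ViT segments by jointly attending over patch and query tokens."""

    def __init__(self, vit_blocks, embed_dim=768, num_queries=100,
                 num_classes=133, num_joint_blocks=4):
        super().__init__()
        self.blocks = vit_blocks                        # the ViT's own transformer blocks
        self.split = len(vit_blocks) - num_joint_blocks # where queries join the sequence
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.class_head = nn.Linear(embed_dim, num_classes + 1)  # +1 "no object" class
        self.mask_head = nn.Sequential(                 # small MLP applied to query tokens
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patch_tokens):
        """patch_tokens: (B, N, D) output of the ViT patch embedding."""
        B = patch_tokens.shape[0]
        Q = self.queries.shape[0]
        x = patch_tokens
        for blk in self.blocks[: self.split]:           # early blocks: patches only
            x = blk(x)
        x = torch.cat([self.queries.expand(B, -1, -1), x], dim=1)  # append queries
        for blk in self.blocks[self.split :]:           # final blocks: queries and
            x = blk(x)                                  # patches attend jointly
        queries, patches = x[:, :Q], x[:, Q:]
        class_logits = self.class_head(queries)         # (B, Q, num_classes + 1)
        mask_logits = torch.einsum(                     # mask = query–patch similarity
            "bqd,bnd->bqn", self.mask_head(queries), patches)
        return class_logits, mask_logits
```

At inference time, each query proposes one segment: its mask logits are reshaped to the ViT patch grid and upsampled to the input resolution, and its class logits supply the label.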
✅ EoMT achieves an optimal trade-off between accuracy (PQ) 📊 and speed (FPS) ⚡ on COCO, thanks to its simple encoder-only design.
❌ No complex additional components.
❌ No bottlenecks.
🚀 Just performance.
(3/6)
Large ViTs pre-trained on rich visual data (like DINOv2 🦖) can learn the inductive biases needed for segmentation, with no extra components required.
✅ EoMT removes the clutter and lets the ViT do it all.
(4/6)
We’re excited to see what you build on top of it. 🛠️
🌐 Project: tue-mps.github.io/eomt
📝 Paper: arxiv.org/abs/2503.19108
💻 Code: github.com/tue-mps/eomt
🤗 Models: huggingface.co/tue-mps
(5/6)
👨‍🔬 Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, Daan de Geus
📍 TU Eindhoven, Polytechnic of Turin, RWTH Aachen University
#ComputerVision #DeepLearning #ViT #ImageSegmentation #EoMT #CVPR2025
(6/6)