TimDarcet
@timdarcet.bsky.social
PhD student, SSL for vision @ MetaAI & INRIA
tim.darcet.fr
Code and weights are Apache2 so don’t hesitate to try it out!
If you have torch you can load the models in a single line anywhere
The repo is a flat folder of like 10 files, it should be pretty readable
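For reference, that one-liner would look something like the sketch below; the hub repo and entrypoint names here are assumptions, check the repo README for the exact ones.

```python
import torch

# Hypothetical torch.hub entrypoint; substitute the actual repo/model
# identifiers from the CAPI README.
model = torch.hub.load("facebookresearch/capi", "capi_vitl14")
```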
February 14, 2025 at 6:05 PM
Qualitatively the features are pretty good imo
DINOv2+reg still has artifacts (despite my best efforts), while the MAE features are mostly color, not semantics (see shadow in the first image, or rightmost legs in the second)
CAPI has both semantic and smooth feature maps
February 14, 2025 at 6:05 PM
But segmentation is where CAPI really shines: it even beats DINOv2+reg in some cases, esp. on k-nn segmentation
Compared to baselines, again, it’s quite good, with eg +8 points compared to MAE trained on the same dataset
February 14, 2025 at 6:05 PM
We test 2 dataset settings: “pretrain on ImageNet1k”, and “pretrain on bigger datasets”
In both we significantly improve over previous models
Training on Places205 is better for scenes (P205, SUN) but worse for object-centric benchmarks
February 14, 2025 at 6:05 PM
Enough talk, we want numbers. I think they are really good! CAPI is not beating DINOv2+reg yet, but it sounds possible now. It closes most of the 4-point gap between previous MIM and DINOv2+reg, w/ encouraging scaling trends.
February 14, 2025 at 6:05 PM
Plenty of other ablations, have fun absorbing the signal. Also the registers are crucial, since we use our own feature maps as targets, so we really don’t want artifacts.
February 14, 2025 at 6:05 PM
Masking strategy: it makes a big diff.
“Inverse block” > “block” > “random”
*But* w/ inv. block, you will oversample the center to be masked out
→we propose a random circular shift (torch.roll). Prevents that, gives us a good boost.
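A minimal sketch of the torch.roll trick (illustrative, not the exact CAPI masking code): given any positionally biased boolean patch mask, a random circular shift evens out which positions end up masked while keeping the mask's shape.

```python
import torch

def random_circular_shift(mask: torch.Tensor) -> torch.Tensor:
    """mask: [H, W] boolean grid of patches, True = masked out.
    A random torch.roll wraps the mask around the edges, so every
    position is masked with the same probability while the
    (inverse-)block structure of the mask is preserved."""
    h, w = mask.shape
    shifts = (int(torch.randint(h, (1,))), int(torch.randint(w, (1,))))
    return torch.roll(mask, shifts=shifts, dims=(0, 1))
```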
February 14, 2025 at 6:05 PM
In practice, cross-attn works really well. Not mentioned in the table is that the cross-attn predictor is 18% faster than the self-attn predictor, and 44% faster than the fused one, so that’s a sweet bonus.
February 14, 2025 at 6:05 PM
3. Pred arch?
fused (a): 1 transformer w/ all tokens
split (b): enc w/ no [MSK], pred w/ all tokens
cross (c): no patch tokens in pred, cross-attend to them
Patch tokens are the encoder’s problem, [MSK] are the predictor’s problem. Tidy. Good.
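A rough sketch of design (c), illustrative rather than the paper's exact predictor: the predictor processes only [MSK] tokens as queries and cross-attends to the encoder's visible patch tokens.

```python
import torch
import torch.nn as nn

class CrossAttnPredictorBlock(nn.Module):
    """One block of a cross-attention predictor (design c): [MSK] tokens are
    the only queries; the encoder's visible patch tokens are keys/values.
    Dimensions and layer choices are illustrative."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, msk: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # msk: [B, M, dim] mask tokens; patches: [B, N, dim] encoder outputs
        q, kv = self.norm_q(msk), self.norm_kv(patches)
        msk = msk + self.attn(q, kv, kv, need_weights=False)[0]
        msk = msk + self.mlp(self.norm_mlp(msk))
        return msk
```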
February 14, 2025 at 6:05 PM
Empirically: using a direct loss is weaker, the iBOT loss does not work alone, and using a linear student head to predict the CAPI targets works better than an MLP head.
So we use exactly that.
February 14, 2025 at 6:05 PM
2. Loss?
“DINO head”: good results, too unstable
Idea: preds and targets have diff. distribs, so EMA head does not work on targets → need to separate the 2 heads
So we just use clustering on the target side instead, and it works
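One way to picture clustering on the target side, as a sketch only: assume a SwAV-style Sinkhorn-Knopp assignment of teacher patch features to learned prototypes. Names and details here are illustrative, not necessarily the exact CAPI recipe.

```python
import torch

@torch.no_grad()
def cluster_targets(teacher_feats, prototypes, eps=0.05, iters=3):
    """teacher_feats: [N, D] teacher patch features, prototypes: [K, D].
    Returns soft cluster assignments [N, K] (rows sum to 1), balanced
    across prototypes by Sinkhorn-Knopp iterations."""
    scores = teacher_feats @ prototypes.T        # [N, K] similarities
    q = torch.exp(scores / eps)
    q /= q.sum()
    n, k = q.shape
    for _ in range(iters):
        q /= q.sum(dim=0, keepdim=True) * k      # balance prototype usage
        q /= q.sum(dim=1, keepdim=True) * n      # normalize each patch
    return q * n                                 # rows sum to 1

# The student then predicts these assignments with a cross-entropy loss:
# loss = -(assignments * student_logits.log_softmax(-1)).sum(-1).mean()
```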
February 14, 2025 at 6:05 PM
1. Target representation?
MAE used raw pixels, BeiT a VQ-VAE. It works, it’s stable. But not good enough.
We use the model we are currently training. Promising (iBOT, D2V2), but super unstable. We do it not because it is easy but because it is hard etc
February 14, 2025 at 6:05 PM
Let’s dissect the anatomy of a masked image model a bit.
1. take an image, convert its patches to representations.
2. given part of this image, train a model to predict the content of the missing parts
3. measure a loss between pred and target
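Schematically, one training step mirroring these three ingredients could look like the generic sketch below (placeholders, not any specific method).

```python
def mim_training_step(image, patchify, target_fn, predictor, loss_fn, mask):
    """Generic masked image modeling step; all callables are placeholders
    for a concrete method's choices."""
    patches = patchify(image)                   # split the image into patches
    targets = target_fn(patches)                # 1. patches -> target representations
    preds = predictor(patches, mask)            # 2. predict content of the masked patches
    return loss_fn(preds[mask], targets[mask])  # 3. loss between predictions and targets
```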
February 14, 2025 at 6:05 PM
Language modeling solved NLP. So vision people have tried masked image modeling (MIM).
The issue? It’s hard. BeiT/MAE are not great for representations. iBOT works well, but is too unstable to train without DINO.
→Pure MIM lags behind DINOv2
February 14, 2025 at 6:05 PM
Want strong SSL, but not the complexity of DINOv2?
CAPI: Cluster and Predict Latent Patches for Improved Masked Image Modeling.
February 14, 2025 at 6:05 PM
Hash functions are really useful to uniquely encode stuff without collision huh
January 7, 2025 at 2:15 PM
At least there's diversity of opinions
December 27, 2024 at 6:56 PM
Excellent writeup on GPU streams / CUDA memory
dev-discuss.pytorch.org/t/fsdp-cudac...
TL;DR: by default, memory belongs to the stream it was allocated on; to share it:
- `Tensor.record_stream` -> automatic, but can be suboptimal and nondeterministic
- `Stream.wait` -> manual, but precise control
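A small illustration of the two options, assuming a CUDA device is available (sketch only):

```python
import torch

side = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")  # allocated on the default stream

with torch.cuda.stream(side):
    y = x * 2                            # `x` is now also used on `side`

# Option 1: record_stream -- tell the caching allocator that `x` is in use on
# `side`, so its memory won't be reused until that work completes. Automatic,
# but reuse timing is up to the allocator (can be suboptimal/nondeterministic).
x.record_stream(side)

# Option 2: explicit sync -- make the default stream wait for `side` before
# anything that might free/reuse the memory runs. Manual, but precise control.
torch.cuda.current_stream().wait_stream(side)
```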
November 24, 2024 at 10:04 PM
Vision transformers need registers!
Or at least, it seems they 𝘸𝘢𝘯𝘵 some…
ViTs have artifacts in attention maps. It’s due to the model using these patches as “registers”.
Just add new tokens (“[reg]”):
- no artifacts
- interpretable attention maps 🦖
- improved performance!
arxiv.org/abs/2309.16588
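The idea in a few lines, as a sketch over a generic ViT block stack (names and the backbone interface are illustrative): append a few extra learnable tokens the model can use as scratch space, then drop them at the output.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch of the register idea: extra learnable tokens run through the
    transformer alongside [CLS] and patch tokens, then get discarded."""
    def __init__(self, backbone_blocks: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.blocks = backbone_blocks  # any stack of transformer blocks over [B, N, dim]
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.num_registers = num_registers

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, dim] = [CLS] + patch tokens after embedding
        reg = self.registers.expand(tokens.shape[0], -1, -1)
        x = torch.cat([tokens, reg], dim=1)   # add [reg] tokens
        x = self.blocks(x)
        return x[:, : -self.num_registers]    # drop the registers at the output
```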
October 1, 2023 at 8:09 AM