https://valeoai.github.io/
R: Yes!
Introducing Driving on Registers (DrivoR):
a pure Transformer backbone that achieves SOTA results on NAVSIM v1 / v2 and in closed-loop HUGSIM evaluation.
Here is how 👇
Meet DrivoR (Driving on Registers): our latest end-to-end autonomous driving model.
We tore down the complex dependencies & modules of current models to
obtain a pure Transformer-based SOTA driving agent (NAVSIM v1 & v2, HUGSIM).
Find out more 👇
Congratulations to the whole team!
We were able to induce a more passive, safer driving style, which proved important for reaching SOTA performance on the rigorous NAVSIM-v2 benchmark. 🛡️
Thanks to the wizardry of Yihong Xu, we discovered that disentangling the tokens used for generation from those used for scoring was key.
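To illustrate the idea (this is a minimal sketch, not DrivoR's actual architecture; all names, dimensions, and head designs are our assumptions): the queries that generate trajectories and the queries that score them are kept as separate learnable embeddings, even though both attend to the same scene tokens.

```python
import torch
import torch.nn as nn

class DisentangledPlanner(nn.Module):
    """Hypothetical sketch: separate query tokens for trajectory generation and scoring."""
    def __init__(self, d_model=256, n_traj=20, n_waypoints=8):
        super().__init__()
        self.gen_queries = nn.Parameter(torch.randn(n_traj, d_model))    # generate candidates
        self.score_queries = nn.Parameter(torch.randn(n_traj, d_model))  # score candidates
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.to_waypoints = nn.Linear(d_model, n_waypoints * 2)  # (x, y) per waypoint
        self.to_score = nn.Linear(d_model, 1)

    def forward(self, scene_tokens):  # scene_tokens: (B, N, d_model), e.g. camera registers
        B = scene_tokens.size(0)
        queries = torch.cat([self.gen_queries, self.score_queries], dim=0)
        queries = queries.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(queries, scene_tokens)       # both query sets see the same scene,
        gen_out, score_out = out.chunk(2, dim=1)        # but keep distinct representations
        trajs = self.to_waypoints(gen_out)              # (B, n_traj, n_waypoints * 2)
        scores = self.to_score(score_out).squeeze(-1)   # (B, n_traj)
        return trajs, scores

# usage: scene = torch.randn(2, 32, 256); trajs, scores = DisentangledPlanner()(scene)
```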
We pay max attention to the road ahead (front camera), while only occasionally glancing at the rear (back camera).
Visualizing the attention maps confirms this: front tokens specialize; back tokens collapse to a single pattern.
Cosine similarity analysis reveals high differentiation for the front camera, while representations progressively "collapse" as we move toward the back camera.
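A rough sketch of this kind of collapse analysis, assuming you can extract the per-camera register tokens from the model (e.g. with forward hooks); the variable names and shapes below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(tokens):
    """tokens: (N, D) register tokens for one camera.
    Mean off-diagonal cosine similarity: close to 1 means the tokens have
    collapsed to a single pattern, lower means they specialize."""
    z = F.normalize(tokens, dim=-1)
    sim = z @ z.t()                                   # (N, N) cosine similarities
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # drop self-similarities
    return off_diag.mean().item()

# hypothetical dict of register tokens per camera, e.g. collected with forward hooks
registers = {cam: torch.randn(16, 256) for cam in ["front", "front_left", "back"]}
for cam, toks in registers.items():
    print(f"{cam}: mean pairwise cosine = {mean_pairwise_cosine(toks):.3f}")
```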
We imbue DINOv2 with registers, LoRA-finetuned on driving data, reducing the number of patch tokens by over 250x using camera-aware register tokens.
This efficiency could benefit future work on VLMs for driving
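As a back-of-the-envelope illustration (the module, layer counts, and token budget below are our assumptions, not the paper's exact design): give each camera a handful of learnable register tokens, pass them through the (LoRA-tuned) backbone alongside the patch tokens, and keep only the registers for the driving head.

```python
import torch
import torch.nn as nn

class CameraRegisters(nn.Module):
    """Hypothetical sketch: per-camera learnable register tokens appended to ViT patch
    tokens; only the registers are kept as the compact scene representation."""
    def __init__(self, n_cameras=8, n_registers=4, d_model=768):
        super().__init__()
        # camera-aware: each camera gets its own register embeddings
        self.registers = nn.Parameter(torch.randn(n_cameras, n_registers, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the backbone tail

    def forward(self, patch_tokens):  # (B, n_cameras, n_patches, D)
        B, C, P, D = patch_tokens.shape
        regs = self.registers.unsqueeze(0).expand(B, -1, -1, -1)
        x = torch.cat([patch_tokens, regs], dim=2).flatten(0, 1)  # (B*C, P+R, D)
        x = self.blocks(x)
        regs_out = x[:, P:].reshape(B, C, -1, D)    # keep registers, drop patch tokens
        return regs_out.flatten(1, 2)               # (B, C*R, D) compact scene tokens

# token budget (illustrative): 8 cameras x ~1000 patches ≈ 8000 tokens
# vs 8 x 4 = 32 registers, i.e. roughly a 250x reduction
```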
The talk will be live-streamed: www.hi-paris.fr/2025/09/26/a...
@abursuc.bsky.social taking the stage this afternoon!
The morning keynotes talked a lot about open source, so my slide here might be timely.
by: Y. Yin, S. Venkataramanan, T.H. Vu, A. Bursuc, M. Cord
📄: arxiv.org/abs/2509.04398
tl;dr: a PEFT method that improves upon LoRA by explicitly preserving information in the low-rank space
by: A. Gerontopoulos, S. Gidaris, N. Komodakis
📄: arxiv.org/abs/2505.10518
tl;dr: a simple way to enable multi-token prediction in LLMs by interleaving learnable "register tokens" into the input sequence to forecast future targets.
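Roughly, the idea could look like the minimal sketch below (our own simplification: one register token interleaved after every input position, trained to predict two tokens ahead; the paper's exact interleaving, offsets, and masking may differ).

```python
import torch
import torch.nn as nn

class MTPWithRegisters(nn.Module):
    """Hypothetical sketch: interleave a learnable register token after every input
    token so the model also forecasts a future target (multi-token prediction)."""
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.register = nn.Parameter(torch.randn(d_model))  # learnable register embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):                               # ids: (B, T)
        B, T = ids.shape
        x = self.embed(ids)                                # (B, T, D)
        reg = self.register.expand(B, T, -1)               # one register per position
        # interleave: [x1, r1, x2, r2, ...] -> (B, 2T, D)
        seq = torch.stack([x, reg], dim=2).reshape(B, 2 * T, -1)
        mask = torch.triu(torch.full((2 * T, 2 * T), float("-inf")), diagonal=1)  # causal
        h = self.blocks(seq, mask=mask)
        h_tok, h_reg = h[:, 0::2], h[:, 1::2]
        # token positions predict the next token; register positions predict a later one
        return self.lm_head(h_tok), self.lm_head(h_reg)

# usage: next_logits, future_logits = MTPWithRegisters()(torch.randint(0, 1000, (2, 16)))
```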
by: T. Kouzelis, E. Karypidis, I. Kakogeorgiou, S. Gidaris, N. Komodakis
📄: arxiv.org/abs/2504.16064
tl;dr: improve generation w/ a single diffusion model to jointly synthesize low-level latents & high-level semantic features
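To make the one-liner concrete, here is a toy sketch (our own simplification, not the paper's model): a single denoiser operates on the channel-wise concatenation of VAE-style image latents and DINO-style semantic features, so both are generated jointly. Channel counts, the forward process, and the missing timestep conditioning are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    """Toy denoiser over concatenated [image latent, semantic feature] channels."""
    def __init__(self, latent_ch=4, sem_ch=16):
        super().__init__()
        ch = latent_ch + sem_ch
        self.net = nn.Sequential(
            nn.Conv2d(ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, ch, 3, padding=1),
        )

    def forward(self, x_t, t):
        # a real model would also embed and condition on the timestep t; omitted here
        return self.net(x_t)

# one toy training step on the joint variable
latent, sem = torch.randn(2, 4, 32, 32), torch.randn(2, 16, 32, 32)
x0 = torch.cat([latent, sem], dim=1)           # jointly diffused variable
t = torch.rand(2, 1, 1, 1)                     # crude continuous "timestep"
noise = torch.randn_like(x0)
x_t = (1 - t) * x0 + t * noise                 # simple interpolation-style forward process
loss = ((JointDenoiser()(x_t, t) - noise) ** 2).mean()
```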
by: J. Parekh, P. Khayatan, M. Shukor, A. Dapogny, A. Newson, M. Cord
📄: arxiv.org/abs/2508.12815
tl;dr: steering multimodal LLMs (MLLMs) by training a lightweight auxiliary module to predict input-specific steering vectors
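A rough sketch of the mechanism (the module design, hook placement, and layer choice are our assumptions, not the paper's code): a small network maps a pooled representation of the current input to a steering vector that gets added to a chosen layer's hidden states.

```python
import torch
import torch.nn as nn

class SteeringPredictor(nn.Module):
    """Tiny auxiliary module: pooled hidden state -> input-specific steering vector."""
    def __init__(self, d_model=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))

    def forward(self, h):                  # h: (B, T, D) hidden states at some layer
        return self.net(h.mean(dim=1))     # (B, D): one steering vector per input

def add_steering(layer: nn.Module, predictor: SteeringPredictor):
    """Register a forward hook that shifts the layer's output by the predicted vector."""
    def hook(_module, _inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        steered = h + predictor(h).unsqueeze(1)        # broadcast over sequence length
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# usage (hypothetical model path): handle = add_steering(mllm.language_model.layers[10],
#                                                        SteeringPredictor(d_model=...))
```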
by E. Karypidis, I. Kakogeorgiou, S. Gidaris, N. Komodakis
📄: arxiv.org/abs/2412.11673
tl;dr: self-supervision by predicting future scene dynamics in the semantic feature space of foundation models (like DINO) rather than generating costly pixels.
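A minimal sketch of the general recipe (the predictor design, loss, and "last N tokens as prediction" convention are our assumptions, not the paper's exact setup): encode past frames with a frozen DINO-like model and regress the features of a future frame rather than its pixels.

```python
import torch
import torch.nn as nn

class FeatureForecaster(nn.Module):
    """Predict future-frame features from a short history of past-frame features."""
    def __init__(self, d_model=384, n_past=4):
        super().__init__()
        self.time_embed = nn.Parameter(torch.randn(n_past, 1, d_model))  # which frame a token is from
        layer = nn.TransformerEncoderLayer(d_model, nhead=6, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, past_feats):                      # (B, n_past, N_tokens, D)
        B, T, N, D = past_feats.shape                   # assumes T == n_past
        x = past_feats + self.time_embed.unsqueeze(0)   # add per-frame time embedding
        x = self.predictor(x.reshape(B, T * N, D))
        return x[:, -N:]                                # last N tokens = future-frame estimate

# training idea: feats come from a frozen encoder (e.g. DINO); the loss compares the
# prediction to frozen_encoder(future_frame) in feature space, e.g. cosine or smooth L1
past = torch.randn(2, 4, 196, 384)
future_pred = FeatureForecaster()(past)                 # (2, 196, 384)
```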
by P. Couairon, L. Chambon, L. Serrano, M. Cord, N. Thome
📄: arxiv.org/abs/2506.11136
tl;dr: lightweight, flexible, plug & play upsampler that scales features from any vision foundation model to arbitrary resolutions w/o needing high-res supervision
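One way such an upsampler can be structured (a generic cross-attention sketch in the spirit of the tl;dr, with our own module names and dimensions; see the paper for the actual design): queries built from the high-res image attend to the low-res foundation-model features, so any output resolution is possible without high-res supervision.

```python
import torch
import torch.nn as nn

class CrossAttnUpsampler(nn.Module):
    """Generic sketch: high-res image pixels query low-res foundation-model features."""
    def __init__(self, feat_dim=384, query_dim=64):
        super().__init__()
        self.to_query = nn.Conv2d(3, query_dim, 3, padding=1)   # queries from the RGB image
        self.to_key = nn.Linear(feat_dim, query_dim)
        self.scale = query_dim ** -0.5

    def forward(self, image, feats):        # image: (B, 3, H, W), feats: (B, C, h, w)
        B, C, h, w = feats.shape
        H, W = image.shape[-2:]
        q = self.to_query(image).flatten(2).transpose(1, 2)      # (B, H*W, query_dim)
        v = feats.flatten(2).transpose(1, 2)                     # (B, h*w, C)
        k = self.to_key(v)                                       # (B, h*w, query_dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = attn @ v                                           # (B, H*W, C)
        return out.transpose(1, 2).reshape(B, C, H, W)           # upsampled feature map

# usage: up = CrossAttnUpsampler()(torch.randn(1, 3, 128, 128), torch.randn(1, 384, 16, 16))
```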
We present 5 full papers + 1 workshop paper about:
💡 self-supervised & representation learning
🖼️ generative image models
🧠 finetuning and understanding LLMs & multimodal LLMs
🔍 feature upsampling
valeoai.github.io/posts/neurip...
We found an asymmetry in LoRA: during training, A changes little & B absorbs most of the task-specific adaptation.
So we pre-train A to preserve information before adaptation, w/ excellent parameter efficiency #NeurIPS2025 #CCFM 👇
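A hand-wavy sketch of the flavor of idea (not the actual IPA algorithm): instead of the usual random Gaussian A, initialize the LoRA down-projection from the principal directions of the layer's input activations, so the low-rank space preserves information before any task adaptation. The class names, shapes, and the plain SVD/PCA choice below are our assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus low-rank update B @ A (standard LoRA parametrization)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))        # up-projection

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def init_A_from_activations(lora: LoRALinear, activations: torch.Tensor):
    """Replace the random A with the top principal directions of the layer inputs,
    so the low-rank projection keeps as much of the incoming feature variance as possible."""
    X = activations - activations.mean(dim=0)          # (N, in_features), centered
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    rank = lora.A.shape[0]
    with torch.no_grad():
        lora.A.copy_(Vh[:rank])                        # rows = top right-singular vectors

# usage: collect activations from a few batches, then call init_A_from_activations(...)
lora = LoRALinear(nn.Linear(256, 256))
init_A_from_activations(lora, torch.randn(1024, 256))
```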
Finetuning large models is cheaper thanks to LoRA, but is its random init optimal? 🤔
Meet IPA: a feature-aware alternative to random projections
#NeurIPS2025 WS #CCFM Oral+Best Paper
Work w/
S. Venkataramanan @tuanhungvu.bsky.social @abursuc.bsky.social M. Cord
🧵
We were curious if we could train diffusion models on sets of point coordinates.
For images, this is a step towards spatial diffusion, with pixels reorganizing themselves, instead of diffusing in RGB value space only.
by: E. Kirby, @mickaelchen.bsky.social, R. Marlet, N. Samet
tl;dr: a diffusion-based method producing LiDAR point clouds of dataset objects, with extensive control over the generation
📄 arxiv.org/abs/2412.07385
Code: ✅
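A toy sketch of what "diffusing coordinates" means here (our own minimal version, not the paper's model): noise the xyz coordinates of a point set and train a permutation-equivariant network to denoise them. The architecture, forward process, and missing timestep conditioning are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointDenoiser(nn.Module):
    """Permutation-equivariant toy denoiser: a shared per-point MLP plus a global context."""
    def __init__(self, hidden=128):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.out = nn.Linear(2 * hidden, 3)

    def forward(self, pts, t):                           # pts: (B, N, 3), t: (B, 1, 1)
        # a real model would also embed and condition on the timestep t; omitted here
        h = self.point_mlp(pts)
        ctx = h.mean(dim=1, keepdim=True).expand_as(h)   # global shape context
        return self.out(torch.cat([h, ctx], dim=-1))     # predicted noise per point

# toy training step: the coordinates (not RGB values) are what gets noised and denoised
x0 = torch.randn(8, 1024, 3)                   # a batch of point clouds
t = torch.rand(8, 1, 1)
noise = torch.randn_like(x0)
x_t = (1 - t) * x0 + t * noise                 # simple interpolation-style forward process
loss = ((PointDenoiser()(x_t, t) - noise) ** 2).mean()
```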
NAF outperforms both VFM-specific upsamplers (FeatUp, JAFAR) and VFM-agnostic methods (JBU, AnyUp) across multiple downstream tasks
Introducing NAF: a universal, zero-shot feature upsampler.
It turns low-res ViT features into pixel-perfect maps.
- ⚡ Model-agnostic
- 🔥 SoTA results
- 4× faster than SoTA
- Scales up to 2K res
by LoΓ―ck Chambon (loickch.github.io), @paulcouairon.bsky.social, @eloizablocki.bsky.social, @alexandreboulch.bsky.social, @nicolasthome.bsky.social, @matthieucord.bsky.social
Collab with @mlia-isir.bsky.social
The repo contains:
✅ Pretrained model
✅ Example notebooks
✅ Evaluation and training code
Check it out & ⭐ the repo: github.com/valeoai/NAF
If you are using bilinear interpolation anywhere, NAF acts as a strict drop-in replacement.
Just swap it in. No retraining required. It's literally free points for your metrics.
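The swap looks roughly like the sketch below. The bilinear baseline uses the real torch.nn.functional API; the NAF call is a hypothetical placeholder, since the actual class/function names and arguments live in github.com/valeoai/NAF.

```python
import torch
import torch.nn.functional as F

feats = torch.randn(1, 384, 32, 32)    # low-res ViT features
image = torch.randn(1, 3, 512, 512)    # the corresponding high-res input image

# before: plain bilinear upsampling of the feature map
up_bilinear = F.interpolate(feats, size=(512, 512), mode="bilinear", align_corners=False)

# after: guided upsampling with NAF (hypothetical call; check the repo for the real
# class/function names and arguments before using)
# from naf import NAFUpsampler
# up_naf = NAFUpsampler()(feats, guidance=image, output_size=(512, 512))
```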