Here's my deep dive into the naming debate with citations from 15+ recent papers (link in thread)
Dataset for vision-language reasoning where the model *generates images during the CoT*. Example: for geometry problems, it's helpful to draw lines in image space.
182K CoT labels: math, visual search, robot planning, and more.
Only downside: cc-by-nc license :(
Fully open vision encoder. Masks image, encodes patches, then trains student to match teacher's clusters. Key advance: Matryoshka clustering. Each slice of the embedding gets its own projection head and clustering objective. Fewer features == fewer clusters to match.
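Rough sketch of how I read the Matryoshka objective; slice widths, cluster counts, and the soft-assignment loss are my guesses, not the paper's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaClusterHeads(nn.Module):
    """One projection head (= set of cluster logits) per embedding slice.
    Smaller slice -> fewer clusters to match. All hyperparameters are placeholders."""
    def __init__(self, dim=768, slice_widths=(96, 192, 384, 768),
                 n_clusters=(1024, 2048, 4096, 8192)):
        super().__init__()
        self.slice_widths = slice_widths
        self.heads = nn.ModuleList([nn.Linear(w, k) for w, k in zip(slice_widths, n_clusters)])

    def forward(self, student_patches, teacher_patches, temp=0.1):
        # student_patches / teacher_patches: (batch, n_patches, dim) from masked student / full teacher
        loss = 0.0
        for width, head in zip(self.slice_widths, self.heads):
            s_logits = head(student_patches[..., :width]) / temp
            with torch.no_grad():  # teacher provides the soft cluster assignment
                t_target = F.softmax(head(teacher_patches[..., :width]) / temp, dim=-1)
            loss = loss + F.cross_entropy(s_logits.flatten(0, 1), t_target.flatten(0, 1))
        return loss / len(self.heads)
```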
New benchmark of 1K videos, 1K captions, and 6K MCQs from accidents involving vulnerable road users (VRUs). Example: "why did the accident happen?" "(B): pedestrian moves or stays on the road."
Current VLMs get ~50-65% accuracy, much worse than humans (95%).
AMD paper: they find attention heads often have stereotyped sparsity patterns (e.g. only attending within an image, not across). They generate sparse attention variants for each prompt. Theoretically saves ~35% FLOPs at the cost of 1-2% on benchmarks.
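Toy illustration (mine, not theirs) of the kind of per-head mask this implies: a "within-image" head only attends inside each image's token span.

```python
import torch

def within_image_mask(image_spans, seq_len):
    """Boolean mask for F.scaled_dot_product_attention (True = may attend).
    Only allows attention inside each image's own token span; text tokens and
    the exact block layout are omitted for simplicity."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Two 196-token images: a "within-image" head never attends across them.
mask = within_image_mask([(0, 196), (196, 392)], seq_len=392)
```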
Nvidia paper scaling RL to long videos. First trains with SFT on a synthetic long CoT dataset, then does GRPO with up to 512 video frames. Uses cached image embeddings + sequence parallelism, speeding up rollouts >2X.
Bonus: code is already up!
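The caching trick is roughly this (my sketch, not their code): each video's frames go through the vision encoder once, and every rollout for that prompt reuses the embeddings.

```python
import torch

_embed_cache = {}  # video_id -> precomputed frame embeddings

def get_frame_embeddings(video_id, frames, vision_encoder):
    """Encode the (up to 512) frames once, reuse across all GRPO rollouts.
    Cache keying and eviction are left out; this is just the core idea."""
    if video_id not in _embed_cache:
        with torch.no_grad():
            _embed_cache[video_id] = vision_encoder(frames)  # (n_frames, n_tokens, dim)
    return _embed_cache[video_id]
```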
InternViT-6B stitched with QwQ-32B. SFT warmup, GRPO on math, then a small SFT fine-tune at the end.
Good benches, actual ablations, and interesting discussion.
Details: 🧵
I've been waiting for a paper like this! Trains the model to iteratively crop regions of interest to answer a question, and the only reward is the final answer.
Details in thread 👇
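Roughly the rollout loop I picture (the action format and `model.step` are my stand-ins, not the paper's API): the model either emits a crop box, which gets appended to its context as a zoomed-in view, or emits an answer, and only that final answer is rewarded.

```python
def rollout(model, image, question, max_steps=4):
    """Iterative-cropping rollout sketch. `image` is assumed to be a PIL image."""
    context = [image, question]
    for _ in range(max_steps):
        action = model.step(context)  # hypothetical: {"crop": (x0, y0, x1, y1)} or {"answer": str}
        if "answer" in action:
            return action["answer"]
        x0, y0, x1, y1 = action["crop"]
        context.append(image.crop((x0, y0, x1, y1)))  # zoomed-in region fed back as a new image
    return None  # ran out of steps -> no answer -> zero reward

# Reward = exact match on the final answer; the intermediate crops get no supervision.
```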
They synthesize high-risk scenes derived from nuPlan, rendering each scene as both a bird's eye view image and a front camera view.
👇
Instead of segment + postprocess, generate lane graphs autoregressively. Node == vertex in BEV space, edge == control point for Bezier curves. At each step, a vertex is added and the adjacency matrix gains one row + column.
They formulate this process as next token prediction. Neat!
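Minimal sketch of the graph bookkeeping (data structures are mine): each step appends one vertex and grows the adjacency matrix by a row and a column. Tokenization and the Bezier control points per edge are left out.

```python
import numpy as np

def add_vertex(vertices, adjacency, new_xy, connect_to):
    """Append one BEV vertex and grow the adjacency matrix by one row + column."""
    vertices = np.vstack([vertices, new_xy])                     # (n+1, 2)
    n = adjacency.shape[0]
    grown = np.zeros((n + 1, n + 1), dtype=adjacency.dtype)
    grown[:n, :n] = adjacency
    for j in connect_to:                                         # edges predicted for the new vertex
        grown[n, j] = grown[j, n] = 1
    return vertices, grown

verts = np.zeros((1, 2))                                         # start from a single vertex
adj = np.zeros((1, 1), dtype=np.int64)
verts, adj = add_vertex(verts, adj, np.array([5.0, 1.2]), connect_to=[0])
```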
Tons of hints but few ablations 😞 e.g. they upweight difficult-but-learnable samples every iteration, but don't show how it compares to the baseline.
9B variant beats Qwen2.5-VL-7B on many standard benchmarks.
Details in thread 👇
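For what it's worth, here's one plausible version of "difficult but learnable" weighting (my guess at a formula, not theirs): upweight samples the model solves about half the time, downweight ones it always or never gets.

```python
import numpy as np

def sample_weights(success_rates, floor=0.1):
    """Hypothetical weighting: p * (1 - p) peaks at p = 0.5 (hard but learnable)
    and vanishes for always-solved or never-solved samples. `floor` keeps every
    sample in play. Purely illustrative."""
    p = np.asarray(success_rates, dtype=float)
    w = p * (1.0 - p) + floor
    return w / w.sum()  # normalized sampling distribution for the next iteration

weights = sample_weights([0.0, 0.2, 0.5, 0.9, 1.0])
```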
Synthetic data only: SAM, APE for segmentation. Each crop is captioned and verified. VLMs stitch object captions into huge image captions.
Beats Sa2VA on referring expression segmentation. The dataset also improves Qwen2.5-VL on VQA benchmarks.
Data engine uses Grounding DINO + SAM2 for segmentation + tracking in images. Lidar -> voxels -> ray casting to match to pixels. Clustering to stitch visual masklets into 3D instances.
Model is like SAM2 but with a lidar encoder + motion-aware cross attention.
Simple idea: Input is multiple augmented images, either from video or image edits. Prompt: "are these images the same or different?" Train with GRPO.
Large bump in multi-image benchmarks, minor bump in general VQA / hallucination benches.
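The data construction is simple enough to sketch; the augmentation and prompt wording below are placeholders for whatever they actually use.

```python
import random
from PIL import ImageOps

PROMPT = "Are these two images the same scene or different scenes? Answer 'same' or 'different'."

def make_pair(frames):
    """One training example from a clip: two augmented views of the same frame
    ('same') or two different frames ('different'). GRPO then rewards the answer."""
    if random.random() < 0.5:
        frame = random.choice(frames)
        pair, label = (frame, ImageOps.mirror(frame)), "same"   # placeholder augmentation
    else:
        a, b = random.sample(frames, 2)
        pair, label = (a, b), "different"
    return {"images": pair, "prompt": PROMPT, "answer": label}
```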
Generates long CoT data with Multi-Model Monte Carlo Tree Search -- multiple candidate models for each step, evaluated by multiple LLM judges. DPO with separate losses on the "descriptive" caption and reasoning.
Huge improvements on spatial datasets, good performance on VQA.
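My reading of "separate losses on the caption and the reasoning," sketched on top of vanilla DPO; the equal weighting of the two spans is an assumption.

```python
import torch.nn.functional as F

def dpo_term(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on summed log-probs for one span."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def split_dpo_loss(caption_logps, reasoning_logps, w_caption=1.0, w_reason=1.0):
    """Each argument: dict of summed log-probs over that span only, with keys
    'pi_chosen', 'pi_rejected', 'ref_chosen', 'ref_rejected', shape (batch,)."""
    return (w_caption * dpo_term(**caption_logps)
            + w_reason * dpo_term(**reasoning_logps))
```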
Given a caption, generates follow-up questions and answers with a VLM. Compute P(sentence | image, prompt) - P(sentence | prompt). Sentences with low scores are only using their text prior, so filter them out.
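The score itself is easy to write down; how you get the two sets of logits out of the VLM (with and without the image) is model-specific and omitted here.

```python
import torch

def sentence_logprob(logits, input_ids, sentence_len):
    """Summed log-prob of the last `sentence_len` tokens of `input_ids` (the
    caption sentence) under a causal (V)LM. Shapes: logits (1, T, V), input_ids (1, T)."""
    logps = torch.log_softmax(logits[:, :-1], dim=-1)              # position t predicts token t+1
    tok_lp = logps.gather(-1, input_ids[:, 1:, None]).squeeze(-1)
    return tok_lp[:, -sentence_len:].sum()

def grounding_score(logits_img, ids_img, logits_txt, ids_txt, sentence_len):
    """log P(sentence | image, prompt) - log P(sentence | prompt).
    Low or negative -> the sentence rides on the text prior, so filter it out."""
    return (sentence_logprob(logits_img, ids_img, sentence_len)
            - sentence_logprob(logits_txt, ids_txt, sentence_len))
```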
VQ-quantize images, DCT-encode robotic actions, keep standard text tokens. Train on the whole interleaved sequence with NTP.
This type of approach would benefit from serious scaling. Anyone have a few thousand H100s laying around?
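Toy version of the action side (my own quantization choices, not the paper's codec): DCT each action chunk over time, keep the low frequencies, and uniformly quantize them into token ids that go into the same next-token stream as image and text tokens.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_actions(actions, n_coeffs=8, n_bins=256, lim=5.0):
    """actions: (timesteps, action_dim). Keep the first n_coeffs DCT coefficients
    per dimension and quantize into n_bins ids. All constants are placeholders."""
    coeffs = dct(actions, axis=0, norm="ortho")[:n_coeffs]        # low-frequency summary
    clipped = np.clip(coeffs, -lim, lim)
    tokens = np.round((clipped + lim) / (2 * lim) * (n_bins - 1)).astype(np.int64)
    return tokens.flatten()                                       # appended to the NTP sequence

def decode_actions(tokens, timesteps, action_dim, n_coeffs=8, n_bins=256, lim=5.0):
    coeffs = tokens.reshape(n_coeffs, action_dim) / (n_bins - 1) * (2 * lim) - lim
    full = np.zeros((timesteps, action_dim))
    full[:n_coeffs] = coeffs
    return idct(full, axis=0, norm="ortho")
```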
Examples: "Estimate the total distance covered by <object> in the video." "In what direction does <object> move in BEV coordinates?"
Open-source and closed-source models both do poorly, but their fine-tune does well.
Rewards: formatting, precision, recall. Adds +9 mAP to Qwen2.5-VL on COCO + ODINW 🤯. They change the threshold in a curriculum (easy -> hard), which adds ~2 points.
Repo has training code based on R1-V and TRL 👍
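My guess at the reward shape (matching and weights simplified): a format check plus precision/recall of predicted boxes against ground truth at an IoU threshold, with the curriculum tightening that threshold over training.

```python
def iou(a, b):
    """IoU of two xyxy boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def detection_reward(pred_boxes, gt_boxes, parsed_ok, iou_thresh=0.5):
    """Greedy one-to-one matching at `iou_thresh`; reward mixes format,
    precision, and recall terms. Weights and matching are my assumptions."""
    if not parsed_ok:
        return 0.0
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = None, iou_thresh
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    return 0.2 + 0.4 * precision + 0.4 * recall  # 0.2 for well-formed output
```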
Move over, Stanford Cars: a new dataset with 1,000 hierarchical classes and 140,312 samples! Scraped from a Chinese car-enthusiast forum, it covers a wide variety of cars from around the world.
BEV-LLM: baseline model. Multi-view video -> BEVFusion -> cross-attend with image features -> projector -> LLaMA3.2.
Some helpful ablations on # views and frames.
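Rough shape of the fusion block as I read it (dims and module choices are mine): BEV tokens cross-attend to image features, then a projector maps them into the LLM's embedding width.

```python
import torch.nn as nn

class BEVToLLMConnector(nn.Module):
    """BEV features cross-attend to image features, then project to the LLM width.
    Layer sizes are placeholders, not the paper's config."""
    def __init__(self, bev_dim=256, img_dim=1024, llm_dim=4096, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(bev_dim, n_heads,
                                                kdim=img_dim, vdim=img_dim,
                                                batch_first=True)
        self.projector = nn.Sequential(nn.Linear(bev_dim, llm_dim), nn.GELU(),
                                       nn.Linear(llm_dim, llm_dim))

    def forward(self, bev_tokens, img_tokens):
        # bev_tokens: (B, n_bev, bev_dim); img_tokens: (B, n_img, img_dim)
        fused, _ = self.cross_attn(bev_tokens, img_tokens, img_tokens)
        return self.projector(bev_tokens + fused)  # token embeddings fed to LLaMA3.2
```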
Images -> encoder -> 4096 discrete trajectory vocab -> transformer -> bicycle model -> denoising diffusion refinement -> best trajectory selection.
Much better closed-loop perf than UniAD, VAD
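For reference, the kinematic bicycle model trajectories presumably get rolled through (textbook formulation; step size and state layout are mine):

```python
import numpy as np

def bicycle_step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """Kinematic bicycle model; state = (x, y, heading, speed)."""
    x, y, yaw, v = state
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += v / wheelbase * np.tan(steer) * dt
    v += accel * dt
    return np.array([x, y, yaw, v])

def rollout(state, controls):
    """Integrate a control sequence [(accel, steer), ...] into a trajectory."""
    traj = [state]
    for accel, steer in controls:
        state = bicycle_step(state, accel, steer)
        traj.append(state)
    return np.stack(traj)
```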
Nuts and bolts: use Ray + NVDEC for curation, S3 + WebDataset for dataloading, FSDP + TP + Context Parallelism + PP for video DiT training.
48.2% MFU
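For the dataloading piece, the WebDataset-over-S3 pattern usually looks something like this (shard URL and keys are placeholders):

```python
import webdataset as wds

# Shards streamed from S3 via the aws CLI; URL pattern and keys are made up.
shards = "pipe:aws s3 cp s3://my-bucket/video-shards/shard-{000000..000999}.tar -"
dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)             # shard-local shuffle buffer
    .to_tuple("mp4", "json")   # raw video bytes + metadata; decoding left to the trainer
)
loader = wds.WebLoader(dataset, batch_size=None, num_workers=8)
```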
Data from nuPlan. 116K train, 21K test. Render a BEV map + generate VQA with LLMs(?)
Applications: VQA, use the model to condition a scene generator.
Simple idea: humans have preferences for candidate trajectories. Collect human feedback on key frames from 4.5K clips mined for aggressive maneuvers.
Improves these aggressive scenarios but worsens some open-loop metrics.
Dataset is 18K QA pairs, each with step-by-step reasoning. Generated with GPT-4o and human-verified. Model is a fine-tuned InternVL2.5-8B.
Nit: don't call your model o1 if you don't use RLVR!