Here's my deep dive into the naming debate with citations from 15+ recent papers (link in thread)
Dataset for vision-language reasoning where the model *generates images during the CoT*. Example: for geometry problems, it's helpful to draw lines in image space.
182K CoT labels: math, visual search, robot planning, and more.
Only downside: cc-by-nc license :(
Fully open vision encoder. Masks image, encodes patches, then trains student to match teacher's clusters. Key advance: Matryoshka clustering. Each slice of the embedding gets its own projection head and clustering objective. Fewer features == fewer clusters to match.
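Rough sketch of how I read the Matryoshka objective; slice widths, cluster counts, and the soft-assignment loss are my guesses, not the paper's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatryoshkaClusterHeads(nn.Module):
    """One projection head (= set of cluster logits) per embedding slice.
    Smaller slice -> fewer clusters to match. All hyperparameters are placeholders."""
    def __init__(self, dim=768, slice_widths=(96, 192, 384, 768),
                 n_clusters=(1024, 2048, 4096, 8192)):
        super().__init__()
        self.slice_widths = slice_widths
        self.heads = nn.ModuleList([nn.Linear(w, k) for w, k in zip(slice_widths, n_clusters)])

    def forward(self, student_patches, teacher_patches, temp=0.1):
        # student_patches / teacher_patches: (batch, n_patches, dim) from masked student / full teacher
        loss = 0.0
        for width, head in zip(self.slice_widths, self.heads):
            s_logits = head(student_patches[..., :width]) / temp
            with torch.no_grad():  # teacher provides the soft cluster assignment
                t_target = F.softmax(head(teacher_patches[..., :width]) / temp, dim=-1)
            loss = loss + F.cross_entropy(s_logits.flatten(0, 1), t_target.flatten(0, 1))
        return loss / len(self.heads)
```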
New benchmark of 1K videos, 1K captions, and 6K MCQs from accidents involving vulnerable road users (VRUs). Example: "why did the accident happen?" "(B): pedestrian moves or stays on the road."
Current VLMs get ~50-65% accuracy, much worse than humans (95%).
AMD paper: they find attention heads often have stereotyped sparsity patterns (e.g. only attending within an image, not across). They generate sparse attention variants for each prompt. Theoretically saves ~35% FLOPs at the cost of 1-2% on benchmarks.
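Toy illustration (mine, not theirs) of the kind of per-head mask this implies: a "within-image" head only attends inside each image's token span.

```python
import torch

def within_image_mask(image_spans, seq_len):
    """Boolean mask for F.scaled_dot_product_attention (True = may attend).
    Only allows attention inside each image's own token span; text tokens and
    the exact block layout are omitted for simplicity."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Two 196-token images: a "within-image" head never attends across them.
mask = within_image_mask([(0, 196), (196, 392)], seq_len=392)
```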
Nvidia paper scaling RL to long videos. First trains with SFT on a synthetic long CoT dataset, then does GRPO with up to 512 video frames. Uses cached image embeddings + sequence parallelism, speeding up rollouts >2X.
Bonus: code is already up!
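The caching trick is roughly this (my sketch, not their code): each video's frames go through the vision encoder once, and every rollout for that prompt reuses the embeddings.

```python
import torch

_embed_cache = {}  # video_id -> precomputed frame embeddings

def get_frame_embeddings(video_id, frames, vision_encoder):
    """Encode the (up to 512) frames once, reuse across all GRPO rollouts.
    Cache keying and eviction are left out; this is just the core idea."""
    if video_id not in _embed_cache:
        with torch.no_grad():
            _embed_cache[video_id] = vision_encoder(frames)  # (n_frames, n_tokens, dim)
    return _embed_cache[video_id]
```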
InternViT-6B stitched with QwQ-32B. SFT warmup, GRPO on math, then a small SFT fine-tune at the end.
Good benches, actual ablations, and interesting discussion.
Details: 🧵
I've been waiting for a paper like this! Trains the model to iteratively crop regions of interest to answer a question, and the only reward is the final answer.
Details in thread 👇
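Roughly the rollout loop I picture (the action format and `model.step` are my stand-ins, not the paper's API): the model either emits a crop box, which gets appended to its context as a zoomed-in view, or emits an answer, and only that final answer is rewarded.

```python
def rollout(model, image, question, max_steps=4):
    """Iterative-cropping rollout sketch. `image` is assumed to be a PIL image."""
    context = [image, question]
    for _ in range(max_steps):
        action = model.step(context)  # hypothetical: {"crop": (x0, y0, x1, y1)} or {"answer": str}
        if "answer" in action:
            return action["answer"]
        x0, y0, x1, y1 = action["crop"]
        context.append(image.crop((x0, y0, x1, y1)))  # zoomed-in region fed back as a new image
    return None  # ran out of steps -> no answer -> zero reward

# Reward = exact match on the final answer; the intermediate crops get no supervision.
```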
They synthesize high-risk scenes derived from nuPlan, rendering each scene as both a bird's eye view image and a front camera view.
👇
Instead of segment + postprocess, generate lane graphs autoregressively. Node == vertex in BEV space, edge == control point for Bezier curves. At each step, a vertex is added and the adjacency matrix gains one row + column.
They formulate this process as next token prediction. Neat!
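Minimal sketch of the graph bookkeeping (data structures are mine): each step appends one vertex and grows the adjacency matrix by a row and a column. Tokenization and the Bezier control points per edge are left out.

```python
import numpy as np

def add_vertex(vertices, adjacency, new_xy, connect_to):
    """Append one BEV vertex and grow the adjacency matrix by one row + column."""
    vertices = np.vstack([vertices, new_xy])                     # (n+1, 2)
    n = adjacency.shape[0]
    grown = np.zeros((n + 1, n + 1), dtype=adjacency.dtype)
    grown[:n, :n] = adjacency
    for j in connect_to:                                         # edges predicted for the new vertex
        grown[n, j] = grown[j, n] = 1
    return vertices, grown

verts = np.zeros((1, 2))                                         # start from a single vertex
adj = np.zeros((1, 1), dtype=np.int64)
verts, adj = add_vertex(verts, adj, np.array([5.0, 1.2]), connect_to=[0])
```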
Tons of hints but few ablations 😞 e.g. they upweight difficult-but-learnable samples every iteration, but don't show how it compares to the baseline.
9B variant beats Qwen2.5-VL-7B on many standard benchmarks.
Details in thread 👇
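For what it's worth, here's one plausible version of "difficult but learnable" weighting (my guess at a formula, not theirs): upweight samples the model solves about half the time, downweight ones it always or never gets.

```python
import numpy as np

def sample_weights(success_rates, floor=0.1):
    """Hypothetical weighting: p * (1 - p) peaks at p = 0.5 (hard but learnable)
    and vanishes for always-solved or never-solved samples. `floor` keeps every
    sample in play. Purely illustrative."""
    p = np.asarray(success_rates, dtype=float)
    w = p * (1.0 - p) + floor
    return w / w.sum()  # normalized sampling distribution for the next iteration

weights = sample_weights([0.0, 0.2, 0.5, 0.9, 1.0])
```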
Synthetic data only: SAM, APE for segmentation. Each crop is captioned and verified. VLMs stitch object captions into huge image captions.
Beats Sa2VA on referring expression segmentation. The dataset also improves Qwen2.5-VL on VQA benchmarks.
Data engine uses Grounding DINO + SAM2 for segmentation + tracking in images. Lidar -> voxels -> ray casting to match to pixels. Clustering to stitch visual masklets into 3D instances.
Model is like SAM2 but with a lidar encoder + motion-aware cross attention.
Simple idea: Input is multiple augmented images, either from video or image edits. Prompt: "are these images the same or different?" Train with GRPO.
Large bump in multi-image benchmarks, minor bump in general VQA / hallucination benches.
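The data construction is simple enough to sketch; the augmentation and prompt wording below are placeholders for whatever they actually use.

```python
import random
from PIL import ImageOps

PROMPT = "Are these two images the same scene or different scenes? Answer 'same' or 'different'."

def make_pair(frames):
    """One training example from a clip: two augmented views of the same frame
    ('same') or two different frames ('different'). GRPO then rewards the answer."""
    if random.random() < 0.5:
        frame = random.choice(frames)
        pair, label = (frame, ImageOps.mirror(frame)), "same"   # placeholder augmentation
    else:
        a, b = random.sample(frames, 2)
        pair, label = (a, b), "different"
    return {"images": pair, "prompt": PROMPT, "answer": label}
```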
Generates long CoT data with Multi-Model Monte Carlo Tree Search -- multiple candidate models for each step, evaluated by multiple LLM judges. DPO with separate losses on the "descriptive" caption and reasoning.
Huge improvements on spatial datasets, good performance on VQA.
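My reading of "separate losses on the caption and the reasoning," sketched on top of vanilla DPO; the equal weighting of the two spans is an assumption.

```python
import torch.nn.functional as F

def dpo_term(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on summed log-probs for one span."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def split_dpo_loss(caption_logps, reasoning_logps, w_caption=1.0, w_reason=1.0):
    """Each argument: dict of summed log-probs over that span only, with keys
    'pi_chosen', 'pi_rejected', 'ref_chosen', 'ref_rejected', shape (batch,)."""
    return (w_caption * dpo_term(**caption_logps)
            + w_reason * dpo_term(**reasoning_logps))
```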
Given a caption, generates follow-up questions and answers with a VLM. Compute P(sentence | image, prompt) - P(sentence | prompt). Sentences with low scores are only using their text prior, so filter them out.
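The score itself is easy to write down; how you get the two sets of logits out of the VLM (with and without the image) is model-specific and omitted here.

```python
import torch

def sentence_logprob(logits, input_ids, sentence_len):
    """Summed log-prob of the last `sentence_len` tokens of `input_ids` (the
    caption sentence) under a causal (V)LM. Shapes: logits (1, T, V), input_ids (1, T)."""
    logps = torch.log_softmax(logits[:, :-1], dim=-1)              # position t predicts token t+1
    tok_lp = logps.gather(-1, input_ids[:, 1:, None]).squeeze(-1)
    return tok_lp[:, -sentence_len:].sum()

def grounding_score(logits_img, ids_img, logits_txt, ids_txt, sentence_len):
    """log P(sentence | image, prompt) - log P(sentence | prompt).
    Low or negative -> the sentence rides on the text prior, so filter it out."""
    return (sentence_logprob(logits_img, ids_img, sentence_len)
            - sentence_logprob(logits_txt, ids_txt, sentence_len))
```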
VQ-quantize images, DCT-encode robotic actions, keep standard text tokens. Train on the whole interleaved sequence with NTP.
This type of approach would benefit from serious scaling. Anyone have a few thousand H100s laying around?
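Toy version of the action side (my own quantization choices, not the paper's codec): DCT each action chunk over time, keep the low frequencies, and uniformly quantize them into token ids that go into the same next-token stream as image and text tokens.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_actions(actions, n_coeffs=8, n_bins=256, lim=5.0):
    """actions: (timesteps, action_dim). Keep the first n_coeffs DCT coefficients
    per dimension and quantize into n_bins ids. All constants are placeholders."""
    coeffs = dct(actions, axis=0, norm="ortho")[:n_coeffs]        # low-frequency summary
    clipped = np.clip(coeffs, -lim, lim)
    tokens = np.round((clipped + lim) / (2 * lim) * (n_bins - 1)).astype(np.int64)
    return tokens.flatten()                                       # appended to the NTP sequence

def decode_actions(tokens, timesteps, action_dim, n_coeffs=8, n_bins=256, lim=5.0):
    coeffs = tokens.reshape(n_coeffs, action_dim) / (n_bins - 1) * (2 * lim) - lim
    full = np.zeros((timesteps, action_dim))
    full[:n_coeffs] = coeffs
    return idct(full, axis=0, norm="ortho")
```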
Examples: "Estimate the total distance covered by <object> in the video." "In what direction does <object> move in BEV coordinates?"
Open-source and closed-source models both do poorly, but their fine-tune does well.
Rewards: formatting, precision, recall. Adds +9 mAP to Qwen2.5-VL on COCO + ODINW 🤯. They change the threshold in a curriculum (easy -> hard), which adds ~2 points.
Repo has training code based on R1-V and TRL 👍
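My guess at the reward shape (matching and weights simplified): a format check plus precision/recall of predicted boxes against ground truth at an IoU threshold, with the curriculum tightening that threshold over training.

```python
def iou(a, b):
    """IoU of two xyxy boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def detection_reward(pred_boxes, gt_boxes, parsed_ok, iou_thresh=0.5):
    """Greedy one-to-one matching at `iou_thresh`; reward mixes format,
    precision, and recall terms. Weights and matching are my assumptions."""
    if not parsed_ok:
        return 0.0
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = None, iou_thresh
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    precision = tp / max(len(pred_boxes), 1)
    recall = tp / max(len(gt_boxes), 1)
    return 0.2 + 0.4 * precision + 0.4 * recall  # 0.2 for well-formed output
```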
Move over, Stanford Cars: a new dataset with 1,000 hierarchical classes and 140,312 samples! Scraped from a Chinese car-enthusiast forum, it covers a wide variety of cars from around the world.
BEV-LLM: baseline model. Multi-view video -> BEVFusion -> cross-attend with image features -> projector -> LLaMA3.2.
Some helpful ablations on # views and frames.
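Rough shape of the fusion block as I read it (dims and module choices are mine): BEV tokens cross-attend to image features, then a projector maps them into the LLM's embedding width.

```python
import torch.nn as nn

class BEVToLLMConnector(nn.Module):
    """BEV features cross-attend to image features, then project to the LLM width.
    Layer sizes are placeholders, not the paper's config."""
    def __init__(self, bev_dim=256, img_dim=1024, llm_dim=4096, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(bev_dim, n_heads,
                                                kdim=img_dim, vdim=img_dim,
                                                batch_first=True)
        self.projector = nn.Sequential(nn.Linear(bev_dim, llm_dim), nn.GELU(),
                                       nn.Linear(llm_dim, llm_dim))

    def forward(self, bev_tokens, img_tokens):
        # bev_tokens: (B, n_bev, bev_dim); img_tokens: (B, n_img, img_dim)
        fused, _ = self.cross_attn(bev_tokens, img_tokens, img_tokens)
        return self.projector(bev_tokens + fused)  # token embeddings fed to LLaMA3.2
```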
Images -> encoder -> 4096 discrete trajectory vocab -> transformer -> bicycle model -> denoising diffusion refinement -> best trajectory selection.
Much better closed-loop perf than UniAD, VAD
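For reference, the kinematic bicycle model trajectories presumably get rolled through (textbook formulation; step size and state layout are mine):

```python
import numpy as np

def bicycle_step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """Kinematic bicycle model; state = (x, y, heading, speed)."""
    x, y, yaw, v = state
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += v / wheelbase * np.tan(steer) * dt
    v += accel * dt
    return np.array([x, y, yaw, v])

def rollout(state, controls):
    """Integrate a control sequence [(accel, steer), ...] into a trajectory."""
    traj = [state]
    for accel, steer in controls:
        state = bicycle_step(state, accel, steer)
        traj.append(state)
    return np.stack(traj)
```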
Nuts and bolts: use Ray + NVDEC for curation, S3 + WebDataset for dataloading, FSDP + TP + Context Parallelism + PP for video DiT training.
48.2% MFU
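For the dataloading piece, the WebDataset-over-S3 pattern usually looks something like this (shard URL and keys are placeholders):

```python
import webdataset as wds

# Shards streamed from S3 via the aws CLI; URL pattern and keys are made up.
shards = "pipe:aws s3 cp s3://my-bucket/video-shards/shard-{000000..000999}.tar -"
dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)             # shard-local shuffle buffer
    .to_tuple("mp4", "json")   # raw video bytes + metadata; decoding left to the trainer
)
loader = wds.WebLoader(dataset, batch_size=None, num_workers=8)
```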
Data from nuPlan. 116K train, 21K test. Render a BEV map + generate VQA with LLMs(?)
Applications: VQA, use the model to condition a scene generator.
Simple idea: humans have preferences for candidate trajectories. Collect human feedback on key frames from 4.5K clips mined for aggressive maneuvers.
Improves these aggressive scenarios but worsens some open-loop metrics.
Dataset is 18K QA pairs, each with step-by-step reasoning. Generated with GPT-4o and human-verified. Model is a fine-tuned InternVL2.5-8B.
Nit: don't call your model o1 if you don't use RLVR!