Lightnews — Scholar-powered news

Jim RB

@jbohnslav.bsky.social

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
arxiv: arxiv.org/abs/2507.16746
data: huggingface.co/datasets/mul...

Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging ...

arxiv.org

July 24, 2025 at 1:01 PM

Jim RB

@jbohnslav.bsky.social

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
arxiv: arxiv.org/abs/2507.14137
code: github.com/valeoai/Franca

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art...

arxiv.org

July 23, 2025 at 12:17 PM

Jim RB

@jbohnslav.bsky.social

Cool technique: RASA, Removal of Absolute Spatial Attributes. They decode grid coords and find the plane in feature space that encodes position. They basically subtract this off, baking it into the last linear layer to leave the forward pass unchanged.

July 23, 2025 at 12:17 PM

Jim RB

@jbohnslav.bsky.social

Beats or is competitive to SigLIP/2, DinoV2 on linear eval, OOD detection, linear segmentation.

July 23, 2025 at 12:17 PM

Jim RB

@jbohnslav.bsky.social

VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding
arxiv: arxiv.org/abs/2507.098...
project: vru-accident.github.io

VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding

Ensuring the safety of vulnerable road users (VRUs), such as pedestrians and cyclists, is a critical challenge for autonomous driving systems, as crashes involving VRUs often result in severe or fatal...

arxiv.org

July 15, 2025 at 3:13 PM

Jim RB

@jbohnslav.bsky.social

BlindSight: Harnessing Sparsity for Efficient VLMs
arxiv: arxiv.org/abs/2507.090...

BlindSight: Harnessing Sparsity for Efficient VLMs

Large vision-language models (VLMs) enable the joint processing of text and images. However, the inclusion of vision data significantly expands the prompt length. Along with the quadratic complexity o...

arxiv.org

July 15, 2025 at 1:56 PM

Jim RB

@jbohnslav.bsky.social

Side note: I've always liked Pali/Gemma's Prefix-LM masking. Why have causal attention for image tokens?

July 15, 2025 at 1:56 PM

Jim RB

@jbohnslav.bsky.social

Scaling RL to Long Videos
arxiv: arxiv.org/abs/2507.07966
code: github.com/NVlabs/Long-RL

Scaling RL to Long Videos

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasonin...

arxiv.org

July 11, 2025 at 1:18 PM

Jim RB

@jbohnslav.bsky.social

Skywork-R1V3 Technical Report
arxiv: arxiv.org/abs/2507.06167
code: github.com/SkyworkAI/Sk...

Skywork-R1V3 Technical Report

We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills f...

arxiv.org

July 9, 2025 at 3:41 PM

Jim RB

@jbohnslav.bsky.social

They identify entropy of "wait" or "alternatively" to be strongly correlated with MMMU. Neat!

July 9, 2025 at 3:41 PM

Jim RB

@jbohnslav.bsky.social

Fine-tuning the connector at the end gives a point or two of MMMU. I wonder how much of this is benchmaxxing--I haven't seen an additional SFT stage after RL before.

July 9, 2025 at 3:41 PM

Jim RB

@jbohnslav.bsky.social

They construct their warm-start SFT data with synthetic traces from Skywork-R1V2.

GRPO is pretty standard, interesting that they just did math instead of math, grounding, other possible RLVR tasks. Qwen-2.5-Instruct 32B to judges the accuracy of the answer in addition to rule-based verification.

July 9, 2025 at 3:41 PM

Jim RB

@jbohnslav.bsky.social

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
arxiv: arxiv.org/abs/2507.05920
code: github.com/EvolvingLMMs...

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into enormous visual tokens, many of which are irrelevant to the ...

arxiv.org

July 9, 2025 at 3:24 PM

Jim RB

@jbohnslav.bsky.social

Training: use verl with vLLM for rollouts. Limit image resolution to 1280 visual tokens. Train on 32 H100s.

Results: +18 points better on V* compared to Qwen2.5-VL, and +5 points better than GRPO alone.

July 9, 2025 at 3:24 PM

Jim RB

@jbohnslav.bsky.social

RL: GRPO. Reward: only correct answer, not valid grounding coordinates. Seems weird to not add that though.

Data: training subset of MME-RealWorld. Evaluate on V*.

July 9, 2025 at 3:24 PM

Jim RB

@jbohnslav.bsky.social

Uses Qwen2.5-VL as a base model. The NaViT encoder makes it easy to have many images of different shapes.

They use a SFT warm-start, as the VLMs struggled to output good grounding coordinates. They constructed two-turn samples for this.

July 9, 2025 at 3:24 PM

Jim RB

@jbohnslav.bsky.social

DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction
arxiv: arxiv.org/abs/2507.02948
code: github.com/hzy138/Drive...

DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction

Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a ma...

arxiv.org

July 8, 2025 at 2:03 PM

Jim RB

@jbohnslav.bsky.social

Using automatically generated risk category labels and the front-facing view, they have GPT4o caption the scenarios. The metrics are based on caption similarity + classification metrics on riskiness-type.

July 8, 2025 at 2:03 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news