Yinglun Zhu
@yinglunzhu.bsky.social
Assistant Prof @ UC Riverside. Research on Efficient ML, RL, and LLMs. CS PhD @ UW Madison.

yinglunz.com
For more details, please check out our

Blog: yinglunz.com/blogs/ttm.html
Paper: arxiv.org/pdf/2510.07632
Code: github.com/yinglunz/tes...

Joint work with Jiancheng Zhang and Fuzhi Tang. Feedback and thoughts are very welcome!
October 31, 2025 at 6:03 PM
Two takeaways:

1. Eval lies at the heart of AI progress.
2. Iterative, matching-based self-improvement works -- and should be explored beyond compositional reasoning!
October 31, 2025 at 6:03 PM
TTM can also be extended to datasets without local groups -- by treating the entire dataset as a global assignment problem between all images and captions (solved in polynomial time).

The global TTM variant achieves up to 33.3% relative error reduction.
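
A minimal sketch of just the matching subroutine, assuming an (N, N) matrix of model image-caption similarity scores; SciPy's `linear_sum_assignment` solves the assignment in polynomial time. This is only the matching step, not the full TTM pipeline.

```python
# Sketch of the global matching step only (not the full TTM pipeline).
# Assumes `sim` is an (N, N) matrix of image-caption similarity scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_match(sim: np.ndarray) -> np.ndarray:
    """For each image i, return the caption index assigned under the
    total-similarity-maximizing one-to-one matching."""
    rows, cols = linear_sum_assignment(-sim)  # solver minimizes cost, so negate
    assignment = np.empty(sim.shape[0], dtype=int)
    assignment[rows] = cols
    return assignment

sim = np.random.rand(4, 4)   # stand-in for real similarity scores
print(global_match(sim))     # e.g. [2 0 3 1]
```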
October 31, 2025 at 6:03 PM
TTM isn’t limited to benchmarks with k-by-k groups.

For 1-by-k groups, GroupMatch = GroupScore, so the metric change brings no benefit. Yet TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.
October 31, 2025 at 6:03 PM
TTM provides substantial improvements on top of SimpleMatch, without external supervision.

Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.

Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
October 31, 2025 at 6:03 PM
To push further, we develop Test-Time Matching (TTM), an iterative, self-improving algorithm with two key components:

(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold schedule to gradually expand coverage across the test set.
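
A hedged sketch of the loop; `best_group_match` and `finetune` are hypothetical helpers, and the threshold schedule is a placeholder rather than the paper's exact recipe.

```python
# Illustrative TTM-style loop; `best_group_match` and `finetune` are hypothetical
# helpers, and the threshold values are placeholders, not the paper's schedule.
def test_time_matching(model, groups, thresholds=(0.9, 0.7, 0.5, 0.3)):
    for tau in thresholds:                    # progressively decaying threshold
        pseudo_labeled = []
        for group in groups:
            matching, confidence = model.best_group_match(group)  # (i) GroupMatch
            if confidence >= tau:             # keep only confident groups...
                pseudo_labeled.append((group, matching))
        model.finetune(pseudo_labeled)        # ...and self-train on them
    return model                              # coverage grows as tau decays
```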
October 31, 2025 at 6:03 PM
SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.
October 31, 2025 at 6:03 PM
Because a correct GroupMatch also guarantees a perfect GroupScore, this creates an arbitrage opportunity via a two-step SimpleMatch procedure:

1. Select the most likely matching under GroupMatch.
2. Overfit to that matching at test time.
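
For small k, step 1 can be done by brute force over all k! caption orderings. A minimal sketch, assuming `scores[i, j]` is the model's score for image i paired with caption j:

```python
# Step 1 sketch: pick the most likely matching under GroupMatch by scoring every
# permutation of captions (k is small, e.g. 2, so brute force is cheap).
from itertools import permutations
import numpy as np

def group_match(scores: np.ndarray):
    k = scores.shape[0]
    return max(permutations(range(k)),
               key=lambda perm: sum(scores[i, perm[i]] for i in range(k)))

# Step 2 (not shown): fine-tune the model on the selected (image, caption) pairs.
print(group_match(np.array([[0.9, 0.4], [0.2, 0.7]])))  # -> (0, 1)
```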
October 31, 2025 at 6:03 PM
We introduce a new GroupMatch metric that evaluates the best overall matching instead of isolated pairwise comparisons.

This increases the random-guessing success rate to 1/k! (from 1/6 to 1/2 when k = 2).
October 31, 2025 at 6:03 PM
The widely used GroupScore metric requires a correct one-to-one alignment between the k images and k captions, yet it evaluates each pairwise comparison in isolation without enforcing consistency -- a single collision means failure.

Under random guessing, the success rate is (k-1)! / (2k-1)! → only 1/6 when k = 2.
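
As a sanity check on both formulas, a quick Monte Carlo under i.i.d. random scores for k = 2 recovers roughly 1/6 for GroupScore and 1/2 for GroupMatch:

```python
# Monte Carlo check of the k = 2 random-guessing rates under i.i.d. scores:
# GroupScore needs all four pairwise comparisons right (~1/6), while GroupMatch
# only needs the correct matching to have the larger total score (~1/2).
import numpy as np

s = np.random.default_rng(0).random((200_000, 2, 2))  # s[:, i, j]: image i, caption j
group_score = ((s[:, 0, 0] > s[:, 0, 1]) & (s[:, 1, 1] > s[:, 1, 0]) &  # each image picks its caption
               (s[:, 0, 0] > s[:, 1, 0]) & (s[:, 1, 1] > s[:, 0, 1])    # each caption picks its image
              ).mean()
group_match = (s[:, 0, 0] + s[:, 1, 1] > s[:, 0, 1] + s[:, 1, 0]).mean()
print(group_score, group_match)  # ~0.167, ~0.500
```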
October 31, 2025 at 6:03 PM
Multimodal models, even frontier ones, have long been reported to perform at or below random guessing on compositional reasoning benchmarks.

Why does this happen?

We find that part of the difficulty lies in the evaluation metric itself.
October 31, 2025 at 6:03 PM
Paper: yinglunz.com/pdfs/dtrl.pdf

Joint work with my student Junkai Luo.
Feedback welcome! 🙌
October 14, 2025 at 7:04 PM
Our algorithm achieves SOTA performance across multiple benchmarks.

We hope these ideas also inspire improvements to GRPO for LLMs—especially in credit assignment.
October 14, 2025 at 7:04 PM
💡 Building on this insight, we adapt GRPO to online finetuning of DTs, introducing:

• Sub-trajectory optimization → better credit assignment
• Sequence-level likelihood objectives (concurrent w/ GSPO) → stability & efficiency
• Active sampling → improved exploration in uncertain regions
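
A hedged sketch of the flavor of objective the first two bullets describe: group-normalized advantages with a sequence-level, length-averaged importance ratio per sub-trajectory. Shapes, names, and constants are illustrative assumptions, not the paper's implementation.

```python
# Illustrative GRPO-style loss over G sampled sub-trajectories; the sequence-level
# ratio is the length-averaged log-prob difference (cf. GSPO), which is more stable
# than a product of per-token ratios. Not the paper's exact objective.
import torch

def subtraj_grpo_loss(logp_new, logp_old, returns, clip_eps=0.2):
    # logp_new, logp_old: (G, L) per-token log-probs (old policy has no grad)
    # returns:            (G,)   return of each sub-trajectory
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)       # group-normalized
    ratio = torch.exp((logp_new - logp_old).mean(dim=-1))           # sequence-level ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```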
October 14, 2025 at 7:04 PM
🔍 We identify hindsight return relabeling as the key obstacle: while useful for supervised objectives, it destabilizes importance weights for RL methods like PPO and GRPO.
October 14, 2025 at 7:04 PM
Paper: arxiv.org/pdf/2510.03247
Joint work with my student Jiancheng Zhang.
Feedback welcome!

3/3
October 10, 2025 at 6:04 PM
Our algorithm combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming settings. It delivers consistent gains over baselines across multiple benchmarks, including COCO and DataComp.
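
For intuition only (a generic sketch, not the algorithm in the paper): one way to combine an uncertainty score with a diversity bonus in a single linear-time pass that works the same way on a pool or a stream.

```python
# Generic sketch (not the paper's method): a single pass that scores each candidate
# by uncertainty plus a diversity bonus measured against a running mean of the
# embeddings selected so far, keeping acquisition linear-time and streaming-friendly.
import numpy as np

def select_stream(embeddings, uncertainties, budget, tau=1.0, alpha=1.0):
    """Single pass over candidates; keeps at most `budget` of them."""
    selected, centroid = [], None
    for i, (z, u) in enumerate(zip(embeddings, uncertainties)):
        diversity = 1.0 if centroid is None else float(np.linalg.norm(z - centroid))
        if len(selected) < budget and u + alpha * diversity >= tau:
            selected.append(i)
            # running mean of selected embeddings keeps the update O(1) per item
            centroid = z.copy() if centroid is None else centroid + (z - centroid) / len(selected)
    return selected
```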

2/3
October 10, 2025 at 6:03 PM
We hope this work inspires more research on adaptive, efficient deployment of LLMs—where compute is used strategically rather than blindly.

Joint work with my student Bowen Zuo 🙌
Feedback welcome!
July 1, 2025 at 6:45 PM
Most methods allocate compute uniformly, ignoring variation in query difficulty.

We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically—just enough for easy queries and more for hard ones.

📊 Example (avg. budget = 32):

(2/3)
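
For intuition, a hedged sketch of difficulty-adaptive sampling (the probe size, cap, and agreement rule are illustrative, not the paper's algorithm): probe each query with a few samples, stop early when they agree, and spend more compute only where they don't.

```python
# Illustrative two-stage allocation (assumed constants, not the paper's method):
# probe each query with a few samples; if they mostly agree, treat the query as
# easy and stop; otherwise spend more of the budget on it, up to a per-query cap.
from collections import Counter

def adaptive_answers(query, sample_fn, probe=8, cap=128, agree=0.75):
    """sample_fn(query, n) -> list of n candidate answers (e.g. LLM samples)."""
    answers = sample_fn(query, probe)
    top_share = Counter(answers).most_common(1)[0][1] / len(answers)
    if top_share < agree:                 # hard query: buy more samples
        answers += sample_fn(query, cap - probe)
    return answers
```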
July 1, 2025 at 6:45 PM