Yinglun Zhu
@yinglunzhu.bsky.social
Assistant Prof @ UC Riverside. Research on Efficient ML, RL, and LLMs. CS PhD @ UW Madison.

yinglunz.com
For more details, please check out our

Blog: yinglunz.com/blogs/ttm.html
Paper: arxiv.org/pdf/2510.07632
Code: github.com/yinglunz/tes...

Joint work with Jiancheng Zhang and Fuzhi Tang. Feedback and thoughts are very welcome!
October 31, 2025 at 6:03 PM
Two takeaways:

1. Eval lies at the heart of AI progress.
2. Iterative, matching-based self-improvement works -- and should be explored beyond compositional reasoning!
October 31, 2025 at 6:03 PM
TTM can also be extended to datasets without local groups -- by treating the entire dataset as a global assignment problem between all images and captions (solved in polynomial time).

The global TTM variant achieves up to 33.3% relative error reduction.
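
A minimal sketch of just the matching subroutine, assuming an (N, N) matrix of model image-caption similarity scores; SciPy's `linear_sum_assignment` solves the assignment in polynomial time. This is only the matching step, not the full TTM pipeline.

```python
# Sketch of the global matching step only (not the full TTM pipeline).
# Assumes `sim` is an (N, N) matrix of image-caption similarity scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_match(sim: np.ndarray) -> np.ndarray:
    """For each image i, return the caption index assigned under the
    total-similarity-maximizing one-to-one matching."""
    rows, cols = linear_sum_assignment(-sim)  # solver minimizes cost, so negate
    assignment = np.empty(sim.shape[0], dtype=int)
    assignment[rows] = cols
    return assignment

sim = np.random.rand(4, 4)   # stand-in for real similarity scores
print(global_match(sim))     # e.g. [2 0 3 1]
```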
October 31, 2025 at 6:03 PM
TTM isn’t limited to benchmarks with k-by-k groups.

For 1-by-k groups, GroupMatch = GroupScore, so the metric change brings no benefit. Yet TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.
October 31, 2025 at 6:03 PM
TTM provides substantial improvements on top of SimpleMatch, without external supervision.

Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.

Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
October 31, 2025 at 6:03 PM
To push further, we develop Test-Time Matching (TTM), an iterative, self-improving algorithm with two key components:

(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold schedule to gradually expand coverage across the test set.
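
A hedged sketch of the loop; `best_group_match` and `finetune` are hypothetical helpers, and the threshold schedule is a placeholder rather than the paper's exact recipe.

```python
# Illustrative TTM-style loop; `best_group_match` and `finetune` are hypothetical
# helpers, and the threshold values are placeholders, not the paper's schedule.
def test_time_matching(model, groups, thresholds=(0.9, 0.7, 0.5, 0.3)):
    for tau in thresholds:                    # progressively decaying threshold
        pseudo_labeled = []
        for group in groups:
            matching, confidence = model.best_group_match(group)  # (i) GroupMatch
            if confidence >= tau:             # keep only confident groups...
                pseudo_labeled.append((group, matching))
        model.finetune(pseudo_labeled)        # ...and self-train on them
    return model                              # coverage grows as tau decays
```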
October 31, 2025 at 6:03 PM
SimpleMatch reveals substantial hidden capability -- it enables SigLIP-B16 to surpass all prior results and GPT-4.1 to achieve the first result surpassing human performance on Winoground.
October 31, 2025 at 6:03 PM
Because a correct GroupMatch also guarantees a perfect GroupScore, this creates an arbitrage opportunity via a two-step SimpleMatch procedure:

1. Select the most likely matching under GroupMatch.
2. Overfit to that matching at test time.
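
For small k, step 1 can be done by brute force over all k! caption orderings. A minimal sketch, assuming `scores[i, j]` is the model's score for image i paired with caption j:

```python
# Step 1 sketch: pick the most likely matching under GroupMatch by scoring every
# permutation of captions (k is small, e.g. 2, so brute force is cheap).
from itertools import permutations
import numpy as np

def group_match(scores: np.ndarray):
    k = scores.shape[0]
    return max(permutations(range(k)),
               key=lambda perm: sum(scores[i, perm[i]] for i in range(k)))

# Step 2 (not shown): fine-tune the model on the selected (image, caption) pairs.
print(group_match(np.array([[0.9, 0.4], [0.2, 0.7]])))  # -> (0, 1)
```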
October 31, 2025 at 6:03 PM
We introduce a new GroupMatch metric that evaluates the best overall matching instead of isolated pairwise comparisons.

This increases the random-guessing success rate to 1/k! (from 1/6 to 1/2 when k = 2).
October 31, 2025 at 6:03 PM
The widely used GroupScore metric requires a correct one-to-one alignment between the k images and k captions, yet it evaluates each pairwise comparison in isolation without enforcing consistency -- a single collision means failure.

Under random guessing, the success rate is (k-1)! / (2k-1)! → only 1/6 when k = 2.
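
As a sanity check on both formulas, a quick Monte Carlo under i.i.d. random scores for k = 2 recovers roughly 1/6 for GroupScore and 1/2 for GroupMatch:

```python
# Monte Carlo check of the k = 2 random-guessing rates under i.i.d. scores:
# GroupScore needs all four pairwise comparisons right (~1/6), while GroupMatch
# only needs the correct matching to have the larger total score (~1/2).
import numpy as np

s = np.random.default_rng(0).random((200_000, 2, 2))  # s[:, i, j]: image i, caption j
group_score = ((s[:, 0, 0] > s[:, 0, 1]) & (s[:, 1, 1] > s[:, 1, 0]) &  # each image picks its caption
               (s[:, 0, 0] > s[:, 1, 0]) & (s[:, 1, 1] > s[:, 0, 1])    # each caption picks its image
              ).mean()
group_match = (s[:, 0, 0] + s[:, 1, 1] > s[:, 0, 1] + s[:, 1, 0]).mean()
print(group_score, group_match)  # ~0.167, ~0.500
```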
October 31, 2025 at 6:03 PM
Multimodal models, even frontier ones, have long been reported to perform at or below random guessing on compositional reasoning benchmarks.

Why does this happen?

We find that part of the difficulty lies in the evaluation metric itself.
October 31, 2025 at 6:03 PM
Paper: yinglunz.com/pdfs/dtrl.pdf

Joint work with my student Junkai Luo.
Feedback welcome! 🙌
October 14, 2025 at 7:04 PM
Our algorithm achieves SOTA performance across multiple benchmarks.

We hope these ideas also inspire improvements to GRPO for LLMs—especially in credit assignment.
October 14, 2025 at 7:04 PM
💡 Building on this insight, we adapt GRPO to online finetuning of DTs, introducing:

• Sub-trajectory optimization → better credit assignment
• Sequence-level likelihood objectives (concurrent w/ GSPO) → stability & efficiency
• Active sampling → improved exploration in uncertain regions
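
A hedged sketch of the flavor of objective the first two bullets describe: group-normalized advantages with a sequence-level, length-averaged importance ratio per sub-trajectory. Shapes, names, and constants are illustrative assumptions, not the paper's implementation.

```python
# Illustrative GRPO-style loss over G sampled sub-trajectories; the sequence-level
# ratio is the length-averaged log-prob difference (cf. GSPO), which is more stable
# than a product of per-token ratios. Not the paper's exact objective.
import torch

def subtraj_grpo_loss(logp_new, logp_old, returns, clip_eps=0.2):
    # logp_new, logp_old: (G, L) per-token log-probs (old policy has no grad)
    # returns:            (G,)   return of each sub-trajectory
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)       # group-normalized
    ratio = torch.exp((logp_new - logp_old).mean(dim=-1))           # sequence-level ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```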
October 14, 2025 at 7:04 PM
🔍 We identify hindsight return relabeling as the key obstacle: while useful for supervised objectives, it destabilizes importance weights for RL methods like PPO and GRPO.
October 14, 2025 at 7:04 PM
Paper: arxiv.org/pdf/2510.03247
Joint work with my student Jiancheng Zhang.
Feedback welcome!

3/3
October 10, 2025 at 6:04 PM
Our algorithm combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming settings. It delivers consistent gains over baselines across multiple benchmarks, including COCO and DataComp.
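
For intuition only (a generic sketch, not the algorithm in the paper): one way to combine an uncertainty score with a diversity bonus in a single linear-time pass that works the same way on a pool or a stream.

```python
# Generic sketch (not the paper's method): a single pass that scores each candidate
# by uncertainty plus a diversity bonus measured against a running mean of the
# embeddings selected so far, keeping acquisition linear-time and streaming-friendly.
import numpy as np

def select_stream(embeddings, uncertainties, budget, tau=1.0, alpha=1.0):
    """Single pass over candidates; keeps at most `budget` of them."""
    selected, centroid = [], None
    for i, (z, u) in enumerate(zip(embeddings, uncertainties)):
        diversity = 1.0 if centroid is None else float(np.linalg.norm(z - centroid))
        if len(selected) < budget and u + alpha * diversity >= tau:
            selected.append(i)
            # running mean of selected embeddings keeps the update O(1) per item
            centroid = z.copy() if centroid is None else centroid + (z - centroid) / len(selected)
    return selected
```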

2/3
October 10, 2025 at 6:03 PM
We hope this work inspires more research on adaptive, efficient deployment of LLMs—where compute is used strategically rather than blindly.

Joint work with my student Bowen Zuo 🙌
Feedback welcome!
July 1, 2025 at 6:45 PM
Most methods allocate compute uniformly, ignoring variation in query difficulty.

We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically—just enough for easy queries and more for hard ones.

📊 Example (avg. budget = 32):

(2/3)
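
For intuition, a hedged sketch of difficulty-adaptive sampling (the probe size, cap, and agreement rule are illustrative, not the paper's algorithm): probe each query with a few samples, stop early when they agree, and spend more compute only where they don't.

```python
# Illustrative two-stage allocation (assumed constants, not the paper's method):
# probe each query with a few samples; if they mostly agree, treat the query as
# easy and stop; otherwise spend more of the budget on it, up to a per-query cap.
from collections import Counter

def adaptive_answers(query, sample_fn, probe=8, cap=128, agree=0.75):
    """sample_fn(query, n) -> list of n candidate answers (e.g. LLM samples)."""
    answers = sample_fn(query, probe)
    top_share = Counter(answers).most_common(1)[0][1] / len(answers)
    if top_share < agree:                 # hard query: buy more samples
        answers += sample_fn(query, cap - probe)
    return answers
```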
July 1, 2025 at 6:45 PM