Blog: yinglunz.com/blogs/ttm.html
Paper: arxiv.org/pdf/2510.07632
Code: github.com/yinglunz/tes...
Joint work with Jiancheng Zhang and Fuzhi Tang. Feedback and thoughts are very welcome!
1. Eval lies at the heart of AI progress.
2. Iterative, matching-based self-improvement works -- and should be explored beyond compositional reasoning!
The global TTM variant achieves up to 33.3% relative error reduction.
For 1-by-k groups, GroupMatch = GroupScore, so the metric change alone brings no benefit. Yet TTM still delivers substantial improvements -- up to 85.7% -- on datasets such as SugarCrepe and WhatsUp.
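To see why the two metrics coincide in the 1-by-k case, here is a minimal sketch (illustrative function names, not the released code): with a single image, the highest-scoring assignment of image to captions is simply the argmax caption, which is exactly what GroupScore checks.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def group_score_1xk(scores, correct=0):
    # GroupScore: the correct caption must outscore every other caption.
    return int(np.argmax(scores) == correct)

def group_match_1xk(scores, correct=0):
    # GroupMatch: take the best bijective matching between images and captions.
    # With a single image, the assignment just picks the argmax caption,
    # so it reduces to GroupScore.
    _, col = linear_sum_assignment(-scores.reshape(1, -1))
    return int(col[0] == correct)

scores = np.array([0.31, 0.28, 0.12, 0.05])  # toy scores: 1 image, 4 captions
assert group_score_1xk(scores) == group_match_1xk(scores)
```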
Remarkably, TTM enables SigLIP-B16 (~ 0.2B params) to surpass GPT-4.1 on MMVP-VLM.
Shout out to the awesome authors behind SigLIP! @giffmana.ai @xzhai.bsky.social @kolesnikov.ch and Basil Mustafa
(i) GroupMatch-based pseudo-labels for stronger supervision.
(ii) A progressively decaying selection threshold schedule to gradually expand coverage across the test set (sketch below).
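A minimal sketch of how (i) and (ii) might fit together, assuming a hypothetical `model.score` / `model.fit` interface, a mean-score confidence proxy, and a simple linear threshold decay (the released implementation may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_matching(score_matrix):
    # Most likely caption-image matching under GroupMatch (max-score assignment).
    rows, cols = linear_sum_assignment(-score_matrix)
    confidence = score_matrix[rows, cols].mean()   # illustrative confidence proxy
    return cols, confidence

def ttm_round(groups, model, tau):
    # (i) GroupMatch-based pseudo-labels: keep the predicted matching of every
    #     group whose confidence clears the current threshold tau.
    pseudo_labels = []
    for g in groups:
        scores = model.score(g)                    # k x k image-caption scores
        matching, conf = best_matching(scores)
        if conf >= tau:
            pseudo_labels.append((g, matching))
    model.fit(pseudo_labels)                       # fine-tuning loop omitted here
    return model

def ttm(groups, model, tau_start=0.9, tau_end=0.0, rounds=10):
    # (ii) Progressively decaying threshold: coverage expands each round.
    for tau in np.linspace(tau_start, tau_end, rounds):
        model = ttm_round(groups, model, tau)
    return model
```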
1. Select the most likely matching under GroupMatch.
2. Overfit to that matching at test time (see the sketch below).
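One way to read step 2 is as a group-local contrastive objective with the selected matching as the target. A minimal PyTorch-style sketch under that reading (the actual training recipe in the paper may differ; temperature scaling is omitted):

```python
import torch
import torch.nn.functional as F

def overfit_step(score_matrix, matching, optimizer):
    """One gradient step that 'overfits' a group to its selected matching.

    score_matrix: k x k tensor of image-caption similarities from the model's
                  forward pass (so gradients reach the model parameters).
    matching:     length-k long tensor, matching[i] = caption assigned to image i.
    optimizer:    optimizer over the model parameters.
    """
    # Cross-entropy in both directions, using the selected matching as the target.
    loss_i2t = F.cross_entropy(score_matrix, matching)
    loss_t2i = F.cross_entropy(score_matrix.t(), torch.argsort(matching))
    loss = 0.5 * (loss_i2t + loss_t2i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```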
This increases the random-guessing success rate to 1/k! (from 1/6 to 1/2 when k = 2).
Under random guessing, the success rate is (k-1)! / (2k-1)! → only 1/6 when k = 2.
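Both numbers are easy to sanity-check by simulation. A quick Monte Carlo sketch for k = 2 with i.i.d. random scores (the GroupScore check follows the natural k-by-k generalization of Winoground's group score; this is not code from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
k, trials = 2, 100_000
group_score_hits = group_match_hits = 0

for _ in range(trials):
    s = rng.random((k, k))              # s[i, j]: score of image i with caption j
    # GroupScore: each correct pair (i, i) must beat every entry in its row and column.
    gs = all(s[i, i] == max(s[i, :].max(), s[:, i].max()) for i in range(k))
    # GroupMatch: the highest-scoring bijective matching must be the identity.
    _, cols = linear_sum_assignment(-s)
    gm = np.array_equal(cols, np.arange(k))
    group_score_hits += gs
    group_match_hits += gm

print(group_score_hits / trials)        # ~1/6 for k = 2
print(group_match_hits / trials)        # ~1/2 for k = 2
```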
Why does this happen?
We find that part of the difficulty lies in the evaluation metric itself.
We hope these ideas also inspire improvements to GRPO for LLMs—especially in credit assignment.
• Sub-trajectory optimization → better credit assignment
• Sequence-level likelihood objectives (concurrent w/ GSPO) → stability & efficiency (see the sketch after this list)
• Active sampling → improved exploration in uncertain regions
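On the sequence-level bullet specifically, a tiny illustration of the distinction (toy tensors only; not GRPO or GSPO as implemented in any library): GRPO-style updates weight each token by its own importance ratio, while a GSPO-style objective uses one length-normalized ratio per sequence.

```python
import torch

def token_level_ratios(logp_new, logp_old):
    # GRPO-style: one importance ratio per generated token.
    return torch.exp(logp_new - logp_old)

def sequence_level_ratio(logp_new, logp_old):
    # GSPO-style: a single length-normalized ratio for the whole sequence,
    # i.e. the geometric mean of the per-token ratios.
    return torch.exp((logp_new - logp_old).mean())

logp_old = torch.tensor([-1.2, -0.8, -2.1, -0.5])   # toy per-token log-probs
logp_new = torch.tensor([-1.0, -0.9, -1.8, -0.6])
print(token_level_ratios(logp_new, logp_old))        # one ratio per token
print(sequence_level_ratio(logp_new, logp_old))      # one ratio per sequence
```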
design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming settings. It delivers consistent gains over baselines across multiple benchmarks, including COCO and DataComp.
2/3
Joint work with my student Bowen Zuo 🙌
Feedback welcome!
We propose adaptive algorithms that estimate query difficulty on the fly and allocate compute strategically—just enough for easy queries and more for hard ones.
📊 Example (avg. budget = 32):
(2/3)
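For a concrete (and very simplified) picture of that allocation idea, here is a sketch assuming a hypothetical `sample_fn` that draws one model answer per call; the disagreement-based difficulty proxy and the greedy loop are illustrative, not the paper's algorithm.

```python
from collections import Counter

def adaptive_allocate(queries, sample_fn, avg_budget=32, probe=4):
    """Two-stage sketch: probe every query, then spend the remaining budget
    on the queries whose early samples disagree the most."""
    total_budget = avg_budget * len(queries)
    answers = {q: [sample_fn(q) for _ in range(probe)] for q in queries}
    spent = probe * len(queries)

    def difficulty(q):
        # Difficulty proxy: disagreement among the samples drawn so far
        # (1 - frequency of the modal answer).
        counts = Counter(answers[q])
        return 1.0 - counts.most_common(1)[0][1] / len(answers[q])

    while spent < total_budget:
        q = max(queries, key=difficulty)     # hardest-looking query gets the next sample
        if difficulty(q) == 0.0:             # every query looks easy: stop early
            break
        answers[q].append(sample_fn(q))
        spent += 1

    # Final prediction per query: majority vote over its samples.
    return {q: Counter(answers[q]).most_common(1)[0][0] for q in queries}
```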