Deqing Fu
@deqing.bsky.social
CS PhD Student @USC.
deqingfu.github.io
Finally, the token-level annotations produced by the TLDR model can help human annotators fix image captions that are slightly off. In fact, they speed up human annotation by 3 times!
February 8, 2025 at 5:29 AM
Next, something interesting: after training the TLDR model, one can simply remove the reward model head and re-attach the original language model head, turning it back into a vision-language model. It's shown that these new models become better than the originals.
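The head swap above can be sketched in a few lines. This is a minimal numpy sketch, not the paper's implementation: the shared backbone is stood in for by a matrix of hidden states, and `w_lm` / `w_reward` are hypothetical head weights chosen only to show that the two modes differ solely in which head sits on top of the same backbone.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, hidden_dim, vocab_size = 5, 8, 16

# Stand-in for per-token hidden states from the shared VLM backbone
# (illustrative shapes, not from the paper).
hidden = rng.standard_normal((seq_len, hidden_dim))

w_lm = rng.standard_normal((hidden_dim, vocab_size))  # original LM head
w_reward = rng.standard_normal((hidden_dim, 1))       # TLDR reward head

# Reward-model mode: reward head on top of the backbone.
reward_logits = hidden @ w_reward   # (seq_len, 1)

# Re-attach the original LM head: the same backbone is a
# generative vision-language model again.
lm_logits = hidden @ w_lm           # (seq_len, vocab_size)
```

Because TLDR training only ever updates the backbone and the detachable head, re-attaching the LM head costs nothing extra at inference time.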
February 8, 2025 at 5:29 AM
TLDR is useful in several ways. First, it can serve as a hallucination-rate evaluation metric. As shown in the table, GPT-4o is still the best vision-language model at the token level, while open-weight models such as Llama-3.2-90B are catching up at the sentence and response levels.
February 8, 2025 at 5:29 AM
TLDR is trained on synthetic hard negatives generated via a perturbation-based method. The architecture is very simple: instead of applying the reward model head only to the last token, as most reward models do, TLDR applies it to every token.
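The per-token scoring idea can be sketched as follows. This is a minimal numpy illustration under assumed shapes (the real model is a full VLM; `w_reward` and the sigmoid scoring here are illustrative, not the paper's exact head):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_dim = 5, 8

# Hypothetical per-token hidden states from the VLM backbone.
hidden = rng.standard_normal((seq_len, hidden_dim))

# Reward head: a linear map to a scalar score per token.
w_reward = rng.standard_normal((hidden_dim, 1))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Conventional RM: score only the final token's hidden state.
last_token_reward = sigmoid(hidden[-1] @ w_reward)   # scalar-ish, shape (1,)

# TLDR-style RM: apply the same head at EVERY token,
# yielding one fine-grained score per text token.
token_rewards = sigmoid(hidden @ w_reward)           # (seq_len, 1)
```

The only architectural change is where the head is applied, which is what makes the token-level annotations essentially free once the head is trained.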
February 8, 2025 at 5:29 AM
Excited to share that my internship work at Meta GenAI has been accepted to @iclr-conf.bsky.social #ICLR2025

Introducing TLDR: Token-Level Detective Reward Model For Large Vision Language Models.

TLDR provides fine-grained annotations to each text token.

🔗arXiv: arxiv.org/abs/2410.04734
February 8, 2025 at 5:29 AM