Lightnews — Scholar-powered news

Andreas Hochlehnert

@ahochlehnert.bsky.social

180 followers 81 following 17 posts

PhD student in ML at Tübingen AI Center & International Max-Planck Research School for Intelligent Systems

Pinned

Andreas Hochlehnert @ahochlehnert.bsky.social · Apr 10

🧵1/ 🚨 New paper: A Sober Look at Progress in Language Model Reasoning
We re-evaluate recent SFT and RL models for mathematical reasoning and find most gains vanish under rigorous, multi-seed, standardized evaluation.

📊 bethgelab.github.io/sober-reason...
📄 arxiv.org/abs/2504.07086

Andreas Hochlehnert

@ahochlehnert.bsky.social

Presenting A Sober Look at Progress in LM Reasoning at @colmweb.org today 🇨🇦 #COLM2025

📅 Today
🕔 11:00 AM – 1:00 PM
📍 Room 710 - Poster #31

We find that many “reasoning” gains fall within variance and show how to make evaluation reproducible again.
📘 bethgelab.github.io/sober-reasoning

October 8, 2025 at 12:37 PM

Reposted by Andreas Hochlehnert

Andreas Geiger

@andreasgeiger.bsky.social

Excited about this new work from @haoyuhe.bsky.social. TLDR: Diffusion language models treat learning and inference differently, which lowers performance. RL can be used to overcome this issue for certain problems.

haoyuhe.bsky.social @haoyuhe.bsky.social · Aug 20

🚀 Introducing our new paper, MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models.

📄 Paper: www.scholar-inbox.com/papers/He202...
arxiv.org/pdf/2508.13148
💻 Code: github.com/autonomousvi...
🌐 Project Page: cli212.github.io/MDPO/

August 20, 2025 at 8:25 PM

Andreas Hochlehnert

@ahochlehnert.bsky.social

April 10, 2025 at 3:36 PM

Reposted by Andreas Hochlehnert

Prasanna Mayilvahanan

@prasannamayil.bsky.social

New preprint out! 🎉

How does LLM training loss translate to downstream performance?

We show that pretraining data and tokenizer shape loss-to-loss scaling, while architecture and other factors play a surprisingly minor role!
brendel-group.github.io/llm-line/ 🧵1/8

February 18, 2025 at 2:09 PM

Andreas Hochlehnert

@ahochlehnert.bsky.social

CuratedThoughts: Data Curation for RL Datasets 🚀

Since DeepSeek-R1 introduced reasoning-based RL, datasets like Open-R1 & OpenThoughts emerged for fine-tuning & GRPO. Our deep dive found major flaws: 25% of OpenThoughts had to be removed through data curation.

Here's why 👇🧵

February 17, 2025 at 6:22 PM

Reposted by Andreas Hochlehnert

Ofir Press

@ofirpress.bsky.social

SWE-bench Multimodal evaluation code is out now!

SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).

www.swebench.com/sb-cli/

January 17, 2025 at 9:06 AM

Andreas Hochlehnert

@ahochlehnert.bsky.social

We are presenting CiteME today at the 11 AM poster session (East Exhibit Hall A-C, #3309)

CiteME is a challenging benchmark for LM-based agents to find paper citations, moving beyond simple multiple-choice Q&A to real-world use cases.

Come by and say hi :)

citeme.ai

CiteME

CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts.

December 13, 2024 at 4:18 PM