Florian Dorner
@flodorner.bsky.social
PhD student in CS @ ETHZ / MPI-IS

Theory of ML evaluation https://flodorner.github.io/
Also, from time to time, the wrong proofs it suggests for more complicated things seem to contain non-trivial insights and are "fixable".
October 25, 2025 at 3:41 PM
Not much of a step up compared to the o1/o3 "thinking" versions of GPT-4. But quite a big step compared to base GPT-4. It still makes a lot of mistakes, but often produces correct proofs for simple lemmata (not so much for more complicated statements).
October 25, 2025 at 3:38 PM
Assuming all problems are actually solvable...
October 17, 2025 at 9:58 PM
Is that not trivially true, since LLMs assign nonzero probability to any possible string?
October 17, 2025 at 9:58 PM
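A minimal sketch of the claim above, assuming a toy vocabulary and made-up logits rather than a real model: because a softmax is strictly positive for every token, the autoregressive factorization assigns nonzero probability to every finite string.

```python
import numpy as np

# Toy autoregressive "LM" over a tiny vocabulary; the logits are made up
# purely for illustration and do not come from a real model.
vocab = ["a", "b", "c", "<eos>"]

def next_token_probs(prefix):
    # A real model would condition on the prefix; here the logits are fixed.
    logits = np.array([2.0, 0.5, -1.0, 0.0])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax output is strictly positive for every token

def string_prob(tokens):
    # Autoregressive factorization: p(s) = prod_t p(s_t | s_<t).
    p = 1.0
    for t, tok in enumerate(tokens):
        p *= next_token_probs(tokens[:t])[vocab.index(tok)]
    return p

# Even an "unlikely" string gets probability strictly greater than zero.
print(string_prob(["c", "c", "c", "<eos>"]))  # tiny, but > 0
```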
Do you have a list of the best ones? I vaguely recall reading things in this direction, but cannot really remember specific titles.
September 21, 2025 at 8:11 PM
The focus on evaluating checkpoints during a training run rather than different trained models is super interesting!
September 17, 2025 at 5:16 AM
Interesting work! Can you comment a bit on what you do differently compared to previous IRT-based LLM evaluation methods?

We recently did some work confirming IRT's efficacy for in-distribution models, but also found it to be quite brittle when it comes to novel models arxiv.org/abs/2506.07673
How Benchmark Prediction from Fewer Data Misses the Mark
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM ev...
arxiv.org
September 17, 2025 at 5:11 AM
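A minimal sketch of what an IRT-based approach to benchmark prediction can look like, assuming a 1PL/Rasch model and synthetic correctness data; this is illustrative only and not the method of either paper referenced above: item difficulties are fit on existing models, a new model's ability is estimated from a small item subsample, and its full-benchmark score is predicted from that.

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic correctness matrix: rows = models, columns = benchmark items.
n_models, n_items = 30, 200
true_ability = rng.normal(0, 1, n_models)
true_difficulty = rng.normal(0, 1, n_items)
Y = rng.binomial(1, expit(true_ability[:, None] - true_difficulty[None, :]))

def neg_loglik(params):
    # 1PL (Rasch): P(correct) = sigmoid(ability_m - difficulty_i).
    ability, difficulty = params[:n_models], params[n_models:]
    p = np.clip(expit(ability[:, None] - difficulty[None, :]), 1e-9, 1 - 1e-9)
    return -(Y * np.log(p) + (1 - Y) * np.log(1 - p)).sum()

fit = minimize(neg_loglik, np.zeros(n_models + n_items), method="L-BFGS-B")
difficulty_hat = fit.x[n_models:]

# "New" model: we only observe its answers on a small random subset of items.
subset = rng.choice(n_items, size=20, replace=False)
y_new = rng.binomial(1, expit(0.7 - true_difficulty[subset]))

def neg_loglik_new(theta):
    p = np.clip(expit(theta - difficulty_hat[subset]), 1e-9, 1 - 1e-9)
    return -(y_new * np.log(p) + (1 - y_new) * np.log(1 - p)).sum()

theta_hat = minimize(neg_loglik_new, np.zeros(1), method="L-BFGS-B").x[0]
# Predicted full-benchmark accuracy from 20 items instead of 200.
print(expit(theta_hat - difficulty_hat).mean())
```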
I guess, in terms of the notation from Section 4 of the paper: does this plot the Type X risk, or the Type X Error Feasibility rate?
September 14, 2025 at 2:52 PM
, at least for large n. So I am trying to understand whether the asymptotics kick in a lot slower than I would have thought, or whether I am missing something else about the setup.
September 14, 2025 at 2:44 PM
Thank you! Do I understand correctly that these results are independent of/orthogonal to the success hacking ones? I guess my confusion stems from asymptotic theory for PPI (and by extension seemingly for DSL) suggesting that both type 1 and type 2 errors should be lower/at most very similar
September 14, 2025 at 2:44 PM
Are the reported errors for the case of selecting the model with the most significant results, post-hoc?
September 12, 2025 at 7:18 PM
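For context on the concern behind this question, a quick simulation sketch with made-up numbers: if several null comparisons are run and only the most significant one is reported post-hoc, the false positive rate at a nominal 5% level inflates to roughly 1 - 0.95^k for k independent tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n, reps, alpha = 10, 100, 2000, 0.05

false_positives = 0
for _ in range(reps):
    # k candidate "models", all with zero true effect versus the baseline.
    pvals = [stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue for _ in range(k)]
    # Post-hoc selection: report only the most significant comparison.
    false_positives += min(pvals) < alpha

print(false_positives / reps)   # empirical rate, roughly 0.4
print(1 - (1 - alpha) ** k)     # analytic rate for independent tests, about 0.40
```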
Interesting work! Can you comment a bit more on the setup for the regression correction methods? As far as I understand, PPI++ (which should be quite similar to DSL) relatively reliably reduces variance compared to ground truth only, while remaining quite close to unbiased.
September 12, 2025 at 7:18 PM
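A minimal sketch of a prediction-powered correction with a power-tuned coefficient in the spirit of PPI++, on synthetic data (not the DSL setup or any dataset from the paper): proxy predictions on a large unlabeled set are combined with a small labeled set via a debiasing term, and the tuned lambda keeps the estimate close to unbiased with asymptotic variance at most that of using the labels alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: Y is the ground-truth metric, f is a cheap but biased proxy
# (e.g. an LLM judge). Small labeled set, large unlabeled set.
n, N = 200, 20000
def draw(size):
    y = rng.binomial(1, 0.6, size).astype(float)
    f = np.where(rng.random(size) < 0.8, y, 1 - y) * 0.9 + 0.05  # biased proxy
    return y, f

y_lab, f_lab = draw(n)
_, f_unlab = draw(N)

# Power-tuning coefficient (PPI++-style): lam = Cov(f, Y) / Var(f) on labeled data.
# With this lam, the asymptotic variance is at most that of the labels-only mean.
lam = np.cov(f_lab, y_lab)[0, 1] / np.var(f_lab, ddof=1)

# Prediction-powered estimate of E[Y]: scaled proxy mean plus a debiasing term.
theta_pp = lam * f_unlab.mean() + (y_lab - lam * f_lab).mean()

print("classical (labels only):", y_lab.mean())
print("prediction-powered     :", theta_pp)
```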
Super interesting field, but worth keeping in mind that this usually only buys you a relatively small fraction of "extra ground truth labels" (this does not cover active sampling strategies, but I have not seen them yielding much larger improvements in practice, either) arxiv.org/abs/2410.13341
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
arxiv.org
July 23, 2025 at 1:28 PM
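As a back-of-the-envelope sketch of the "fraction of extra ground truth labels" framing (standard PPI-style asymptotics with a large unlabeled pool; the symbols are generic and this is not the exact argument of the linked paper): if the judge's scores correlate with the ground-truth labels with correlation rho, the corrected estimator behaves like a labels-only estimator on n / (1 - rho^2) labels, so even matching "twice the data" already requires rho^2 >= 1/2.

```latex
% Control-variate / PPI-style asymptotics with a large unlabeled pool;
% hypothetical symbols, just to make the "effective labels" framing concrete.
\[
\operatorname{Var}\big(\hat{\theta}_{\mathrm{PP}}\big)
\;\approx\; \frac{\sigma_Y^2\,(1-\rho^2)}{n},
\qquad
n_{\mathrm{eff}} \;=\; \frac{n}{1-\rho^2},
\qquad
n_{\mathrm{eff}} \ge 2n
\;\Longleftrightarrow\;
\rho^2 \ge \tfrac{1}{2},
\]
% where $\rho$ is the correlation between the judge's score and the ground-truth
% label: large gains require a judge that is already very close to ground truth.
```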
Do you have a source re: attendance requirement? 👀
July 17, 2025 at 5:28 PM
Not sure this can ethically be done retroactively (due to participant consent). But given that 20% of data is shared with model providers, privacy concerns with instead sharing this data publicly in the future seem surmountable.
May 10, 2025 at 8:59 AM
Is this just the prompts, or do model providers get information about whether or not they won (and the competing response)?
April 30, 2025 at 2:56 PM
Shout out to my colleagues Ricardo Dominguez-Olmedo, Vivian Nastl and Moritz Hardt! If you’d like to chat at the conference, send me a message, or visit us at one of the poster sessions!
April 24, 2025 at 1:36 AM
Tomorrow, I will speak about our work on the limitations of LLM-as-a-Judge 🤖 when applied to evaluating frontier models. (Session 3D)
arxiv.org/abs/2410.13341
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
arxiv.org
April 24, 2025 at 1:36 AM