Florian Dorner
@flodorner.bsky.social
PhD student in CS @ ETHZ / MPI-IS

Theory of ML evaluation https://flodorner.github.io/
Also, from time to time, the wrong proofs it suggests for more complicated things seem to contain non-trivial insights and are "fixable".
October 25, 2025 at 3:41 PM
Not much of a step up compared to the o1/o3 "thinking" versions of GPT-4. But quite a big step compared to base GPT-4. It still makes a lot of mistakes, but often produces correct proofs for simple lemmata (not so much for more complicated statements).
October 25, 2025 at 3:38 PM
Assuming all problems are actually solvable...
October 17, 2025 at 9:58 PM
Is that not trivially true, since LLMs assign nonzero probability to any possible string?
October 17, 2025 at 9:58 PM
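A minimal sketch of the claim above, assuming a toy vocabulary and made-up logits rather than a real model: because a softmax is strictly positive for every token, the autoregressive factorization assigns nonzero probability to every finite string.

```python
import numpy as np

# Toy autoregressive "LM" over a tiny vocabulary; the logits are made up
# purely for illustration and do not come from a real model.
vocab = ["a", "b", "c", "<eos>"]

def next_token_probs(prefix):
    # A real model would condition on the prefix; here the logits are fixed.
    logits = np.array([2.0, 0.5, -1.0, 0.0])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax output is strictly positive for every token

def string_prob(tokens):
    # Autoregressive factorization: p(s) = prod_t p(s_t | s_<t).
    p = 1.0
    for t, tok in enumerate(tokens):
        p *= next_token_probs(tokens[:t])[vocab.index(tok)]
    return p

# Even an "unlikely" string gets probability strictly greater than zero.
print(string_prob(["c", "c", "c", "<eos>"]))  # tiny, but > 0
```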
Do you have a list of the best ones? I vaguely recall reading things in this direction, but cannot really remember specific titles.
September 21, 2025 at 8:11 PM
The focus on evaluating checkpoints during a training run rather than different trained models is super interesting!
September 17, 2025 at 5:16 AM
Interesting work! Can you comment a bit on what you do differently compared to previous IRT-based LLM evaluation methods?

We recently did some work confirming IRT's efficacy for in-distribution models, but also found it to be quite brittle when it comes to novel models arxiv.org/abs/2506.07673
How Benchmark Prediction from Fewer Data Misses the Mark
Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM ev...
arxiv.org
September 17, 2025 at 5:11 AM
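A minimal sketch of what an IRT-based approach to benchmark prediction can look like, assuming a 1PL/Rasch model and synthetic correctness data; this is illustrative only and not the method of either paper referenced above: item difficulties are fit on existing models, a new model's ability is estimated from a small item subsample, and its full-benchmark score is predicted from that.

```python
import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic correctness matrix: rows = models, columns = benchmark items.
n_models, n_items = 30, 200
true_ability = rng.normal(0, 1, n_models)
true_difficulty = rng.normal(0, 1, n_items)
Y = rng.binomial(1, expit(true_ability[:, None] - true_difficulty[None, :]))

def neg_loglik(params):
    # 1PL (Rasch): P(correct) = sigmoid(ability_m - difficulty_i).
    ability, difficulty = params[:n_models], params[n_models:]
    p = np.clip(expit(ability[:, None] - difficulty[None, :]), 1e-9, 1 - 1e-9)
    return -(Y * np.log(p) + (1 - Y) * np.log(1 - p)).sum()

fit = minimize(neg_loglik, np.zeros(n_models + n_items), method="L-BFGS-B")
difficulty_hat = fit.x[n_models:]

# "New" model: we only observe its answers on a small random subset of items.
subset = rng.choice(n_items, size=20, replace=False)
y_new = rng.binomial(1, expit(0.7 - true_difficulty[subset]))

def neg_loglik_new(theta):
    p = np.clip(expit(theta - difficulty_hat[subset]), 1e-9, 1 - 1e-9)
    return -(y_new * np.log(p) + (1 - y_new) * np.log(1 - p)).sum()

theta_hat = minimize(neg_loglik_new, np.zeros(1), method="L-BFGS-B").x[0]
# Predicted full-benchmark accuracy from 20 items instead of 200.
print(expit(theta_hat - difficulty_hat).mean())
```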
I guess, in terms of the notation from Section 4 of the paper: does this plot the Type X risk, or the Type X Error Feasibility rate?
September 14, 2025 at 2:52 PM
, at least for large n. So I am trying to understand whether the asymptotics kick in a lot slower than I would have thought, or whether I am missing something else about the setup.
September 14, 2025 at 2:44 PM
Thank you! Do I understand correctly that these results are independent of/orthogonal to the success hacking ones? I guess my confusion stems from asymptotic theory for PPI (and by extension seemingly for DSL) suggesting that both type 1 and type 2 errors should be lower/at most very similar
September 14, 2025 at 2:44 PM
Are the reported errors for the case of selecting the model with the most significant results, post-hoc?
September 12, 2025 at 7:18 PM
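For context on the concern behind this question, a quick simulation sketch with made-up numbers: if several null comparisons are run and only the most significant one is reported post-hoc, the false positive rate at a nominal 5% level inflates to roughly 1 - 0.95^k for k independent tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n, reps, alpha = 10, 100, 2000, 0.05

false_positives = 0
for _ in range(reps):
    # k candidate "models", all with zero true effect versus the baseline.
    pvals = [stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue for _ in range(k)]
    # Post-hoc selection: report only the most significant comparison.
    false_positives += min(pvals) < alpha

print(false_positives / reps)   # empirical rate, roughly 0.4
print(1 - (1 - alpha) ** k)     # analytic rate for independent tests, about 0.40
```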
Interesting work! Can you comment a bit more on the setup for the regression correction methods? As far as I understand, PPI++ (which should be quite similar to DSL) relatively reliably reduces variance compared to ground truth only, while remaining quite close to unbiased.
September 12, 2025 at 7:18 PM
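A minimal sketch of a prediction-powered correction with a power-tuned coefficient in the spirit of PPI++, on synthetic data (not the DSL setup or any dataset from the paper): proxy predictions on a large unlabeled set are combined with a small labeled set via a debiasing term, and the tuned lambda keeps the estimate close to unbiased with asymptotic variance at most that of using the labels alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: Y is the ground-truth metric, f is a cheap but biased proxy
# (e.g. an LLM judge). Small labeled set, large unlabeled set.
n, N = 200, 20000
def draw(size):
    y = rng.binomial(1, 0.6, size).astype(float)
    f = np.where(rng.random(size) < 0.8, y, 1 - y) * 0.9 + 0.05  # biased proxy
    return y, f

y_lab, f_lab = draw(n)
_, f_unlab = draw(N)

# Power-tuning coefficient (PPI++-style): lam = Cov(f, Y) / Var(f) on labeled data.
# With this lam, the asymptotic variance is at most that of the labels-only mean.
lam = np.cov(f_lab, y_lab)[0, 1] / np.var(f_lab, ddof=1)

# Prediction-powered estimate of E[Y]: scaled proxy mean plus a debiasing term.
theta_pp = lam * f_unlab.mean() + (y_lab - lam * f_lab).mean()

print("classical (labels only):", y_lab.mean())
print("prediction-powered     :", theta_pp)
```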
Super interesting field, but worth keeping in mind that this usually only buys you a relatively small fraction of "extra ground truth labels" (this does not cover active sampling strategies, but I have not seen them yielding much larger improvements in practice, either) arxiv.org/abs/2410.13341
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
arxiv.org
July 23, 2025 at 1:28 PM
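As a back-of-the-envelope sketch of the "fraction of extra ground truth labels" framing (standard PPI-style asymptotics with a large unlabeled pool; the symbols are generic and this is not the exact argument of the linked paper): if the judge's scores correlate with the ground-truth labels with correlation rho, the corrected estimator behaves like a labels-only estimator on n / (1 - rho^2) labels, so even matching "twice the data" already requires rho^2 >= 1/2.

```latex
% Control-variate / PPI-style asymptotics with a large unlabeled pool;
% hypothetical symbols, just to make the "effective labels" framing concrete.
\[
\operatorname{Var}\big(\hat{\theta}_{\mathrm{PP}}\big)
\;\approx\; \frac{\sigma_Y^2\,(1-\rho^2)}{n},
\qquad
n_{\mathrm{eff}} \;=\; \frac{n}{1-\rho^2},
\qquad
n_{\mathrm{eff}} \ge 2n
\;\Longleftrightarrow\;
\rho^2 \ge \tfrac{1}{2},
\]
% where $\rho$ is the correlation between the judge's score and the ground-truth
% label: large gains require a judge that is already very close to ground truth.
```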
Do you have a source re: attendance requirement? 👀
July 17, 2025 at 5:28 PM
Not sure this can ethically be done retroactively (due to participant consent). But given that 20% of data is shared with model providers, privacy concerns with instead sharing this data publicly in the future seem surmountable.
May 10, 2025 at 8:59 AM
Is this just the prompts, or do model providers get information about whether or not they won (and the competing response)?
April 30, 2025 at 2:56 PM
Shout out to my colleagues Ricardo Dominguez-Olmedo, Vivian Nastl and Moritz Hardt! If you’d like to chat at the conference, send me a message, or visit us at one of the poster sessions!
April 24, 2025 at 1:36 AM
Tomorrow, I will speak about our work on the limitations of LLM-as-a-Judge 🤖 when applied to evaluating frontier models. (Session 3D)
arxiv.org/abs/2410.13341
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an importan...
arxiv.org
April 24, 2025 at 1:36 AM