Florian Dorner
flodorner.bsky.social
PhD student in CS @ ETHZ / MPI-IS

Theory of ML evaluation https://flodorner.github.io/
Does anyone have background on this plot, compared to the 32% performance for o3-mini-high with tool use claimed by OpenAI in January? #GPT5 #GPT-5

openai.com/index/introd...
openai.com/index/openai...
August 8, 2025 at 9:28 AM
In two hours, Ricardo is giving a talk about our paper on training on the test task, and its confounding impacts on LLM benchmarking 📉📈. (Session 1B) arxiv.org/abs/2407.07890
April 24, 2025 at 1:36 AM
Starting to believe @natolambert.bsky.social's take that the o1 plots are misleading [1] (in the sense that OpenAI cannot fully control test compute at inference time). In particular, it seems like scaling up test compute might require extensive retraining.

[1] www.interconnects.ai/p/openais-o1...
January 21, 2025 at 10:57 AM
I meant that Figure 2 in the R1 report looks like the left o1 plot if you squint hard enough (and consider that the x-axis is linear rather than logarithmic).
January 20, 2025 at 3:55 PM