Lightnews — Scholar-powered news

Florian Dorner

@flodorner.bsky.social

77 followers 280 following 38 posts

PhD student in CS @ ETHZ / MPI-IS

Theory of ML evaluation https://flodorner.github.io/

Pinned

Florian Dorner @flodorner.bsky.social · Dec 5

Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We’ll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time-scaling:

December 5, 2025 at 8:57 AM

Reposted by Florian Dorner

Yatong Chen

@yatongchen.bsky.social

I'll be @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!)

tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it?

1/n

December 1, 2025 at 8:31 PM

Reposted by Florian Dorner

Tübingen AI Center

@tuebingen-ai.bsky.social

Congratulations also to Vivian Nastl (supervised by Moritz Hardt) and Ricardo Dominguez-Olmedo (Moritz Hardt and Bernhard Schölkopf) for winning 2025 Global Google PhD fellowships.
Find out more about their work here: is.mpg.de/en/news/vivi...

@maxplanckcampus.bsky.social @unituebingen.bsky.social

Vivian Nastl and Ricardo Dominguez-Olmedo receive 2025 Google Ph.D. Fellowship

Program supports exceptional graduate students working on innovative research in computer science and related fields

is.mpg.de

October 24, 2025 at 9:33 AM

Reposted by Florian Dorner

Michael Saxon

@saxon.me

The viral "Definition of AGI" paper tells you to read fake references which do not exist!

Proof: different articles present at the specified journal/volume/page number, and their titles exist nowhere on any searchable repository.

Take this as a warning to not use LMs to generate your references!

October 18, 2025 at 12:54 AM

Reposted by Florian Dorner

Yatong Chen

@yatongchen.bsky.social

We (w/ Moritz Hardt, Olawale Salaudeen and @joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen!

📢 Call for Posters: rb.gy/kyid4f
📅 Deadline: Oct 10, 2025 (AoE)
🔗 More info: rebrand.ly/bg931sf

September 22, 2025 at 1:45 PM

Reposted by Florian Dorner

Millicent Li

@millicentli.bsky.social

Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/8

September 17, 2025 at 7:19 PM

Florian Dorner

@flodorner.bsky.social

Does anyone have background on this plot, compared to the 32% performance for o3-mini-high with tool use claimed by OpenAI in January? #GPT5 #GPT-5

openai.com/index/introd...
openai.com/index/openai...

August 8, 2025 at 9:28 AM

Florian Dorner

@flodorner.bsky.social

New blogpost by my colleague Ricardo, arguing that instead of limiting data collection from big labs, LMArena should publicly release all data for everyone. ricardodominguez.github.io/blogs/arena....

How to Fix the Chatbot Arena? Release All Data

ricardodominguez.github.io

May 10, 2025 at 8:59 AM

Florian Dorner

@flodorner.bsky.social

In Singapore for #ICLR2025 and excited for two oral presentations on work I have contributed to! 🎉

April 24, 2025 at 1:36 AM

Florian Dorner

@flodorner.bsky.social

Starting to believe @natolambert.bsky.social's take that the o1 plots are misleading [1] (in the sense that OpenAI cannot fully control test compute at inference time). In particular, it seems like scaling up test compute might require extensive retraining.

[1] www.interconnects.ai/p/openais-o1...

January 21, 2025 at 10:57 AM
