Toby Ord
@tobyord.bsky.social
Senior Researcher at Oxford University.
Author — The Precipice: Existential Risk and the Future of Humanity.
tobyord.com
And here are the relative boosts.
Overall the inference scaling produced 82%, 63%, and 92% of the total performance gains on the different benchmarks.
12/
October 3, 2025 at 7:38 PM
As you can see, most of the boost comes from the inference scaling that the RL training has enabled.
The same is true for the other benchmarks I examined. Here are the raw scatterplots:
11/
October 3, 2025 at 7:37 PM
We can draw the trend on the chart, then divide the performance boost in two:
• the RL boost taking the base model to the trend line
• the inference-scaling boost taking it to the top of the trend
10/
October 3, 2025 at 7:37 PM
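The split described above can be sketched in a few lines. This is my own illustration with hypothetical scores, not the thread's data: the RL boost is the gap from the base model to the trend line at the base model's token budget, and the inference-scaling boost is the gap from there to the top of the trend.

```python
def boost_shares(base, trend_at_base_tokens, top_of_trend):
    """Split the total performance gain (base model -> reasoning model)
    into an RL share and an inference-scaling share.

    base                 -- base model's score
    trend_at_base_tokens -- reasoning model's trend line evaluated at the
                            base model's token budget
    top_of_trend         -- reasoning model's score at the top of the trend
    """
    total = top_of_trend - base
    rl_share = (trend_at_base_tokens - base) / total
    inference_share = (top_of_trend - trend_at_base_tokens) / total
    return rl_share, inference_share

# Hypothetical scores (fractions of problems solved), purely illustrative:
rl_share, inf_share = boost_shares(base=0.40,
                                   trend_at_base_tokens=0.50,
                                   top_of_trend=0.90)
print(f"RL: {rl_share:.0%}, inference scaling: {inf_share:.0%}")
# -> RL: 20%, inference scaling: 80%
```

With these made-up numbers, inference scaling accounts for 80% of the total gain; the thread's reported figures (82%, 63%, 92%) come from applying this kind of decomposition to the real benchmark data.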
I worked out a nice clean way to separate this out. Here is data from the MATH level 5 benchmark, showing performance vs token-use for a base model (Sonnet 3.6 – orange square) and its reasoning model (Sonnet 3.7 – red circles).
8/
October 3, 2025 at 7:35 PM
This gives us a new way of evaluating infinite sums and integrals that refines the standard method, such that two sums which were standardly equal to +∞ can now have different positive infinite values. And the values they get are very intuitive:
6/
September 25, 2025 at 3:32 PM
We just need to generalise the standard finite sum and integral from 𝑎 up to 𝑏 such that 𝑎 and 𝑏 can be hyperreal numbers, and then use this alternative kind of summation/integration to evaluate infinite sums and integrals (taking them up to the hyperreal number ω).
5/
September 25, 2025 at 3:31 PM
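As a toy illustration of the idea (my own example, not taken from the thread, assuming the generalised sum obeys the usual finite closed forms when transferred to a hyperreal bound): the identities $\sum_{n=1}^{b} 1 = b$ and $\sum_{n=1}^{b} n = \tfrac{b(b+1)}{2}$ carry over to $b = \omega$, giving

$$
\sum_{n=1}^{\omega} 1 = \omega,
\qquad
\sum_{n=1}^{\omega} n = \frac{\omega(\omega+1)}{2},
$$

so both sums take positive infinite values, but the second exceeds the first by an infinite factor, whereas the standard method assigns both the single value $+\infty$.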
After a small launch event at Balliol College, Oxford, we were shocked to see the scale of media attention. People were interested!
August 15, 2025 at 11:04 AM
In early 2009 I met Will MacAskill, who immediately got it, and was key to making it actually happen. We worked day and night through the summer and officially launched Giving What We Can in November that year — with 23 initial members who had taken the 10% Pledge:
August 15, 2025 at 11:04 AM
I still have my journal entry from way back in February 2006 when I first publicly discussed founding a society based around a pledge to give:
August 15, 2025 at 11:00 AM
I'm overjoyed to see that more than 10,000 people have joined me in pledging 10% of their lifetime income to help others as effectively as they can. We're each able to do so much — and so much more together.
🧵 @givingwhatwecan.bsky.social
August 15, 2025 at 10:59 AM
And how many years of delay there would be between agents completing tasks of a given length at a 50% success rate and completing them at a higher success rate:
14/
May 7, 2025 at 4:00 PM
Moreover, it allows us to estimate how short a task would need to be to reach higher success rates (where T_50 is the time horizon at a 50% success rate, and so on):
13/
May 7, 2025 at 4:00 PM
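Under the exponential model discussed in the thread, these shorter horizons follow directly from the 50% horizon. A minimal sketch, assuming the success rate decays exponentially in task length (a constant-hazard model, with the function name and the 60-minute figure my own):

```python
import math

def horizon(t50: float, p: float) -> float:
    """Task length an agent can complete at success rate p, assuming
    success probability decays exponentially with task length and
    equals 50% at length t50 (i.e. p(t) = 0.5 ** (t / t50))."""
    return t50 * math.log(p) / math.log(0.5)

# Hypothetical 50% time horizon of 60 minutes:
print(horizon(60, 0.8))   # ~19.3 minutes for an 80% success rate
print(horizon(60, 0.99))  # under a minute for a 99% success rate
```

The key consequence: each step up in required reliability shrinks the achievable task length by a fixed multiplicative factor, independent of the value of T_50 itself.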
And that is what we see. The exponential (dotted blue line) fits METR’s empirical success rates for different models (coloured bars) about as well as the more complex log-logistic distribution METR used (black), despite having fewer parameters and being less arbitrary.
10/
May 7, 2025 at 3:58 PM
METR recently released an intriguing report showing that on a suite of tasks related to doing AI research, the length of tasks that frontier AI agents can complete has been doubling every 7 months.
2/
May 7, 2025 at 3:52 PM
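The 7-month doubling trend is easy to extrapolate. A one-line sketch (the function name and example numbers are my own; this is pure extrapolation of the reported trend, not a prediction):

```python
def task_horizon(months_from_now: float, current_horizon: float,
                 doubling_months: float = 7.0) -> float:
    """Extrapolate the completable task-length horizon, assuming it keeps
    doubling every `doubling_months` months."""
    return current_horizon * 2 ** (months_from_now / doubling_months)

# E.g. a 1-hour horizon today would imply a 4-hour horizon in 14 months:
print(task_horizon(14, 1.0))  # -> 4.0
```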
I tried searching for that exact query, and one of the top results is this page from the Hindustan Times that explains the puzzle and its solution.
6/
April 23, 2025 at 2:27 PM
But a little bit before that point, you can see that it simply searched the internet for the solution:
5/
April 23, 2025 at 2:27 PM
The first part of its reasoning starts off treating it like a maths puzzle, but you can isolate the point in the transcript where it starts to entertain the lateral solution. It makes it sound as if it has just had a new idea:
4/
April 23, 2025 at 2:26 PM
After a lot of internal reasoning, it ends up presenting the user with this correct answer:
3/
April 23, 2025 at 2:26 PM
The puzzle is drawn on a piece of paper and o3 correctly works out what is drawn there and what is being asked:
2/
April 23, 2025 at 2:25 PM
Most importantly: in my earlier thread, I noted that the o3 models looked like they had broken the existing trend of the o1 models. But on the updated chart, o3 is barely above the o1 trend (i.e. still showing only logarithmic returns, just with a slightly better constant):
April 1, 2025 at 2:09 PM
The ARC-AGI team found that the o3 price estimate was only a tenth of what OpenAI were charging for the inferior o1-pro model, so they updated the price estimate to use the price of o1-pro until the actual o3 is released and its true price is known.
3/n
April 1, 2025 at 2:06 PM
Here is the revised ARC-AGI plot. They've increased their cost estimate for the original o3 (low) from $20 per task to $200 per task. Presumably o3 (high) has similarly gone from $3,000 to $30,000 per task, which is why it breaks their $10,000-per-task limit and is no longer included.
2/n
April 1, 2025 at 2:03 PM
And this was directly approved by their CEO Mark Zuckerberg:
March 24, 2025 at 10:48 AM
They knew this was illegal and were worried about being arrested if they did it:
March 24, 2025 at 10:44 AM
It is stunning to see how Meta illegally downloaded billions of pages of copyrighted books and articles from Russian pirate sites when training Llama 3. And not only that, but Meta also directly redistributed that copyrighted data to others:
March 24, 2025 at 10:40 AM