#FrontierMath
New math benchmark from math.science-bench.ai:

209 research-level mathematics problems from Combinatorics, Algebra, Geometry, Number Theory, and other areas.

👉 math.science-bench.ai/benchmarks/

#AI #Mathematics #AIBenchmark #EpochAI #FrontierMath #OpenAI #Gemini #Grok
November 1, 2025 at 2:18 PM
If you ran GPT-5 infinitely many times on FrontierMath—our extremely challenging math benchmark—would it eventually solve every problem?

Probably not. From what we can tell, it caps out below 50%.

What about throwing in *every* available model? Infinitely many times? 🧵
October 17, 2025 at 4:56 PM
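[Editorial aside: to make that "caps out" claim concrete, the standard pass@k estimator from Chen et al. (2021) shows why repeated sampling saturates at the fraction of problems a model ever solves. Below is a minimal sketch with hypothetical per-problem tallies, not Epoch's actual numbers.]

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k sampled attempts succeeds, given c observed
    successes in n independent runs."""
    if n - c < k:
        return 1.0  # every size-k subset of the n runs contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (runs, successes) tallies for four problems; the two
# never-solved problems cap the estimated solve rate at 50%.
tallies = [(16, 9), (16, 2), (16, 0), (16, 0)]

for k in (1, 4, 16):
    avg = sum(pass_at_k(n, c, k) for n, c in tallies) / len(tallies)
    print(f"pass@{k} ~= {avg:.2f}")  # 0.17, 0.36, 0.50
```

As k grows, the estimate converges to the share of problems solved at least once, so never-solved problems bound pass@∞ strictly below 100%; that is also why pooling every available model, as in the next post, can raise the ceiling.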
Most of these 57% of problems were solved by multiple models. ChatGPT Agent stands out as having solved the most on its own, likely because it alone has web search. Web search is valid for FrontierMath: the problems are not public, and consulting online resources is allowed.
October 17, 2025 at 4:56 PM
FrontierMath Tier 4 consists of 50 research-level math problems developed by professional mathematicians. These problems can take experts weeks to solve. Below is one of the two public samples. We evaluate on the other 48.
October 10, 2025 at 4:26 PM
We evaluated Gemini 2.5 Deep Think on FrontierMath. There is no API, so we ran it manually. The results: a new record!

We also conducted a more holistic evaluation of its math capabilities. 🧵
October 9, 2025 at 5:32 PM
I haven't yet seen any analysis of whether o3 does well on FrontierMath because of its knowledge or because of its "intelligence". It's hard to answer without knowing the questions, but it sounds like the performance is still largely due to knowledge, and the intelligence is poor compared to humans, especially given the failures on ARC-AGI.
December 23, 2024 at 10:00 AM
It's for questions like these that physicists and mathematicians from around the world are lining up to create a new AI test for math and physics questions that require deep thinking.
FrontierMath is funded by OpenAI, which was never "Open". epoch.ai/blog/openai-...
Clarifying the Creation and Use of the FrontierMath Benchmark
We clarify that OpenAI commissioned Epoch AI to produce 300 math questions for the FrontierMath benchmark. They own these and have access to the statements and solutions, except for a 50-question holdout set.
epoch.ai
February 1, 2025 at 3:11 AM
FrontierMath Was Funded by OpenAI
L: https://www.lesswrong.com/posts/cu2E8wgmbdZbqeWqb/meemi-s-shortform
C: https://news.ycombinator.com/item?id=42755217
January 19, 2025 at 9:39 AM
OpenAI reaches a milestone with o3-mini High: 32% success on FrontierMath! 🎯 The AI solves complex math problems like a pro, using Python for its calculations. A revolution in the world of mathematics! 🧮🤖 #IA #Innovation #Mathématiques https://patb.ca/r/cq0
February 2, 2025 at 6:05 PM
Here's some news~

>Mathematicians describe the shock of OpenAI's o3 model scoring 25.2% on "FrontierMath", a dataset of extremely difficult math problems
- https://gigazine.net/news/20241225-ai-frontiermath/
December 25, 2024 at 12:53 AM
FrontierMath is funded by OpenAI, who have access to problem statements & solutions except for a 53-question holdout set. Our evaluations work is funded by the UK AI Security Institute.

We gratefully acknowledge both @OpenAI and @AISecurityInst.
March 12, 2025 at 4:00 PM
Independent tests show OpenAI's public o3 model scored ~10% on the FrontierMath benchmark. This contrasts with the >25% figure previously cited, which likely came from a more compute-intensive internal version. 🩺💻 #MLSky
OpenAI's o3 AI model scores lower on a benchmark than the company initially implied | TechCrunch
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and model testing practices.
techcrunch.com
April 21, 2025 at 2:30 PM
…yes, it is a sample problem designed for the FrontierMath benchmark I was referencing. The arXiv paper you linked is a description of the benchmark, which includes the sample problem. The remaining problems are secret and not published (for obvious reasons).
December 22, 2024 at 6:11 AM
Agreeing that benchmarks have a "necessary but not sufficient" quality, but read a little about FrontierMath specifically: a set of math problems designed by elite mathematicians to require mathematical creativity.
December 21, 2024 at 7:23 PM
Learn more about the benchmark here: https://epoch.ai/frontiermath
FrontierMath
FrontierMath is a benchmark of hundreds of unpublished and extremely challenging math problems to help us understand the limits of artificial intelligence.
epoch.ai
December 26, 2024 at 3:56 AM

Seems like a big deal: Matthew Barnett from Epoch AI argued on FrontierMath's release that passing it would be a reasonable bar for having reached AGI:
x.com
December 20, 2024 at 7:29 PM
arXiv:2411.04872
FrontierMath is a benchmark of hundreds of original, extremely difficult mathematics problems created and vetted by expert mathematicians. It covers most major areas of modern mathematics, from computationally heavy problems in number theory and real analysis to abstract problems in algebraic geometry and category theory. Solving a typical problem requires...
December 6, 2024 at 12:06 AM
How embarrassing! I have to take back my claim that I submitted or have seen problems from #FrontierMath.

I just received an email from Humanity's Last Exam, a similar database, not restricted to mathematics, and realized that I contributed to that dataset instead!

#math #MathSky #LLM #AI
January 11, 2025 at 7:33 AM
OpenAI funded FrontierMath, forbade Epoch AI from disclosing it before o3 was revealed, and had access to the problem set excluding a separate held-out blind set. They updated their arXiv paper to reflect this after Dec 20th 👇

www.lesswrong.com/posts/cu2E8w...
meemi's Shortform — LessWrong
Comment by Tamay - Tamay from Epoch AI here. We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched.
www.lesswrong.com
January 19, 2025 at 7:26 PM
In mid-May 2025, 30 world-class mathematicians gathered in secret to test themselves against an AI that performs like an "invincible grad student."
The venue was Berkeley. The first problems posed were extremely hard questions in number theory and analysis that supposedly only their authors could solve. Yet o4-mini surveyed the relevant literature in two minutes and calmly produced the correct answer ten minutes later.
The experts who a year earlier had scoffed that "LLMs are bad at calculation" were the most shaken. On the FrontierMath benchmark, the correct-answer rate leapt from 2% to 20%, cracking problems up through Tier 4.
The researchers discussed the importance of creativity education and the risk of "proof by intimidation" as AI heads toward "Tier 5: problems unsolved by humanity."
Inside the Secret Meeting Where Mathematicians Struggled to Outsmart AI
The world's leading mathematicians were stunned by how adept artificial intelligence is at doing their jobs
buff.ly
June 9, 2025 at 8:09 AM
At least one example is published online: epoch.ai/frontiermath...
It's just a textbook-style math problem, but the conditions they place on the problems bar me from claiming my $7,500: my questions ask for facts to be answered by proofs or references rather than by scripts.
June 7, 2025 at 6:38 PM
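[Editorial aside on that "scripts" condition: FrontierMath answers are meant to be definite, automatically verifiable mathematical objects checked by a short program, not prose proofs or citations. A minimal sketch of what script-based checking looks like; the actual grading harness is not public, and verify() and the examples here are hypothetical.]

```python
# Sketch of script-based answer checking in the spirit of FrontierMath's
# requirement that answers be definite, machine-verifiable objects.
# The real grading harness is not public; verify() is hypothetical.
from sympy import Integer, sympify

def verify(submitted: str, ground_truth: Integer) -> bool:
    """Accept iff the submission evaluates to exactly the ground-truth
    object; prose proofs and citations cannot pass this check."""
    try:
        return sympify(submitted) == ground_truth
    except Exception:
        return False

print(verify("2**10 - 24", Integer(1000)))  # True: exact symbolic match
print(verify("about 1000", Integer(1000)))  # False: not a computable object
```

A fact established "by proof or reference" has no canonical machine-checkable form like this, which is what rules those submissions out.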
My view is that the bar is fuzzy, and that we're already into the fuzziness now.
December 20, 2024 at 5:35 PM
We have not yet evaluated o3 due to cost (>10k USD), Gemini 2.5 Pro due to the same bugs in the Gemini API we encountered on FrontierMath (see below), and DeepSeek-R1 due to lack of function calling support.

x.com/tmkadamcz/st...
x.com
May 15, 2025 at 3:47 PM