Nicholas Runcie
nicholasruncie.bsky.social
AI for drug design | Oxford DPhil student | MChem | RSci | opinions my own
📝 Read the preprint “Assessing the Chemical Intelligence of Large Language Models” 👉 arxiv.org/abs/2505.07735
May 13, 2025 at 7:03 PM
🪜 We believe these results signify a step change in the chemistry capabilities of LLMs. While current models are far from perfect, they may already, or may soon, be able to solve a variety of chemistry problems that were previously considered intractable. (7/🧵)
However, these models are still far from perfect.

🔢 Counting atoms is tricky for an LLM, much like counting the “r”s in “strawberry”.
🌀 Models struggled when molecules were given in a non-canonical format.
📉 Performance was better for molecules with fewer atoms.

(6/🧵)
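To illustrate why atoms and characters don't line up one-to-one in a SMILES string, here is a minimal sketch (an illustration only, not the method used in the preprint). The helper name `count_heavy_atoms` is hypothetical, and it assumes only organic-subset atoms (B, C, N, O, P, S, F, Cl, Br, I and their aromatic lowercase forms) plus bracketed atoms such as `[nH]`.

```python
import re

# Hypothetical helper, for illustration only: counts heavy (non-hydrogen) atoms
# in a SMILES string. Assumes organic-subset atoms plus bracket atoms like [nH].
# Two-letter symbols (Br, Cl) are listed first so they are read as one atom.
# Bonds, branches and ring-closure digits are not matched, so they are ignored.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

# Bracketed hydrogens such as [H], [2H] or [H+] are not heavy atoms.
# Elements like [He] or [Hg] do not match because a lowercase letter follows H.
HYDROGEN_BRACKET = re.compile(r"\[\d*H[^a-z\]]*\]")


def count_heavy_atoms(smiles: str) -> int:
    count = 0
    for token in ATOM_PATTERN.findall(smiles):
        if HYDROGEN_BRACKET.fullmatch(token):
            continue  # skip explicit hydrogen atoms
        count += 1
    return count


for s in ["ClCCl", "CCO", "OCC", "C(O)C", "CC(=O)Oc1ccccc1C(=O)O"]:
    print(s, len(s), count_heavy_atoms(s))
```

For dichloromethane, `ClCCl` has 5 characters but 3 heavy atoms, and `CCO`, `OCC` and `C(O)C` are three valid SMILES strings for the same molecule (ethanol), each with 3 heavy atoms.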
ChemIQ more closely resembles real-world challenges, with tasks including:

📈 SAR (structure-activity relationship) analysis
⚗️ Reaction prediction
🚀 NMR structure elucidation

For example, o3-mini-high determined the correct structure from 74% of NMR spectra for molecules with up to 10 atoms, and even solved one with 21 atoms. (5/🧵)
💥 SMILES-to-IUPAC conversion has been notoriously challenging: all previous LLMs scored ~0% on this task. In our evaluation, o3-mini-high named ~30% of structures correctly. (4/🧵)
💡 Our results suggest that reasoning models can now “think” about the structure of molecules, rather than discussing them superficially, and that this is what enables the step change in capabilities. (3/🧵)
🧪 We built ChemIQ as a benchmark for molecular comprehension and reasoning. o3-mini-high answered 59% of questions correctly, a substantial improvement on GPT-4o’s 7%. Increasing the amount of reasoning also consistently improved performance across all questions. (2/🧵)