Danny Sawyer
@dannypsawyer.bsky.social
AI researcher @GoogleDeepMind. PhD @Caltech. Interested in autonomous exploration and self-improvement, both in humans and embodied AI agents. Views my own.
Thanks to all the authors! @janexwang 13/13
October 10, 2025 at 5:11 PM
In summary, our work provides a deeper understanding of the exploration and adaptation capabilities of frontier models. We show that these skills, while not yet robust, can be elicited.

Read the full paper for all the details!
arxiv.org/abs/2412.06438
#NeurIPS2025 12/13
Can foundation models actively gather information in interactive environments to test hypotheses? (arxiv.org)
This reveals that a major frontier for foundation agents isn't just acting, but reflecting. The ability to improve through adaptive strategies over time is challenging, but not fundamentally out of reach.

Benchmarks like Alchemy are crucial for measuring this progress. 11/13
We took it a step further: strategy adaptation. We silently changed the environment's rules mid-episode.

We found that some models, like Gemini 2.5 and Claude 3.7, could detect the change when aided by summarization and successfully adapt their strategy, recovering performance. 10/13
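As a toy illustration of this setup, here is a sketch of an environment that silently resamples its latent rules partway through an episode. The class, its rule table, and the stone-color encoding are invented for illustration; this is not the benchmark's actual Alchemy implementation.

```python
import random

class ShiftingRules:
    """Toy multi-trial environment whose latent rules are silently
    resampled at a chosen trial (an illustrative stand-in only)."""

    def __init__(self, n_trials: int, shift_at: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.n_trials, self.shift_at = n_trials, shift_at
        self.trial = 0
        self.rules = self._sample_rules()

    def _sample_rules(self) -> dict:
        # Stand-in for Alchemy's latent causal chemistry.
        return {stone: self.rng.choice(["grow", "shrink", "no effect"])
                for stone in ("red", "green", "blue")}

    def reset_trial(self) -> str:
        self.trial += 1
        if self.trial == self.shift_at:
            # The silent change: no signal is given to the agent.
            self.rules = self._sample_rules()
        return f"trial {self.trial} begins"

    def apply(self, stone: str) -> str:
        return self.rules[stone]
```

An agent that keeps checking new outcomes against its summarized beliefs can notice when `apply` starts contradicting them, which is exactly the detect-and-adapt behavior being measured here.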
With the summarization prompt, a latent meta-learning ability emerged. Models now showed significant score improvement across trials.

The act of summarizing forced them to consolidate their knowledge, enabling them to form and execute better strategies in later trials. 9/13
This led to our key insight. We hypothesized the models weren't actively distilling principles from their long action history.

So, we prompted them to write a summary of their findings after each trial. The effect was dramatic. 8/13
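For concreteness, here is a minimal sketch of what such a per-trial summarization loop could look like. The `env` interface and the `ask` model wrapper are illustrative assumptions, not the paper's actual evaluation harness.

```python
from typing import Callable, List

# `ask` wraps a chat-model call (prompt -> completion); `env` is a
# hypothetical multi-trial environment. Both are stand-ins.
Ask = Callable[[str], str]

def run_episode(env, ask: Ask, n_trials: int) -> List[float]:
    summary = ""                        # knowledge carried across trials
    scores: List[float] = []
    for _ in range(n_trials):
        obs, done, log = env.reset_trial(), False, []
        while not done:
            action = ask(
                f"What you know so far:\n{summary}\n"
                f"Current observation: {obs}\n"
                f"Trial history: {log}\n"
                "Choose your next action."
            )
            obs, reward, done = env.step(action)
            log.append((action, obs, reward))
        scores.append(env.trial_score())
        # The key intervention: after each trial, force the model to
        # distill its raw action history into an explicit summary.
        summary = ask(
            f"Previous summary:\n{summary}\n"
            f"This trial's transcript:\n{log}\n"
            "Update your summary of what the environment's rules seem to be."
        )
    return scores
```

The point of the design is that the summary, not the raw transcript, becomes the carrier of cross-trial knowledge.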
But in the complex Alchemy environment, performance faltered. Without guidance, even the most powerful models showed no significant improvement across trials.

They gathered data but failed to integrate it into a better strategy. Meta-learning did not occur naturally. 7/13
In the simple Feature World tasks, most models performed near-optimally. They are highly efficient at gathering information when the goal is straightforward.

This shows the challenge isn't basic, single-turn reasoning. They can select informative actions in the moment. 6/13
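To make "selecting informative actions" concrete, here is a toy version of that choice: pick the probe that minimizes the expected remaining uncertainty over which feature drives reward. The object encoding and feature names are invented for illustration, not taken from Feature World itself.

```python
import math

def best_probe(objects: list, hypotheses: list) -> dict:
    """Pick the object whose outcome minimizes the expected posterior
    entropy over which candidate feature is the rewarding one
    (uniform prior over `hypotheses`)."""
    def expected_entropy(obj: dict) -> float:
        groups: dict = {}
        for h in hypotheses:
            # Outcome this probe would produce if hypothesis h were true.
            groups.setdefault(obj[h], []).append(h)
        n = len(hypotheses)
        return sum(len(g) / n * math.log2(len(g)) for g in groups.values())
    return min(objects, key=expected_entropy)

# Probing the second object separates the two hypotheses in one step.
objects = [{"red": 1, "round": 1}, {"red": 1, "round": 0}]
print(best_probe(objects, ["red", "round"]))   # -> {'red': 1, 'round': 0}
```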
2️⃣ Alchemy: A multi-trial environment that requires agents to deduce latent causal rules and improve their strategy over time. The rules are randomly sampled each episode but stay the same across trials.

This isolates different facets of exploration than Feature World does. 5/13
We evaluated models in two environments:
1️⃣ Feature World (both text-based and 3D in Construction Lab): A stateless setting to test raw information-gathering efficiency. 4/13
These failure patterns offer interesting insights into how foundation models function, and point toward ways to unlock these core embodied exploration abilities. 3/13
We benchmarked variants of GPT, Claude, and Gemini on exploration in several embodied environments. Surprisingly, although most models did well on stateless, single-turn tasks, many had critical limitations in adaptation and meta-learning in stateful, multi-turn tasks. 2/13