Gabe Grand
@gabegrand.bsky.social
PhD student @csail.mit.edu 🤖 & 🧠
Paper + code + interactive demos: gabegrand.github.io/battleship ⚓️🎯
October 27, 2025 at 7:17 PM
Special shoutout to @valeriopepe.bsky.social (co-first author), who is super talented and currently on the PhD job market!
Thanks to Valerio Pepe, Josh Tenenbaum, and Jacob Andreas for long-horizon collaboration and planning: this line of Battleship work has been *2 years* in the making!
Bottom line: The future of AI-driven discovery isn't just bigger models—it's smarter inference. By combining LMs with rational planning strategies, we can build agents that ask better questions, make better decisions, and collaborate effectively with humans.
Why does this matter? Discovery-driven AI (scientific experiments, theorem proving, drug discovery) requires hitting needles in combinatorially vast haystacks. If we want agents that explore rationally, we need to go beyond prompting.
Key takeaway: Current LMs aren’t rational information seekers: they struggle to ground answers in context, generate informative queries, and balance exploration vs. exploitation. But Bayesian inference at test time can dramatically close these gaps—efficiently.
Does this generalize? YES. We replicated on "Guess Who?" from TextArena and saw similar gains: GPT-4o (61.7% → 90.0%), Llama-4-Scout (30.0% → 72.4%). The framework works across information-seeking domains with combinatorial hypothesis spaces.
Deciding when to explore vs. act is also key. Skilled players (humans + GPT-5) spread their questions out over the course of the game. Weak LMs spam all 15 questions upfront. The key isn't asking MORE; it's asking BETTER questions at the RIGHT time. Quality > quantity.
Here's the kicker: asking high-EIG questions alone doesn't guarantee wins. Weaker models struggle to convert information into good moves. Bayes-M—which explicitly marginalizes over beliefs—is crucial for translating questions into action.
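Roughly, here's a minimal sketch of the Bayes-M idea (illustrative, not our exact implementation), assuming posterior samples of the hidden board are available as sets of occupied (row, col) cells:

```python
# Illustrative Bayes-M-style move selection: marginalize over sampled hypothesis
# boards to estimate per-cell hit probabilities, then fire at the best untried cell.
# `sample_boards` is assumed to be a list of posterior samples, each a set of
# occupied (row, col) cells consistent with the Captain's observations so far.
from collections import Counter

def bayes_m_move(sample_boards, tried_cells):
    counts = Counter(
        cell
        for board in sample_boards   # one hypothesis board per posterior sample
        for cell in board            # cells occupied under that hypothesis
        if cell not in tried_cells
    )
    best_cell, hits = counts.most_common(1)[0]
    # P(hit at best_cell) ≈ fraction of sampled boards with a ship at that cell
    return best_cell, hits / len(sample_boards)
```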
Our approach leverages inference scaling to enable models to ask more informative questions. Bayes-Q boosts EIG by up to 0.227 bits (94.2% of the theoretical ceiling) and virtually eliminates redundant questions (18.5% → 0.2% for Llama-4-Scout).
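Concretely, for a yes/no question the EIG reduces to the entropy of its answer under the current posterior. A minimal sketch of Bayes-Q-style re-ranking (illustrative; `answer_fn`, which simulates the Spotter on a sampled board, is an assumption of the example):

```python
# Illustrative EIG-based question selection over posterior samples.
# `answer_fn(question, board)` is assumed to return True/False for a candidate
# question evaluated against one sampled hypothesis board.
import math

def expected_info_gain(question, sample_boards, answer_fn):
    """EIG of a noiseless yes/no question = entropy of its answer distribution."""
    p_yes = sum(answer_fn(question, b) for b in sample_boards) / len(sample_boards)
    if p_yes in (0.0, 1.0):          # answer already determined: zero information
        return 0.0
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

def bayes_q(candidate_questions, sample_boards, answer_fn):
    # The LM proposes candidates; re-rank them by EIG (at most 1 bit for yes/no).
    return max(candidate_questions,
               key=lambda q: expected_info_gain(q, sample_boards, answer_fn))
```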
With this Bayesian scaffolding, both GPT-4o and Llama-4-Scout beat GPT-5 in head-to-head comparisons while costing 2.8x and 99.7x less, respectively.
With all three Bayesian components (+Bayes-QMD), Llama-4-Scout jumps from near-random guessing (0.367 F1) to super-human level (0.764 F1). GPT-4o sees similar gains (0.450 → 0.782 F1). The deltas are really striking.
We developed three Bayesian strategies inspired by Bayesian Experimental Design (BED):
❓ Question (Bayes-Q): Optimizes expected info gain (EIG)
🎯 Move (Bayes-M): Maximizes hit probability
⚖️ Decision (Bayes-D): Decides when to ask vs. shoot using one-step lookahead (toy sketch below)
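To make Bayes-D concrete, here's a toy version of the one-step lookahead (a simplification, not our exact decision rule): ask only if the expected best shot after hearing the answer beats the best shot available right now by more than some cost of spending a question.

```python
# Toy one-step lookahead for the ask-vs-shoot decision (a simplification of the
# idea; `ask_cost` is an illustrative knob standing in for the question budget).

def best_hit_prob(samples, tried_cells):
    """Max over untried cells of the estimated probability of a hit."""
    counts = {}
    for board in samples:
        for cell in board:
            if cell not in tried_cells:
                counts[cell] = counts.get(cell, 0) + 1
    return max(counts.values()) / len(samples) if counts else 0.0

def should_ask(question, samples, tried_cells, answer_fn, ask_cost=0.05):
    yes = [b for b in samples if answer_fn(question, b)]
    no = [b for b in samples if not answer_fn(question, b)]
    p_yes = len(yes) / len(samples)
    # Expected value of shooting *after* the answer, marginalizing over Yes/No
    value_ask = (p_yes * best_hit_prob(yes, tried_cells)
                 + (1 - p_yes) * best_hit_prob(no, tried_cells))
    value_shoot_now = best_hit_prob(samples, tried_cells)
    return value_ask - value_shoot_now > ask_cost
```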
In our second set of experiments, we turned to the challenge of building rational question-asking agents to play the Captain role.
We find that having models write Python functions to answer questions boosts accuracy by +14.7 percentage points (absolute), and complements chain-of-thought (CoT) reasoning.
One useful trick to improve answering accuracy is to use code generation. Code grounds reasoning in executable logic, not just vibes.
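For example, instead of answering in free text, the Spotter model can emit a small program that computes the answer from the true board (a toy example; the board encoding and question here are illustrative, not drawn from BattleshipQA):

```python
# Toy illustration of code-as-answer: the model writes a function that computes
# the answer from the full board, and we execute it to get a grounded Yes/No.
# The board encoding is an illustrative stand-in ('W' = water, letters = ships).
BOARD = [
    ["W", "R", "W", "W"],
    ["W", "R", "W", "W"],
    ["W", "R", "B", "B"],
    ["W", "W", "W", "W"],
]

# Question: "Is the red ship vertical?"  ->  model-written answer program:
def answer(board):
    red = [(r, c) for r, row in enumerate(board)
           for c, val in enumerate(row) if val == "R"]
    return len({c for _, c in red}) == 1   # all red cells share one column

print("Yes" if answer(BOARD) else "No")    # executed answer: Yes
```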
Many LMs really struggle with questions that require grounding answers in the board and dialogue context. GPT-4o drops from 72.8% → 60.4% accuracy on context-dependent questions. Llama-4-Scout: 68.0% → 54.0%. Humans? Basically flat (92.8% vs 91.9%).
Overall, humans are really reliable at answering questions on BattleshipQA (92.5% accuracy). In contrast, LM accuracy ranges widely—from near-random (52.5%, GPT-4o-mini) to human-level (92.8%, o3-mini). But there's a catch…
In our first experiment, we looked at QA accuracy in the Spotter role, an important sanity check of how well players (humans & agents) can understand and reason about the game state.
To understand how people strategize & collaborate, we ran a two-player synchronous human study (N=42) and collected full action trajectories and chat dialogues. Our “BattleshipQA” dataset provides a rich, multimodal benchmark for comparing human and agent behavior.
We created “Collaborative Battleship”—a two-player game where a Captain (who only sees a partial board) must balance asking questions vs. taking shots, while a Spotter (who sees everything) can only answer Yes/No. It's deceptively simple but cognitively demanding.
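To make the setup concrete, here's a runnable toy version of the two roles (board, scripted Captain, and budget are illustrative stand-ins, not the actual experimental setup):

```python
# Toy sketch of the Captain/Spotter interaction: the Spotter sees the full board
# but may only answer Yes/No; the Captain sees nothing and must trade off asking
# vs. shooting. The scripted Captain below is just for illustration.
SIZE = 4
TRUE_SHIP = {(0, 1), (1, 1), (2, 1)}     # hidden ship: only the Spotter sees this

def spotter(question, board):
    """The Spotter answers any question truthfully, but only with Yes/No."""
    return question(board)               # question = predicate over the board

def captain_turns():
    """Scripted Captain: ask one question, then shoot based on the answer."""
    yield ("ask", lambda board: any(c == 1 for _, c in board))  # "Ship in column 1?"
    for r in range(SIZE):                # a "Yes" makes column 1 worth sweeping
        yield ("shoot", (r, 1))

def play():
    hits = set()
    for kind, payload in captain_turns():
        if kind == "ask":
            print("Q: ship in column 1? ->",
                  "Yes" if spotter(payload, TRUE_SHIP) else "No")
        else:
            hit = payload in TRUE_SHIP
            hits |= {payload} if hit else set()
            print("Shot", payload, "->", "hit" if hit else "miss")
            if hits == TRUE_SHIP:
                print("All ships sunk.")
                break

play()
```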
But LMs are trained to *answer* queries, not *ask* them. Can they learn to explore intelligently?
Many high-stakes AI applications require asking data-driven questions—think scientific discovery, medical diagnosis, or drug development.