✅ An open-source man-in-the-middle testbed for real web environments
✅ A scalable consumer choice benchmark for agentic decision-making
✅ A dataset of causal effects of ratings, prices, and nudges across 17 LLMs
📦 Code: github.com/PapayaResearch/abxlab
🧵8/9
“What governs its decisions when multiple valid options exist?”
A question behavioral scientists have been asking about humans for decades. ABxLAB is a step toward that science for agents.
🧵7/9
These act like switches: once a preference is declared, it dominates all other attributes.
The takeaway isn’t that agents are biased shoppers, but that this offers a diagnostic window into agent behavior.
🧵6/9
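To make the switch-like behavior concrete, here's a toy sketch of a lexicographic-like rule in Python. This is our illustration of the idea, not code from the paper or repo; the attribute names and priority order are assumptions.

```python
# Toy illustration of a lexicographic-like decision rule (illustrative only,
# not code from the ABxLab paper or repo). Attributes are checked in a fixed
# priority order; the first attribute that separates the options decides the
# choice, and everything ranked below it is ignored.
def lexicographic_choice(options, priority):
    """options: list of dicts; priority: list of (attribute, key_fn), higher key wins."""
    for attr, key_fn in priority:
        best = max(key_fn(o[attr]) for o in options)
        options = [o for o in options if key_fn(o[attr]) == best]
        if len(options) == 1:
            break
    return options[0]

items = [
    {"name": "A", "matches_stated_preference": True,  "rating": 4.2, "price": 29.99},
    {"name": "B", "matches_stated_preference": False, "rating": 4.9, "price": 19.99},
]
# Once a preference is declared it sits at the top of the hierarchy,
# so it overrides the better rating and the lower price.
winner = lexicographic_choice(items, [
    ("matches_stated_preference", bool),
    ("rating", float),
    ("price", lambda p: -p),  # cheaper is better
])
print(winner["name"])  # -> "A"
```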
- Heavily over-weight ratings
- Over-weight cheaper items when ratings are matched
- Are swayed by trivial order effects
- Fall for simple nudges (e.g., “Best seller”)
These are systematic, often large effects.
🧵5/9
Rather, they are strongly biased by these cues. We found agents are often 3-10x+ more susceptible to nudges and superficial attribute differences than our human baseline.
🧵4/9
We systematically manipulated:
💰Prices
⭐️Ratings
🔀Presentation order
👉Classic psychological nudges (authority, social proof, etc.)
🧵3/9
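For a sense of scale, a factorial design over these factors is easy to enumerate. The sketch below is hypothetical: the variable names and levels are illustrative, not ABxLab's actual config.

```python
# Hypothetical condition grid for the manipulated factors above
# (names and levels are illustrative, not ABxLab's actual config schema).
from itertools import product

PRICES  = [19.99, 24.99]                                   # 💰 price of the target option
RATINGS = [4.2, 4.8]                                       # ⭐️ displayed star rating
ORDERS  = ["target_first", "target_last"]                  # 🔀 presentation order
NUDGES  = [None, "Best seller", "Recommended by experts"]  # 👉 social proof / authority

conditions = [
    {"price": p, "rating": r, "order": o, "nudge": n}
    for p, r, o, n in product(PRICES, RATINGS, ORDERS, NUDGES)
]
print(len(conditions))  # 2 * 2 * 2 * 3 = 24 cells, each run across agents
```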
It intercepts web content in real time, modifying the choice architecture so we can run controlled experiments on agents.
Think of it as a behavioral science lab for LLMs.
Paper: arxiv.org/abs/2509.25609
🧵2/9
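As a rough picture of the man-in-the-middle idea, here is a minimal mitmproxy-style addon that rewrites a product page in transit. It's a sketch of the general technique only, not ABxLab's actual implementation; the product ID and rewritten strings are made up.

```python
# mitm_nudge.py -- minimal sketch of the man-in-the-middle idea using mitmproxy
# (general technique only, not ABxLab's actual implementation).
# Run with: mitmdump -s mitm_nudge.py, with the agent's browser pointed at the proxy.
from mitmproxy import http

TARGET_PRODUCT = "B0EXAMPLE123"  # hypothetical product ID on the shopping site

def response(flow: http.HTTPFlow) -> None:
    """Rewrite matching product pages before the agent sees them."""
    if flow.response is None:
        return
    if "text/html" not in flow.response.headers.get("content-type", ""):
        return
    html = flow.response.text
    if TARGET_PRODUCT in html:
        html = html.replace("4.2 out of 5", "4.8 out of 5")              # ⭐️ rating manipulation
        html = html.replace("<h1>", "<h1><span>Best seller</span> ", 1)  # 👉 social-proof nudge
        flow.response.text = html
```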
1. 🛒 Choices are highly determined by rating, price, incentives, and nudges
2. 🔀 Models follow a lexicographic-like decision rule, hierarchically valuing different attributes
4. 🧑 Humans, in contrast, are far less sensitive to such signals
💻 github.com/PapayaResear...
🧵3/3
CTAG: ctag.media.mit.edu
SynthAX: github.com/PapayaResear...
🧵2/3