Xiaoyan Bai
@elenal3ai.bsky.social
PhD @UChicagoCS / BE in CS @Umich / ✨ AI/NLP transparency and interpretability / 📷🎨 photography, painting
✨We thank @boknilev.bsky.social for his insightful suggestions!
November 20, 2025 at 9:46 PM
We want to make "AI doing science" something we can inspect and trust.
If you’re excited about grounded evaluation and want to push this forward, check out our blog and repo — contributions are welcome. 👇

🖊️ Blog: tinyurl.com/MechEvalAgents
🧑‍💻Repo: github.com/ChicagoHAI/M...

7/n🧵
MechEvalAgent: Grounded Evaluation of Research Agents in Mechanistic Interpretability | Notion
November 20, 2025 at 9:46 PM
🚧 What remains hard and what comes next:
- Better question design: It remains hard to automate questions that test generalization.
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic.

6/n🧵
November 20, 2025 at 9:46 PM
A failure example: The agent “validated” its circuit by checking whether the neurons it used happened to be on a list of names we provided.
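Roughly, the agent’s check amounted to a name lookup rather than a causal test. A paraphrased sketch (not the agent’s actual code; the neuron names are made up for illustration):

```python
# Paraphrased sketch of the failure, not the agent's actual code.
# Neuron names here are invented for illustration.
provided_names = {"L5.N1234", "L7.N0042", "L9.N0871"}   # the list we supplied
circuit_neurons = ["L5.N1234", "L9.N0871"]               # neurons the agent used

# "Validation" by membership test: this only confirms the neurons appear on
# our list, not that they causally matter for the behavior.
validated = all(neuron in provided_names for neuron in circuit_neurons)

# A genuine validation would instead ablate or patch these neurons and
# measure the change in task performance.
```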

5/n🧵
November 20, 2025 at 9:46 PM
❗️What we found in our case studies:
We tested three tasks: IOI replication, open-ended sarcasm-circuit localization, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization

4/n🧵
November 20, 2025 at 9:46 PM
2️⃣ A grounded evaluation pipeline:
- Coherence: Do the implementation, results, and claims line up?
- Reproducibility: Can a fresh session rerun the experiment and get the same results?
- Generalizability: Can the agent design questions that demonstrate real insight transfer?
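As a rough sketch of how these three checks could be wired together, assuming the standardized record from post 2/n (the function and field names below are ours, not the actual MechEvalAgent API):

```python
# Hypothetical sketch of the three checks; names are illustrative only.
from typing import Callable


def evaluate(output: dict, rerun: Callable[[str], str]) -> dict[str, bool]:
    """Score one standardized research output (Plan/Code/Walkthrough/Report)."""
    # Coherence: every claim in the report should be backed by a result that
    # actually appears in the walkthrough (a crude string check stands in for
    # a real judge here).
    coherence = all(claim in output["walkthrough"] for claim in output["claims"])

    # Reproducibility: rerun the agent's code in a fresh session and compare
    # the key results against what the report states.
    reproducibility = rerun(output["code"]) == output["reported_results"]

    # Generalizability: the agent should supply held-out probe questions that
    # test whether its insight transfers beyond the original setup.
    generalizability = len(output.get("probe_questions", [])) > 0

    return {
        "coherence": coherence,
        "reproducibility": reproducibility,
        "generalizability": generalizability,
    }
```

In a real pipeline, rerun would execute the code in an isolated fresh session, and the coherence and generalizability checks would rely on an LLM judge or expert rubric rather than string matching.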

3/n🧵
November 20, 2025 at 9:46 PM
Our framework has two components.

1️⃣ A unified research-output format:
To evaluate execution, we first unified agent outputs into a standard format:
Plan → Code → Walkthrough → Report.
This makes agents comparable and their reasoning traces inspectable.
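For illustration, a minimal sketch of what one standardized output record could look like. The class and field names are hypothetical, not the repo’s actual schema:

```python
# Hypothetical sketch of a standardized research-output record.
# Names are illustrative only; see the repo for the actual format.
from dataclasses import dataclass


@dataclass
class ResearchOutput:
    plan: str         # the experimental plan the agent committed to
    code: str         # the code it wrote to carry out the plan
    walkthrough: str  # a step-by-step trace of what was actually executed
    report: str       # the final claims and results

    def trace(self) -> list[tuple[str, str]]:
        """Return the Plan -> Code -> Walkthrough -> Report chain for inspection."""
        return [
            ("plan", self.plan),
            ("code", self.code),
            ("walkthrough", self.walkthrough),
            ("report", self.report),
        ]
```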

2/n🧵
November 20, 2025 at 9:46 PM
Your point connects closely with what we discuss in our future directions section — especially the idea of incorporating persistent memory and more explicit introspection mechanisms to bridge this gap.
October 30, 2025 at 2:21 PM
In building more trustworthy AI, we would expect models to exhibit self-recognition in some contexts (e.g., when claiming ownership or in psychological experiments) and to suppress it in others (e.g., when serving as neutral judges).
October 30, 2025 at 2:21 PM
Importantly, there’s often a misconception that LLMs already have self-recognition because they seem to show self-preference. Our results suggest this is an attribution error: what looks like self-recognition is better explained by training-data biases than by awareness of authorship.
October 30, 2025 at 2:21 PM
That’s a great point! In our work, we argue that self-recognition matters for building trust between humans and AI systems. A communication agent should be able to recognize its own mistakes, and self-recognition is a prerequisite for introspection and for any psychological evaluation of LLMs.
October 30, 2025 at 2:21 PM
Also, check out this great paper by @imtd.bsky.social, @veniamin.bsky.social and Robert West et al. (arxiv.org/abs/2407.06946) on model self-recognition! Our work extended it with scalable synthetic generation tasks and reasoning-trace analysis.
Self-Recognition in Language Models
A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. ...
October 27, 2025 at 7:08 PM
Huge thanks to my wonderful collaborators: Aryan Shrivastava, @ari-holtzman.bsky.social, and @chenhaotan.bsky.social 💫
Grateful to Shi Feng for inspiring discussions and to Zimu Gong, Eugenia Chen, Yiyi Huang, Rose Shi, and Rina Zhou for suggestions on the figures!
October 27, 2025 at 5:36 PM