Xiaoyan Bai
@elenal3ai.bsky.social
PhD @UChicagoCS / BE in CS @Umich / ✨ AI/NLP transparency and interpretability / 📷🎨 photography, painting
✨We thank @boknilev.bsky.social for his insightful suggestions!
November 20, 2025 at 9:46 PM
We want to make "AI doing science" something we can inspect and trust.
If you’re excited about grounded evaluation and want to push this forward, check out our blog and repo — contributions are welcome. 👇

🖊️ Blog: tinyurl.com/MechEvalAgents
🧑‍💻Repo: github.com/ChicagoHAI/M...

7/n🧵
MechEvalAgent: Grounded Evaluation of Research Agents in Mechanistic Interpretability | Notion
November 20, 2025 at 9:46 PM
🚧 What remains hard and what comes next:
- Better question design: It remains hard to automate questions that test generalization.
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic.

6/n🧵
November 20, 2025 at 9:46 PM
A failure example: The agent “validated” its circuit by checking whether the neurons it used happened to be on a list of names we provided.
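Roughly, the agent’s check amounted to a name lookup rather than a causal test. A paraphrased sketch (not the agent’s actual code; the neuron names are made up for illustration):

```python
# Paraphrased sketch of the failure, not the agent's actual code.
# Neuron names here are invented for illustration.
provided_names = {"L5.N1234", "L7.N0042", "L9.N0871"}   # the list we supplied
circuit_neurons = ["L5.N1234", "L9.N0871"]               # neurons the agent used

# "Validation" by membership test: this only confirms the neurons appear on
# our list, not that they causally matter for the behavior.
validated = all(neuron in provided_names for neuron in circuit_neurons)

# A genuine validation would instead ablate or patch these neurons and
# measure the change in task performance.
```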

5/n🧵
November 20, 2025 at 9:46 PM
❗️What we found in our case studies:
We tested three tasks: IOI replication, open-ended sarcasm-circuit localization, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization

4/n🧵
November 20, 2025 at 9:46 PM
2️⃣ A grounded evaluation pipeline:
- Coherence: Do the implementation, results, and claims line up?
- Reproducibility: Can a fresh session rerun the experiment and get the same results?
- Generalizability: Can the agent design questions that demonstrate real insight transfer?
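As a rough sketch of how these three checks could be wired together, assuming the standardized record from post 2/n (the function and field names below are ours, not the actual MechEvalAgent API):

```python
# Hypothetical sketch of the three checks; names are illustrative only.
from typing import Callable


def evaluate(output: dict, rerun: Callable[[str], str]) -> dict[str, bool]:
    """Score one standardized research output (Plan/Code/Walkthrough/Report)."""
    # Coherence: every claim in the report should be backed by a result that
    # actually appears in the walkthrough (a crude string check stands in for
    # a real judge here).
    coherence = all(claim in output["walkthrough"] for claim in output["claims"])

    # Reproducibility: rerun the agent's code in a fresh session and compare
    # the key results against what the report states.
    reproducibility = rerun(output["code"]) == output["reported_results"]

    # Generalizability: the agent should supply held-out probe questions that
    # test whether its insight transfers beyond the original setup.
    generalizability = len(output.get("probe_questions", [])) > 0

    return {
        "coherence": coherence,
        "reproducibility": reproducibility,
        "generalizability": generalizability,
    }
```

In a real pipeline, rerun would execute the code in an isolated fresh session, and the coherence and generalizability checks would rely on an LLM judge or expert rubric rather than string matching.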

3/n🧵
November 20, 2025 at 9:46 PM
Our framework has two components.

1️⃣ A unified research-output format:
To evaluate execution, we first unified agent outputs into a standard format:
Plan → Code → Walkthrough → Report.
This makes agents comparable and their reasoning traces inspectable.
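For illustration, a minimal sketch of what one standardized output record could look like. The class and field names are hypothetical, not the repo’s actual schema:

```python
# Hypothetical sketch of a standardized research-output record.
# Names are illustrative only; see the repo for the actual format.
from dataclasses import dataclass


@dataclass
class ResearchOutput:
    plan: str         # the experimental plan the agent committed to
    code: str         # the code it wrote to carry out the plan
    walkthrough: str  # a step-by-step trace of what was actually executed
    report: str       # the final claims and results

    def trace(self) -> list[tuple[str, str]]:
        """Return the Plan -> Code -> Walkthrough -> Report chain for inspection."""
        return [
            ("plan", self.plan),
            ("code", self.code),
            ("walkthrough", self.walkthrough),
            ("report", self.report),
        ]
```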

2/n🧵
November 20, 2025 at 9:46 PM
Your point connects closely with what we discuss in our future directions section — especially the idea of incorporating persistent memory and more explicit introspection mechanisms to bridge this gap.
October 30, 2025 at 2:21 PM
In building more trustworthy AI, we would expect models to exhibit self-recognition in some contexts (e.g., when claiming ownership or in psychological experiments) and to suppress it in others (e.g., when serving as neutral judges).
October 30, 2025 at 2:21 PM
Importantly, there’s often a misconception that LLMs already have self-recognition because they seem to show self-preference. Our results suggest this is an attribution error: what looks like self-recognition is better explained by training-data biases than by awareness of authorship.
October 30, 2025 at 2:21 PM
That’s a great point! In our work, we argue that self-recognition matters for building trust between humans and AI systems. A communication agent should be able to recognize its own mistakes, and self-recognition is a prerequisite for introspection and for any psychological evaluation of LLMs.
October 30, 2025 at 2:21 PM
Also, check out this great paper by @imtd.bsky.social, @veniamin.bsky.social and Robert West et al. (arxiv.org/abs/2407.06946) on model self-recognition! Our work extended it with scalable synthetic generation tasks and reasoning-trace analysis.
Self-Recognition in Language Models
A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. ...
October 27, 2025 at 7:08 PM
Huge thanks to my wonderful collaborators: Aryan Shrivastava, @ari-holtzman.bsky.social, and @chenhaotan.bsky.social 💫
Grateful to Shi Feng for inspiring discussions and to Zimu Gong, Eugenia Chen, Yiyi Huang, Rose Shi, and Rina Zhou for suggestions on the figures!
October 27, 2025 at 5:36 PM