If you’re excited about grounded evaluation and want to push this forward, check out our blog and repo; contributions are welcome. 👇
🖊️ Blog: tinyurl.com/MechEvalAgents
🧑‍💻 Repo: github.com/ChicagoHAI/M...
7/n🧵
We are actively working on the following issues:
- Better question design: It remains hard to automate questions that test generalization.
- Meta-evaluation: How do we evaluate the evaluators?
- Domain adapters: Scaling beyond Mech Interp requires expert-defined logic for other fields.
6/n🧵
5/n🧵
We tested across three tasks: IOI replication, open-ended localization of a sarcasm circuit, and a human-written repo. Three failure modes kept appearing:
- Lack of Meta-Knowledge
- Implicit Hallucinations
- Undefined Generalization
4/n🧵
- Coherence: Do the implementation, results, and claims line up?
- Reproducibility: Can a fresh session rerun the experiment and get the same results?
- Generalizability: Can the agent design questions that demonstrate real insight transfer?
3/n🧵
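For concreteness, here is a minimal Python sketch of how the three checks above could be carried as a score record (the interface and 0–1 scale are assumptions for illustration, not the repo's actual scoring API):

```python
# Rough sketch of the three execution criteria as a score record.
# The criterion names follow the thread; the scoring interface and
# the 0-1 scale are illustrative assumptions, not the repo's API.
from dataclasses import dataclass

@dataclass
class ExecutionScores:
    coherence: float         # do implementation, results, and claims line up?
    reproducibility: float   # does a fresh session reproduce the results?
    generalizability: float  # do probing questions show the insight transfers?

    def overall(self) -> float:
        # Unweighted mean, purely illustrative.
        return (self.coherence + self.reproducibility + self.generalizability) / 3

# Example: a run that reproduces cleanly but generalizes poorly.
print(ExecutionScores(coherence=0.9, reproducibility=1.0, generalizability=0.3).overall())
```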
1️⃣ A unified research-output format:
To evaluate execution, we first unified agent outputs into a standard format:
Plan → Code → Walkthrough → Report.
This makes agents comparable and their reasoning traces inspectable.
2/n🧵
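To make that format concrete, a normalized record might look roughly like this (a minimal sketch; the field names are illustrative assumptions, not the repo's actual schema):

```python
# Minimal sketch of the unified research-output format described above.
# Field names are illustrative assumptions, not the repo's actual schema.
from dataclasses import dataclass, field

@dataclass
class ResearchOutput:
    plan: str             # the experimental plan, in prose
    code: dict[str, str]  # filename -> source for the experiment code
    walkthrough: str      # step-by-step trace tying plan, code, and results together
    report: str           # final claims plus supporting evidence
    metadata: dict = field(default_factory=dict)  # e.g. agent name, model, seed

def is_evaluable(out: ResearchOutput) -> bool:
    """Only outputs with all four stages present can be scored."""
    return all([out.plan, out.code, out.walkthrough, out.report])
```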
Grateful to Shi Feng for inspiring discussions and to Zimu Gong, Eugenia Chen, Yiyi Huang, Rose Shi, and Rina Zhou for suggestions on the figures!