https://sebschu.com
w/ Michelle Yang, @sivareddyg.bsky.social, @msonderegger.bsky.social and @dallascard.bsky.social 👇 (1/12)
We're looking for someone to join the research agent evaluation team, starting Fall 2025. The application link will be available soon, but feel free to send us your CV and/or come talk to us at #ACL2025. 🧵
We introduce RExBench, a benchmark that tests if a coding agent can implement a novel experiment based on existing research and code.
Finding: Most agents we tested had a low success rate, but there is promise!