hjalmarw.bsky.social
Reposted
How close are current AI agents to automating AI research itself? Our new ML research engineering benchmark (RE-Bench) addresses this question by directly comparing frontier models such as Claude 3.5 Sonnet and o1-preview with 50+ human experts on 7 challenging research engineering tasks.
November 25, 2024 at 7:42 PM