Graduate student at @bethgelab.bsky.social
oripress.com
To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!
algotune.io
CiteMe is a challenging benchmark for LM-based agents to find paper citations, moving beyond simple multiple-choice Q&A to real-world use cases.
Come by and say hi :)
citeme.ai
I develop autonomous systems for programming, research-level question answering, finding security vulnerabilities & other useful+challenging tasks.
I do this by building frontier-pushing benchmarks and agents that do well on them.
See you at NeurIPS!