anandraghavan.bsky.social
@anandraghavan.bsky.social
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces arxiv.org/abs/2601.11868
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently…
arxiv.org
February 18, 2026 at 11:01 PM
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks arxiv.org/abs/2601.02439
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets…
arxiv.org
February 17, 2026 at 11:01 PM
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation arxiv.org/abs/2601.09688
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require…
arxiv.org
February 12, 2026 at 11:00 PM
CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment arxiv.org/abs/2508.02298
CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically…
arxiv.org
December 26, 2025 at 11:01 PM
Evaluating AI’s ability to perform scientific research tasks openai.com/index/fronti...
Evaluating AI’s ability to perform scientific research tasks
We introduce FrontierScience, a new benchmark that evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.
openai.com
December 25, 2025 at 11:00 PM
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality arxiv.org/abs/2512.107...
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text…
arxiv.org
December 24, 2025 at 11:00 PM
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following arxiv.org/abs/2511.10507
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted…
arxiv.org
December 22, 2025 at 11:01 PM
Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation arxiv.org/abs/2507.17937
Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation
Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing…
arxiv.org
December 19, 2025 at 11:00 PM
The State of Generative AI in the Enterprise - report from Menlo Ventures menlovc.com/perspective/...
2025: The State of Generative AI in the Enterprise | Menlo Ventures
For all the fears of over-investment, AI is spreading across enterprises at a pace with no precedent in modern software history.
menlovc.com
December 14, 2025 at 10:35 PM
How Google Maps quietly allocates survival across London’s restaurants - and how I built a dashboard to see through it laurenleek.substack.com/p/how-google...
How Google Maps quietly allocates survival across London’s restaurants - and how I built a dashboard to see through it
I wanted a dinner recommendation and got a research agenda instead. Using 13000+ restaurants, I rebuild its ratings with machine learning and map how algorithmic visibility actually distributes power.
laurenleek.substack.com
December 13, 2025 at 10:35 PM
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems arxiv.org/abs/2502.04510
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with…
arxiv.org
December 12, 2025 at 6:39 PM
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence huggingface.co/papers/2511....
Paper page - From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Join the discussion on this paper page
huggingface.co
December 11, 2025 at 10:35 PM
How confessions can keep language models honest openai.com/index/how-co...
openai.com
December 10, 2025 at 10:35 PM
Diversifying Society’s Leaders? The Determinants and Causal Effects of Admission to Highly Selective Private Colleges www.nber.org/papers/w31492
Diversifying Society’s Leaders? The Determinants and Causal Effects of Admission to Highly Selective Private Colleges
Founded in 1920, the NBER is a private, non-profit, non-partisan organization dedicated to conducting economic research and to disseminating research findings among academics, public policy makers,…
www.nber.org
December 10, 2025 at 6:39 PM
The Iceberg Index: Measuring Skills-centered Exposure in the AI Economy iceberg.mit.edu
Project Iceberg - Coordinating the Human-AI Future
iceberg.mit.edu
December 9, 2025 at 10:35 PM