We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.
We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness.
The paper is also live on arXiv, with two new sections: one discussing alternative uncertainty estimation methods, and one on a new 'bias from developer recruitment' factor that has an unclear effect on slowdown.
i’ve only ever used copilot in vs code and 95% of what I hear about is cursor
any recs for AI code tools that are cool or weird or interesting or ones people are sleeping on?
good analysis from @simonwillison.net on cursor’s confusing pricing/usage changes
ai.meta.com/research/pub...