1. Scrape the Archive for Google queries and audit the decline in front-page quality
2. Start longitudinal collection on Google quality (and on the LLMs now, before in-chat ads arrive)
3. More generally, do "prospective science": collect data now about things we think will go down the toilet.
metr.github.io/autonomy-eva...
epoch.ai/frontiermath
If you assume GPT-5 fails all 23 excluded SWE-Bench problems, then Claude 4.0 > GPT-5
x.com/gneubig/stat...
other coding
x.com/eli_lifland/...
aider.chat/docs/leaderboa