https://www.akbir.dev
Before we only featured our own evaluation results, but this new data comes from trusted external leaderboards. And we've got more on the way 🧵
Before we only featured our own evaluation results, but this new data comes from trusted external leaderboards. And we've got more on the way 🧵
@akbir.bsky.social
This benchmark uses the factory-building game Factorio to test complex, long-term planning, with settings for lab-play (structured tasks) and open-play (unbounded growth).
jackhopkins.github.io/factorio-lea...
@akbir.bsky.social
This benchmark uses the factory-building game Factorio to test complex, long-term planning, with settings for lab-play (structured tasks) and open-play (unbounded growth).
jackhopkins.github.io/factorio-lea...
As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
its really sensible, practical and can be done now, even before systems are superintelligent.
youtu.be/6Unxqr50Kqg?...
its really sensible, practical and can be done now, even before systems are superintelligent.
youtu.be/6Unxqr50Kqg?...
Beautiful.
Beautiful.
Executive Order 14110 was revoked (Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence)
Executive Order 14110 was revoked (Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence)
And it helped all students, especially girls who were initially behind.
And it helped all students, especially girls who were initially behind.
…yet in many ways LLMs are uniquely powerful among new technologies for helping people equitably in education and healthcare. We need an urgent focus on how to do that
…yet in many ways LLMs are uniquely powerful among new technologies for helping people equitably in education and healthcare. We need an urgent focus on how to do that
On the other hand, scientific illustrations are apparently just anime now arxiv.org/pdf/2501.06458
On the other hand, scientific illustrations are apparently just anime now arxiv.org/pdf/2501.06458
alignment.anthropic.com/2025/recomme...
alignment.anthropic.com/2025/recomme...