galen
@nel.ag
learning
new blogpost out :)
October 1, 2025 at 3:28 PM
Reposted by galen
We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT.

We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.
August 13, 2025 at 10:38 PM
August 13, 2025 at 10:26 AM
Reposted by galen
In a new report, we evaluate whether GPT-5 poses significant catastrophic risks via AI R&D acceleration, rogue replication, or sabotage of AI labs.

We conclude that this seems unlikely. However, capability trends continue rapidly, and models display increasing eval awareness.
August 8, 2025 at 1:20 AM
the model doesn’t self-correct after 3 pages of pushback, but solves it fine with a slightly different prompt. the brittleness here is obviously not just a tokenizer thing.
We have a lot of fun tripping up AI with this, but asking it to parse a word by individual letters is kind of a nonsensical question given how tokenizers operate. It's like asking a Chinese speaker how many G's are in 中国, that's not how they process language.
August 8, 2025 at 12:48 PM
Reposted by galen
We have open-sourced anonymized data and core analysis code for our developer productivity RCT.

The paper is also live on arXiv, with two new sections: One discussing alternative uncertainty estimation methods, and a new 'bias from developer recruitment' factor that has unclear effect on slowdown.
July 30, 2025 at 8:10 PM
I'd really like to see a tool built around doing many parallel generations, esp with unit tests. Seems like a big strength of llms that's totally missing from the mainstream stuff
what else are people using btw?
i’ve only ever used copilot in vs code and 95% of what I hear about is cursor

any recs for AI code tools that are cool or weird or interesting or ones people are sleeping on?
dame @dame.is · Jul 5
“Firstly, the era of VC-subsidized tokens may be coming to an end, especially for products like Cursor which are way past demonstrating product-market fit.”

good analysis from @simonwillison.net on cursor’s confusing pricing/usage changes
July 8, 2025 at 4:05 PM
checking if bsky uses POST or PUT requests for putting posts in the database
June 19, 2025 at 9:16 PM
ai.meta.com
June 11, 2025 at 5:29 PM
experiments should be tracked as a relational graph
June 1, 2025 at 10:55 PM