shane caldwell
shanecaldwell.bsky.social
synthetic data, RL, hackbots - writing at https://hackbot.dad/
new blog: RL Needed LLMs Because Agency Requires Priors

Mostly a retrospective on how I mourned RL after AlphaZero and how much better it feels that it's back.

If you weren't working with DQNs, it's hard to appreciate just how well things work with LLMs.

hackbot.dad/writing/rl-l...
August 25, 2025 at 2:16 PM
GPT-5 got a lot of mixed reactions over the last week or so, and I wanted to talk about:

- Chart crime
- Why I barely read the model card anymore
- Why public benchmarks aren't very relevant to you, and why you should invest the time in building something custom

hackbot.dad/writing/agon...
GPT-5 is Good, Actually: The Agony and Ecstasy of Public Benchmarks
An attempt to explain why benchmarks are either bad or secret, and why the bar charts don't matter so much.
hackbot.dad
August 18, 2025 at 12:07 AM
Reposted by shane caldwell
Incoming: Dreadnode paper drop from Shane Caldwell and the crew.

PentestJudge—Judging Agent Behavior Against Operational Requirements: arxiv.org/abs/2508.02921

Explore how we built an LLM-as-judge system for evaluating the operations of pentesting agents (inspired by PaperBench).
August 6, 2025 at 6:31 PM