Mostly a retrospective on how I mourned RL after AlphaZero and how much better it feels that it's back.
If you weren't working with DQNs, it's hard to appreciate just how well things work with LLMs.
hackbot.dad/writing/rl-l...
- Chart crime
- Why I barely read the model card anymore
- Why public benchmarks aren't very relevant to you, and you should invest the time in building something custom
hackbot.dad/writing/agon...
PentestJudge: Judging Agent Behavior Against Operational Requirements: arxiv.org/abs/2508.02921
Explore how we built an LLM-as-judge system for evaluating the operations of pentesting agents (inspired by PaperBench).