Arthur Conmy
arthurconmy.bsky.social
Aspiring 10x reverse engineer at Google DeepMind
Good question. I'm not sure. arxiv.org/abs/2310.18512 has a bunch of discussion of mitigations
Preventing Language Models From Hiding Their Reasoning
Large language models (LLMs) often benefit from intermediate steps of reasoning to generate answers to complex problems. When these intermediate steps of reasoning are used to monitor the activity of ...
arxiv.org
January 4, 2025 at 3:33 PM
This isn't just about edge cases either: 1) happens with nice models like Claude, and 2) is even true for dumb models like Gemma-2 2B
January 2, 2025 at 8:50 PM
2) Verification is considerably harder than generation. Even when there are only a few hundred tokens, it often takes me several minutes to understand whether the reasoning is OK or not
January 2, 2025 at 8:50 PM
Still, there is no ground truth for interpretability, so progress is tough
December 10, 2024 at 7:45 PM
Awesome research. The caveat is that humans were working for 8 hours but were explicitly encouraged to get results after 2 hours, so I buy the claim.
November 22, 2024 at 10:22 PM