Mainly interested in Language Model Interpretability and Model Diffing.
MATS 7.0 Winter 2025 Scholar w/ Neel Nanda
jkminder.ch
Paper: www.arxiv.org/abs/2510.13900
(9/9)
Blogpost: www.alignmentforum.org/posts/sBSjEB... (8/8)
Setup: We compute per-position average activation differences between a base and a finetuned model on unrelated text. We then inspect these differences with Patchscope and by steering the finetuned model with them. (2/8)
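A minimal sketch of the difference computation described in the setup, on toy data. Everything here is an illustrative assumption rather than the paper's code: the function names `mean_activation_difference` and `steer`, the array shapes, the `scale` parameter, and the constant shift standing in for finetuning are all hypothetical. Patchscope is not shown.

```python
import numpy as np


def mean_activation_difference(base_acts, finetuned_acts):
    """Per-position average of (finetuned - base) activations.

    Both inputs are assumed to have shape (n_samples, n_positions, d_model)
    and to come from the same layer on the same unrelated text.
    Returns an array of shape (n_positions, d_model).
    """
    base_acts = np.asarray(base_acts, dtype=float)
    finetuned_acts = np.asarray(finetuned_acts, dtype=float)
    if base_acts.shape != finetuned_acts.shape:
        raise ValueError("activation arrays must have the same shape")
    # Average over samples only, so each token position keeps its own vector.
    return (finetuned_acts - base_acts).mean(axis=0)


def steer(acts, diff, position, scale=1.0):
    """Add the scaled difference vector of one position to all positions.

    acts has shape (n_positions, d_model); diff has the same shape.
    This is one simple way to steer; the scale is a hypothetical knob.
    """
    return np.asarray(acts, dtype=float) + scale * diff[position]


# Toy stand-in for real model activations (assumption: random data).
rng = np.random.default_rng(0)
n_samples, n_positions, d_model = 8, 5, 4
base = rng.normal(size=(n_samples, n_positions, d_model))

# Pretend finetuning shifts every activation by a fixed vector.
shift = np.array([1.0, 0.0, -2.0, 0.5])
finetuned = base + shift

diff = mean_activation_difference(base, finetuned)
steered = steer(base[0], diff, position=0, scale=2.0)
```

With real models, `base` and `finetuned` would be residual-stream activations collected from both models on the same inputs, and the steered activations would be written back during the finetuned model's forward pass.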
More detailed thread: bsky.app/profile/deni...
Paper: arxiv.org/abs/2507.08802
Post: lesswrong.com/posts/xmpauE...
Paper Thread: bsky.app/profile/buta...
Paper: arxiv.org/abs/2504.02922