Anna Tsvetkov
@annatsv.bsky.social
Postdoc @ Princeton AI Lab
Natural and Artificial Minds
Prev: PhD @ Brown, MIT FutureTech
Website: https://annatsv.github.io/
🔍 What are the limits of interpretability in ML?
Mech interp often stays at Marr’s algorithmic level, but without the computational level (what the task is, what counts as the right solution), the mechanisms we find can look arbitrary. Why does a model learn one algorithm rather than another?
🧵 (1/2)
November 25, 2025 at 11:53 PM
Anthropic has a great new piece on “Signs of introspection in large language models” 👉 www.anthropic.com/research/int...

🤔 Neat evidence that LLMs can report on manipulated activations, with big caveats!

🧠 But leaves open: what are the “internal states” an LLM can introspect in the first place?
Emergent introspective awareness in large language models
Research from Anthropic on the ability of large language models to introspect
November 1, 2025 at 4:49 PM
Reposted by Anna Tsvetkov
This is a beautiful paper! The first third helpfully labels a stream of recent work in philosophy of AI as "propositional interpretability". The idea is to use propositional attitudes like belief, desire, and intention to help explain AI in a way that we can understand. 1/n
January 29, 2025 at 1:24 PM
Reposted by Anna Tsvetkov
"The AI risk repository, which includes over 700 AI risks grouped by causal factors (e.g. intentionality), and domains (e.g. discrimination), was born out of a desire to understand the overlaps and disconnects in AI safety research"
#AIEthics

techcrunch.com/2024/08/14/m...
MIT researchers release a repository of AI risks | TechCrunch
A group of researchers at MIT and elsewhere have compiled what they claim is the most thorough database of possible risks around AI use.
January 5, 2025 at 9:03 PM