Natural and Artificial Minds
Prev: PhD @ Brown, MIT FutureTech
Website: https://annatsv.github.io/
Mech interp often stays at Marr’s algorithmic level, but without the computational level (what the task is, what counts as the right solution), the mechanisms we find can look arbitrary. Why does a model learn one algorithm rather than another?
🧵 (1/2)
🤔 Neat evidence that LLMs can report on manipulated activations, with big caveats!
🧠 But leaves open: what are the “internal states” an LLM can introspect in the first place?
arxiv.org/abs/2501.15740
#AIEthics
techcrunch.com/2024/08/14/m...