Millicent Li
@millicentli.bsky.social
CS PhD Student @ Northeastern, former ugrad @ UW, UWNLP --
https://millicentli.github.io/
In short: verbalizer evals are broken! To know what info a model REMOVES from its input, reconstruction is better than verbalization. And verbalization tells us very little about what a model ADDS to its input! w/ A. Ceballos, G. Rogers, @nsaphra.bsky.social @byron.bsky.social

8/8
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target ...
arxiv.org
September 17, 2025 at 7:19 PM
What about the information a model ADDS to the embedding? Unfortunately, our experiments with synthetic fact datasets revealed that the verbalizer LM can only provide facts it already knows—it can’t describe facts only the target knows.

7/8
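A minimal sketch of how such a synthetic-fact check can be scored. The entity names, data format, and helper functions below are invented for illustration, not the paper's actual setup: make up people and places no model has ever seen, state one fact per person, and count how often a description of the target's activations surfaces the novel fact.

```python
import random

def make_synthetic_facts(n: int = 50, seed: int = 0):
    """Fictional (person, birthplace) pairs: invented strings, so the fact
    cannot already live in any LM's weights."""
    rng = random.Random(seed)
    people = [f"Velno Tarsk {i}" for i in range(n)]
    cities = ["Zelmora", "Kravenport", "Ilsted", "Quorath"]
    return [(p, rng.choice(cities)) for p in people]

def novel_fact_recall(facts, descriptions):
    """Fraction of invented facts whose object string appears in the
    verbalizer's description of the target's activations."""
    hits = sum(
        city.lower() in desc.lower()
        for (_, city), desc in zip(facts, descriptions)
    )
    return hits / len(facts)

facts = make_synthetic_facts()
statements = [f"{person} was born in {city}." for person, city in facts]

# `descriptions` would come from running the verbalizer over the target model's
# activations for each statement; a verbalizer that can only report facts it
# already knows scores near zero here.
dummy_descriptions = ["The text states a fact about a person."] * len(facts)
print(novel_fact_recall(facts, dummy_descriptions))  # -> 0.0
```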
On our evaluation datasets, many LMs are in fact capable of largely reconstructing the target’s inputs from those internal representations! If we aim to know what information has been REMOVED by processing text into an embedding, inversion is more direct than verbalization.

6/8
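One simple way to quantify "largely reconstructing the inputs" is token-level overlap between the original text and the inverter's output (BLEU or exact match work as well). A self-contained sketch; the reconstruction string is a made-up stand-in for real inverter output.

```python
from collections import Counter

def token_f1(reference: str, reconstruction: str) -> float:
    """Bag-of-words F1 between the original input and the reconstruction."""
    ref, rec = reference.lower().split(), reconstruction.lower().split()
    overlap = sum((Counter(ref) & Counter(rec)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(rec), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

original = "The Space Needle is located in downtown Seattle."
reconstructed = "The Space Needle is in Seattle."  # illustrative inverter output
print(f"token F1: {token_f1(original, reconstructed):.2f}")
```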
Fine, but the verbalizer only has access to the target model’s internal representations, not to its inputs—or does it? Prior work in vision and language has shown model embeddings can be inverted to reconstruct inputs. Let’s see if these representations are invertible!

5/8
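A minimal sketch of one common inversion recipe (an assumed setup, not necessarily the paper's): project the target model's layer activations into a decoder LM's embedding space, prepend them as a soft prefix, and train the projection (and optionally the decoder) to reconstruct the original input tokens. The `gpt2` checkpoints and layer 6 are arbitrary placeholders.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

target_name, inverter_name = "gpt2", "gpt2"   # placeholder model choices
LAYER = 6                                     # which target layer to invert (arbitrary)

tok = AutoTokenizer.from_pretrained(target_name)
target = AutoModel.from_pretrained(target_name).eval()        # frozen target model
inverter = AutoModelForCausalLM.from_pretrained(inverter_name)

# Trainable bridge from the target's hidden size into the inverter's embedding space.
proj = torch.nn.Linear(target.config.hidden_size, inverter.config.hidden_size)

text = "The Eiffel Tower is located in Paris."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    acts = target(ids, output_hidden_states=True).hidden_states[LAYER]  # (1, T, d_target)

# Prepend the projected activations as a soft prefix; score the inverter only on
# reproducing the original tokens.
prefix = proj(acts)                                   # (1, T, d_inverter)
tok_embeds = inverter.get_input_embeddings()(ids)     # (1, T, d_inverter)
inputs_embeds = torch.cat([prefix, tok_embeds], dim=1)
labels = torch.cat([torch.full_like(ids, -100), ids], dim=1)  # ignore loss on the prefix

loss = inverter(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()  # in practice: loop over a corpus with an optimizer on proj (and inverter)
```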
To the contrary, we find that all the verbalizer needs is the target model’s inputs! Given just the original input text (or a reconstruction of it from the activations), the verbalizer’s LM beats its own “interpretive” verbalization on most tasks.

4/8
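The control that exposes this is easy to run: hand the verbalizer's underlying LM only the target's raw input plus the question, with no activations at all, and score its answers the same way. A sketch with `gpt2` as a placeholder verbalizer and made-up eval items.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder for the verbalizer's underlying LM
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

# Hypothetical eval items: (target model's input text, question, gold answer).
examples = [
    ("The Space Needle is located in downtown Seattle.", "Which city is mentioned?", "Seattle"),
    ("Marie Curie won two Nobel Prizes.", "Who is the passage about?", "Marie Curie"),
]

correct = 0
for passage, question, gold in examples:
    prompt = f"Passage: {passage}\nQuestion: {question}\nAnswer:"
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = lm.generate(**enc, max_new_tokens=8, do_sample=False,
                          pad_token_id=tok.eos_token_id)
    answer = tok.decode(out[0, enc.input_ids.shape[1]:], skip_special_tokens=True)
    correct += int(gold.lower() in answer.lower())

print(f"input-only accuracy: {correct / len(examples):.2f}")
# If this no-activations control matches or beats the activation-based verbalization,
# the benchmark is not measuring privileged information from the target model.
```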
First, a step back: How do we evaluate natural language interpretations of a target model’s representations? Often, by the accuracy of a verbalizer’s answers to simple factual questions. But does a verbalizer even need privileged information from the target model to succeed?

3/8
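Concretely, that evaluation usually reduces to question answering scored against gold labels. A minimal harness sketch; `answer_from_activations` is a hypothetical stand-in for whichever verbalization method is under test.

```python
# Score the verbalizer's answers to simple factual questions with soft exact match.
def exact_match(prediction: str, gold: str) -> bool:
    return gold.strip().lower() in prediction.strip().lower()

def evaluate_verbalizer(answer_from_activations, examples):
    """examples: list of (target_activations, question, gold_answer) triples."""
    correct = sum(
        exact_match(answer_from_activations(acts, question), gold)
        for acts, question, gold in examples
    )
    return correct / len(examples)
```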
Methods like Patchscopes promise to allow inspection of the inner workings of LMs. But how do we know that the methods are describing info encoded in the target model’s activations, and not just in the verbalizer weights?

2/8
Asma Ghandeharioun on X: "🧵Can we “ask” an LLM to “translate” its own hidden representations into natural language? We propose 🩺Patchscopes, a new framework for decoding specific information from a representation by “patching” it into a separate inference pass, independently of its original context. 1/9 https://t.co/Of98dLBXLE"
x.com
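For reference, the mechanics behind a Patchscopes-style readout can be sketched with a forward hook: cache a hidden state from the model's pass over a source prompt, overwrite one position in a separate inspection prompt with it, and let the model decode. This is a toy illustration; the model, layer, and prompts are arbitrary choices rather than the paper's settings, and the question raised above is whether what comes out reflects the activation or just the decoding model's own weights.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; other decoder-only LMs expose their blocks differently
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

LAYER = 6  # which block's output to read and patch (arbitrary choice)

# 1) Run the model on a source prompt and cache one hidden state.
src = tok("The Space Needle is a landmark in Seattle", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**src, output_hidden_states=True).hidden_states
cached = hidden_states[LAYER + 1][0, -1]  # output of block LAYER at the last source token

# 2) Patch that vector into a separate inference pass over a generic inspection prompt.
insp = tok("cat -> cat; 135 -> 135; hello -> hello; x", return_tensors="pt")
patch_pos = insp.input_ids.shape[1] - 1   # overwrite the final placeholder token

def patch_hook(module, inputs, output):
    hidden = output[0]                    # GPT-2 blocks return (hidden_states, ...)
    if hidden.shape[1] > patch_pos:       # prefill pass only, not cached decode steps
        hidden[0, patch_pos] = cached
    return output

handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(**insp, max_new_tokens=8, do_sample=False,
                         pad_token_id=tok.eos_token_id)
handle.remove()

print(tok.decode(out[0, insp.input_ids.shape[1]:]))  # what the LM "reads" from the patch
```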