Check us out to see ongoing work on interpreting reasoning models.
Thank you collaborators! Lihao Sun,
@wendlerc.bsky.social ,
@viegas.bsky.social ,
@wattenberg.bsky.social
Paper link: arxiv.org/abs/2504.14379
9/n
✅ We find a subspace critical for self-verification.
✅ In our setup, previous-token heads move the residual stream into this subspace; a different task may use a different mechanism.
✅ This subspace activates verification-related MLP weights, promoting tokens like “success”.
8/n
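A minimal sketch of what "ablating the subspace" means here (not our exact code; `basis` is a hypothetical orthonormal d_model × k matrix spanning the candidate subspace):

import torch

def ablate_subspace(resid: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # resid: (..., d_model) residual-stream activations
    # basis: (d_model, k) orthonormal basis for the candidate subspace
    # Project onto the subspace, then subtract that component out.
    proj = (resid @ basis) @ basis.T
    return resid - proj

# Hypothetical use inside a forward hook at one layer:
#   resid = ablate_subspace(resid, verif_basis)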
Here we provide CountDown as an ICL task.
Interestingly, in R1-14B, our interventions lead to partial success: the LM fails self-verification but then corrects itself.
7/n
We use “interlayer communication channels” to rank how much each head’s OV circuit aligns with the “receptive fields” of verification-related MLP weights.
Disabling just *three* heads disables self-verification and deactivates the verification-related MLP weights.
6/n
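Roughly, one hedged sketch of such a channel score (shapes are assumptions, noted in comments; `w_in` stands in for a verification neuron's input weights):

import torch

def channel_score(W_V: torch.Tensor, W_O: torch.Tensor, w_in: torch.Tensor) -> torch.Tensor:
    # W_V : (d_model, d_head)  value projection of one head
    # W_O : (d_head, d_model)  output projection of one head
    # w_in: (d_model,)         input weights ("receptive field") of one
    #                          verification-related MLP neuron
    OV = W_V @ W_O  # (d_model, d_model) full OV circuit of the head
    # The head's contribution to this neuron's pre-activation passes
    # through OV @ w_in, so its norm bounds how strongly the head can
    # drive the neuron.
    return (OV @ w_in).norm() / w_in.norm()

# Hypothetical ranking over heads, given a list of (W_V, W_O) pairs:
#   ranked = sorted(heads, key=lambda h: -channel_score(*h, w_in))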
5/n
Interestingly, we often see English tokens for the "valid direction" and Chinese tokens for the "invalid direction".
4/n
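A logit-lens-style sketch of how you'd see this, assuming `W_U` is the unembedding matrix and `v` a hypothetical direction from the subspace:

import torch

def top_tokens(direction: torch.Tensor, W_U: torch.Tensor, tokenizer, k: int = 10):
    # direction: (d_model,) candidate direction in the subspace
    # W_U:       (d_model, vocab_size) unembedding matrix
    logits = direction @ W_U               # (vocab_size,)
    ids = torch.topk(logits, k).indices
    return [tokenizer.decode([int(i)]) for i in ids]

# Inspecting both signs of a hypothetical direction v:
#   top_tokens(v, W_U, tok)    # e.g. English tokens like "success"
#   top_tokens(-v, W_U, tok)   # e.g. Chinese tokens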
3/n
Case study: Let’s study self-verification!
Setup: We train Qwen-3B on CountDown until mode collapse, resulting in nicely structured CoT that’s easy to parse and analyze.
2/n
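For context, a hypothetical CountDown instance (assuming the standard formulation: combine the given numbers with +, -, *, / to hit the target) and the check self-verification has to perform:

# Hypothetical CountDown instance.
instance = {"numbers": [4, 9, 10, 13], "target": 24}

def verify(expr: str, target: int) -> bool:
    # The self-verification step the model must learn: does the
    # proposed expression actually evaluate to the target?
    return eval(expr) == target

print(verify("13 + 10 - 4 + 9", 24))      # False -> CoT should flag the failed attempt
print(verify("(13 - 9) * (10 - 4)", 24))  # True  -> "success"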
Global: how similar are the distance matrices of token embeddings across LMs? We can check with the Pearson correlation between the distance matrices: high correlation indicates similar relative orientations of token embeddings, which is what we find.
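A minimal sketch of this check, assuming two hypothetical embedding tables `E1`, `E2` whose rows correspond to the same tokens:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def geometry_similarity(E1: np.ndarray, E2: np.ndarray) -> float:
    # E1, E2: (n_tokens, d1) / (n_tokens, d2) embedding tables with rows
    # aligned to the same tokens.
    # pdist gives the condensed upper triangle of each pairwise-distance
    # matrix, which is exactly the part worth correlating.
    r, _ = pearsonr(pdist(E1), pdist(E2))
    return r

# High r => the two models place tokens in similar relative orientations.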
aclanthology.org/2023.emnlp-m...
arxiv.org/pdf/2403.07687