Lightnews — Scholar-powered news

Yucheng Sun

@yuchengsun.bsky.social

5 followers 39 following 7 posts

Currently at ETH Zurich, working on mechanistic interpretability.


Yucheng Sun

@yuchengsun.bsky.social

6/6: Thanks for the supervision, @alestolfo.bsky.social and @mrinmaya.bsky.social.

Check out our paper: arxiv.org/abs/2507.12379

Probing for Arithmetic Errors in Language Models

We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accuratel...


July 18, 2025 at 5:27 PM

Yucheng Sun

@yuchengsun.bsky.social

5/6: Finally, we use this information as a weak oracle to trigger self-correction. Re-prompting the LM based on the probe’s prediction corrects up to 11% of the model's mistakes.

July 18, 2025 at 5:25 PM
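
The gating logic described in post 5/6 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `p_correct`, the threshold, and the wording of the retry prompt are all hypothetical placeholders supplied by the caller.

```python
# Hypothetical retry instruction; the thread does not give the actual re-prompt text.
RETRY_SUFFIX = "\nYour previous answer may contain an arithmetic error. Please recompute."

def answer_with_probe_check(prompt, generate, p_correct, threshold=0.5):
    """Generate once, then re-prompt only if the probe flags a likely error.

    generate(prompt) -> str            : hypothetical LM call
    p_correct(prompt, answer) -> float : hypothetical probe score in [0, 1]
    Returns (final_answer, was_reprompted).
    """
    first = generate(prompt)
    if p_correct(prompt, first) >= threshold:
        # Probe believes the answer is correct: keep it as is.
        return first, False
    # Probe predicts an error: ask the model to try again.
    second = generate(prompt + RETRY_SUFFIX)
    return second, True
```

Because the probe is only a weak oracle, the threshold trades off unnecessary retries on correct answers against missed errors.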

Yucheng Sun

@yuchengsun.bsky.social

4/6: Can this be useful in a more realistic setting? We apply the probes trained on “pure arithmetic” queries to structured CoT traces obtained on GSM8K. The probes transfer well in a robust and consistent manner.

July 18, 2025 at 5:25 PM
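
To apply an arithmetic probe to a chain-of-thought trace as in post 4/6, one first has to locate the arithmetic steps in the text. The sketch below assumes, purely for illustration, that additions appear in the literal form `a + b = c`; the actual trace format used in the paper is not given in the thread.

```python
import re

# Assumed step format: "<int> + <int> = <int>" with optional spaces.
STEP = re.compile(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)")

def addition_steps(trace):
    """Return every addition step found in a CoT trace.

    Each entry records the operands, the result the model wrote,
    whether that result is arithmetically correct, and the character
    span of the result (where a probe could be read out).
    """
    steps = []
    for m in STEP.finditer(trace):
        a, b, stated = (int(g) for g in m.groups())
        steps.append({
            "a": a,
            "b": b,
            "stated": stated,
            "correct": a + b == stated,
            "result_span": m.span(3),
        })
    return steps
```

The `correct` flag gives a ground-truth label per step, which is what a correctness probe would be evaluated against.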

Yucheng Sun

@yuchengsun.bsky.social

3/6: Given the previous results, it should be possible to predict the correctness of the model output. We designed lightweight probes that achieve high accuracy.

July 18, 2025 at 5:24 PM
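
One common form of lightweight correctness probe is logistic regression on an activation vector. The sketch below uses that as a stand-in, since the thread does not specify the probe architecture; the data is synthetic and every name here is hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def p_correct(w, b, x):
    """Probe output: estimated probability that the model's answer is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(xs, ys, lr=0.1, epochs=200):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = p_correct(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def make_toy_data(n=200, dim=4, seed=0):
    """Synthetic stand-in for (activation, is_correct) pairs, not real LM data."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0, 1) for _ in range(dim)]
        x[0] += 2.0 if y else -2.0  # the class signal lives along one direction
        xs.append(x)
        ys.append(y)
    return xs, ys

def accuracy(w, b, xs, ys):
    hits = sum((p_correct(w, b, x) >= 0.5) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)
```

In a real setting `xs` would hold residual-stream activations and `ys` would mark whether the model's answer matched the true result.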

Yucheng Sun

@yuchengsun.bsky.social

2/6: We feed an LM arithmetic queries and we train lightweight probes (e.g., circular) on its residual stream. Interestingly, they can accurately predict the ground-truth result, regardless of the LM's correctness.

July 18, 2025 at 5:23 PM
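
A circular probe, as mentioned in post 2/6, is commonly formulated as a linear map from the hidden state to a 2-D plane in which each digit sits at a fixed angle on the unit circle. The sketch below shows that target encoding and its decoding; it assumes base-10 digits and a two-row weight matrix, and it is not taken from the paper.

```python
import math

BASE = 10  # assumption: digits 0-9 are spaced evenly around the unit circle

def digit_to_circle(d):
    """Training target: digit d maps to the point at angle 2*pi*d/BASE."""
    angle = 2 * math.pi * d / BASE
    return (math.cos(angle), math.sin(angle))

def circle_to_digit(x, y):
    """Decode a 2-D probe output to the nearest digit on the circle."""
    angle = math.atan2(y, x)  # in (-pi, pi]
    return round(angle * BASE / (2 * math.pi)) % BASE

def apply_probe(weights, hidden):
    """Linear probe: project a hidden-state vector to 2-D (weights has 2 rows)."""
    return tuple(sum(w * h for w, h in zip(row, hidden)) for row in weights)
```

Given a trained weight matrix `W` and an activation `h`, the predicted digit would be `circle_to_digit(*apply_probe(W, h))`. Decoding by angle tolerates moderate noise in the projection, because each digit owns a 36-degree sector.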

Yucheng Sun

@yuchengsun.bsky.social

Do you plan to work on AI safety/alignment in the future?

January 11, 2025 at 2:07 PM
