Yucheng Sun
yuchengsun.bsky.social
Currently at ETH Zurich. Working on mechanistic interpretability.
5/6: Finally, we use this information as a weak oracle to trigger self-correction. Re-prompting the LM based on the probe’s prediction corrects up to 11% of the model's mistakes.
July 18, 2025 at 5:25 PM
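As a rough illustration of the re-prompting idea in 5/6, here is a minimal sketch of a probe-gated retry loop. Everything named here is an assumption made for illustration: `generate`, `probe_flags_error`, the retry hint wording, and the toy stand-ins are hypothetical and are not taken from the thread. In the setting described, the probe would read the LM's residual stream rather than the answer text.

```python
from typing import Callable


def answer_with_self_correction(
    prompt: str,
    generate: Callable[[str], str],
    probe_flags_error: Callable[[str, str], bool],
    retry_hint: str = "\nYour previous answer may be wrong. Recompute carefully.",
    max_retries: int = 1,
) -> str:
    """Generate an answer, and re-prompt only when the probe flags a likely error."""
    answer = generate(prompt)
    for _ in range(max_retries):
        # The probe acts as a weak oracle: it only decides whether to retry.
        if not probe_flags_error(prompt, answer):
            break
        answer = generate(prompt + retry_hint)
    return answer


# Toy stand-ins (hypothetical): a "model" that is wrong on the first try,
# and a "probe" that happens to know the right answer is "4".
def toy_generate(prompt: str) -> str:
    return "4" if "Recompute" in prompt else "5"


def toy_probe(prompt: str, answer: str) -> bool:
    return answer != "4"


final_answer = answer_with_self_correction("2 + 2 =", toy_generate, toy_probe)
no_retry_answer = answer_with_self_correction(
    "2 + 2 =", toy_generate, toy_probe, max_retries=0
)
```

With the toy stand-ins, the first call retries once and the second call (with `max_retries=0`) keeps the initial answer, which shows that the probe only gates the retry and never supplies the answer itself.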
4/6: Can this be useful in a more realistic setting? We apply the probes trained on “pure arithmetic” queries to structured CoT traces obtained on GSM8K. The probes transfer well in a robust and consistent manner.
July 18, 2025 at 5:25 PM
3/6: Given the previous results, it should be possible to predict the correctness of the model output. We designed lightweight probes that achieve high accuracy.
July 18, 2025 at 5:24 PM
2/6: We feed an LM arithmetic queries and train lightweight probes (e.g., circular) on its residual stream. Interestingly, they can accurately predict the ground-truth result, regardless of the LM's correctness.
July 18, 2025 at 5:23 PM
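To make the "circular probe" in 2/6 concrete, here is a minimal sketch under stated assumptions. It uses synthetic vectors in place of real residual-stream activations, assumes the target is a single base-10 digit, and fits a linear map to the (cos, sin) of the digit's angle by least squares. The function names, dimensions, and noise level are all hypothetical and may differ from the probes used in the thread.

```python
import numpy as np


def fit_circular_probe(hidden: np.ndarray, digits: np.ndarray, base: int = 10) -> np.ndarray:
    """Least-squares fit of a linear map from hidden states to (cos, sin) of the digit angle."""
    angles = 2 * np.pi * digits / base
    targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Append a constant column so the probe can learn a bias term.
    features = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


def predict_digit(weights: np.ndarray, hidden: np.ndarray, base: int = 10) -> np.ndarray:
    """Project onto the probe's 2D plane and decode the angle back to a digit."""
    features = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    cos_sin = features @ weights
    angles = np.arctan2(cos_sin[:, 1], cos_sin[:, 0])
    return np.rint(angles * base / (2 * np.pi)).astype(int) % base


# Synthetic stand-in for residual-stream activations (assumption): each digit
# lies on a circle that is linearly embedded in 64 dimensions, plus noise.
rng = np.random.default_rng(0)
digits = rng.integers(0, 10, size=2000)
angles = 2 * np.pi * digits / 10
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
mixing = rng.normal(size=(2, 64))
hidden = circle @ mixing + 0.1 * rng.normal(size=(2000, 64))

# Train on the first 1500 examples, evaluate on the remaining 500.
weights = fit_circular_probe(hidden[:1500], digits[:1500])
accuracy = (predict_digit(weights, hidden[1500:]) == digits[1500:]).mean()
print(f"held-out probe accuracy: {accuracy:.3f}")
```

The probe stays lightweight because it only learns a linear projection onto a 2D plane. Decoding reduces to reading off an angle, which suits targets with modular structure such as digits.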
Do you plan to work on AI safety/alignment in the future?
January 11, 2025 at 2:07 PM