Lightnews — Scholar-powered news

Clément Dumas

@butanium.bsky.social

Very cool analysis by Arnab which cover the mechanisms used for retrieval both when your query is before or after the text!

Arnab Sen Sharma @arnabsensharma.bsky.social · 13d

How can a language model find the veggies in a menu?

New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.

Spoiler: turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from python)! 🧵

November 5, 2025 at 1:38 PM

Clément Dumas

@butanium.bsky.social

A very important paper led by Julian!
Tldr: we show that your Narrow Finetuning is showing and might not be a realistic setup to study!

"Your Narrow Finetuning is showing", image of robots (representing LLMs) with signs disclosing their finetuning objectives

October 20, 2025 at 3:20 PM

Clément Dumas

@butanium.bsky.social

To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*

Check our blogpost out! 🧵

Julian Minder @jkminder.bsky.social · Sep 5

Can we interpret what happens in finetuning? Yes, if for a narrow domain! Narrow fine tuning leaves traces behind. By comparing activations before and after fine-tuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more

September 5, 2025 at 7:23 PM

Reposted by Clément Dumas

John David Pressman

@jdp.extropian.net

GPT is being asked to be both one mind and to also segment its understanding into many different minds, this incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts so it doesn't know too much, to know self vs. other in minute detail.

August 29, 2025 at 1:59 AM

Reposted by Clément Dumas

David Bau

@davidbau.bsky.social

This Friday NEMI 2025 is at Northeastern in Boston, 8 talks, 24 roundtables, 90 posters; 200+ attendees. Thanks to
goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/

If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...

New England Mechanistic Interpretability Workshop

About:The New England Mechanistic Interpretability (NEMI) workshop aims to bring together academic and industry researchers from the New England and surround...

www.youtube.com

August 18, 2025 at 6:06 PM

Reposted by Clément Dumas

NDIF Team

@ndif-team.bsky.social

Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social

🔎 Demo on Colab: colab.research.google.com/github/ndif-...

📖 Read the full manuscript: arxiv.org/abs/2505.13898

Google Colab

colab.research.google.com

July 4, 2025 at 12:27 AM

Reposted by Clément Dumas

Julian Minder

@jkminder.bsky.social

With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.

June 30, 2025 at 9:02 PM

Clément Dumas

@butanium.bsky.social

Our mech interp ICML workshop paper got accepted to ACL 2025 main! 🎉
In this updated version, we extended our results to several models and showed they can actually generate good definitions of mean concept representations across languages.🧵

Clément Dumas on X: "Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️ https://t.co/dDS5iv199i" / X

Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️ https://t.co/dDS5iv199i

x.com

June 29, 2025 at 11:07 PM

Reposted by Clément Dumas

Geoffrey Irving

@girving.bsky.social

New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.

The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, . . . , y_q , and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns
probabilities to subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B’s probabilities, B wins if neither player can locate a flaw.

June 17, 2025 at 4:52 PM

Clément Dumas

@butanium.bsky.social

We'll be presenting at the #ICLR sparsity in LLMs workshop today (Sunday 27th) at 4:30 pm in Hall 4 #7!

Clément Dumas @butanium.bsky.social · Apr 7

New paper w/@jkminder.bsky.social & @neelnanda.bsky.social
What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features!🧵

April 26, 2025 at 8:02 PM

Clément Dumas

@butanium.bsky.social

Want to explore cool chat related crosscoder latents?
With @jkminder.bsky.social, we made a demo that supports both loading our max activating examples AND running the crosscoder with your own prompt to collect the activations of specific latents!
Send us the cool latents you find! dub.sh/ccdm

April 9, 2025 at 10:48 PM

Reposted by Clément Dumas

Julian Minder

@jkminder.bsky.social

In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge boundary, a detailed info and a humor/ joke detection latent.

April 7, 2025 at 5:56 PM

Clément Dumas

@butanium.bsky.social

New paper w/@jkminder.bsky.social & @neelnanda.bsky.social
What do chat LLMs learn in finetuning?

Anthropic introduced a tool for this: crosscoders, an SAE variant. We find key limitations of crosscoders & fix them with BatchTopK crosscoders

This finds interpretable and causal chat-only features!🧵

April 7, 2025 at 4:21 PM

Clément Dumas

@butanium.bsky.social

Very cool work that introduces "concept heads" that copy meanings from one token to another.
I love that our activation-based analysis of multilingual representation has now additional insight from weight space analysis: bsky.app/profile/sfeu...

Sheridan Feucht @sfeucht.bsky.social · Apr 7

[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.

April 7, 2025 at 3:26 PM

Reposted by Clément Dumas

Alex Turner

@turntrout.bsky.social

Want to get into alignment research? Alex Cloud & I mentor *Team Shard*, responsible for gradient routing, steering vectors, MELBO, and a new unlearning technique (TBA) :) We discover new research subfields.

Apply for mentorship this summer at forms.matsprogram.org/turner-app-8

March 20, 2025 at 4:14 PM

Reposted by Clément Dumas

Andrew Lee

@ajyl.bsky.social

Excited about recent reasoning models? What is happening under the hood?
Join ARBOR: Analysis of Reasoning Behaviors thru *Open Research* - a radically open collaboration to reverse-engineer reasoning models!
Learn more: arborproject.github.io
1/N

ARBOR

arborproject.github.io

February 20, 2025 at 7:55 PM

Reposted by Clément Dumas

neelnanda.bsky.social

@neelnanda.bsky.social

As part of opening my new round of MATS applications, I took this as an excuse to write up which research discussions I'm currently just excited about, and recent updates I've made - I thought this might be of more general interest!
x.com/NeelNanda5/...

February 8, 2025 at 1:57 PM

Reposted by Clément Dumas

neelnanda.bsky.social

@neelnanda.bsky.social

LLM agents will be a very big deal, bringing many weird new forms of reward hacking and other subtle failures

This great GDM safety paper shows that myopically optimising for plans that an overseer approves of, rather than outcomes, reduces these issues while performing well!
x.com/davlindner/...

January 23, 2025 at 4:36 PM

Reposted by Clément Dumas

NDIF Team

@ndif-team.bsky.social

Do you have a great experiment that you want to run on Llama 405b but not enough GPUs?

🚨 #NDIF is opening up more spots in our 405b pilot program! Apply now for a chance to conduct your own groundbreaking experiments on the 405b model. Details: 🧵⬇️

December 9, 2024 at 8:04 PM

Reposted by Clément Dumas

Julian Minder

@jkminder.bsky.social

Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about a 1D knob! 🎛️
arxiv.org/abs/2411.07404

Co-led with
@kevdududu.bsky.social - @niklasstoehr.bsky.social , Giovanni Monea, @wendlerc.bsky.social, Robert West & Ryan Cotterell.

November 22, 2024 at 3:49 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news