Takeaway: fairness interventions must be mechanism-aware and task-specific.
With the right causal targets, we can do surgical debiasing while preserving general capabilities.
📃 Paper: arxiv.org/pdf/2512.20796
🙏 Amazing advisor: Aaron Mueller @amuuueller.bsky.social
Mechanistically, bias often doesn’t live in explicit demographic tokens. Instead, it hides in contextual proxies like formality, technical language, and “competence” cues. This explains why direct ablation methods often fail.
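For intuition, here’s a minimal sketch (not the paper’s code) of what direct ablation usually means: projecting a single demographic direction out of a hidden state. The toy setup below assumes the biased signal mostly rides on a proxy direction, so the projection leaves it nearly intact.

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Linear erasure: remove the component of `hidden` along one direction."""
    d = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ d, d)

rng = np.random.default_rng(0)
demo_dir = rng.normal(size=64)    # explicit demographic direction (toy)
proxy_dir = rng.normal(size=64)   # formality / "competence" proxy direction (toy)

# The hidden state carries the bias mostly through the proxy, not the explicit direction.
hidden = 2.0 * proxy_dir + 0.1 * demo_dir
cleaned = ablate_direction(hidden[None, :], demo_dir)[0]

# Ablating the explicit direction barely touches the proxy component.
print(cleaned @ proxy_dir / np.linalg.norm(proxy_dir))
```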
We find that race, gender, and education shortcuts rely on different internal mechanisms, so no single debiasing method works universally.
In other words, there is no one-size-fits-all debiasing method!
We compare attribution-based (“output” features) and correlation-based (“input” features) steering in LLMs. This follows the input/output distinction of @danaarad.bsky.social and @boknilev.bsky.social: some representations detect concepts in inputs, while others predict concepts in outputs.
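Schematically (toy tensors, not the paper’s pipeline), the two selection criteria differ only in what you rank features by: correlation with the attribute in the input vs. attribution to the attribute in the output.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((500, 1024))         # toy SAE feature activations (examples x features)
labels = rng.integers(0, 2, size=500)  # is the attribute signaled in the input?
grads = rng.random((500, 1024))        # toy d(demographic logit)/d(feature activation)

# "Input" features: activations that correlate with the attribute being present in the input.
corr = np.array([np.corrcoef(acts[:, j], labels)[0, 1] for j in range(acts.shape[1])])
input_feats = np.argsort(-np.abs(corr))[:20]

# "Output" features: largest attribution (activation x gradient) toward the demographic logit.
attrib = (acts * grads).mean(axis=0)
output_feats = np.argsort(-np.abs(attrib))[:20]

# Steering then clamps or ablates the selected features at inference time.
```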
We study how models use demographic information when that information is:
• causally relevant (name → demographic),
• irrelevant (profession → demographic), or
• partially relevant (profession → education).
This lets us separate legitimate recognition from stereotyping.
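As a concrete (hypothetical) illustration of the three settings, prompts might be templated like this; the exact wording is ours, not the paper’s:

```python
# Hypothetical prompt templates for the three conditions.
TEMPLATES = {
    "causally relevant (name -> demographic)":
        "Is the name {name} more commonly a man's or a woman's name?",
    "irrelevant (profession -> demographic)":
        "Is a {profession} more likely to be a man or a woman?",
    "partially relevant (profession -> education)":
        "Is a {profession} more likely to hold a college degree or a high-school diploma?",
}

print(TEMPLATES["causally relevant (name -> demographic)"].format(name="Jack"))
print(TEMPLATES["irrelevant (profession -> demographic)"].format(profession="engineer"))
```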
We study implicit biases via a word association task: the model assigns demographic labels to names or professions (e.g., “engineer → ?”, “Jack → ?”).
Inspired by prior work on implicit associations in LLMs (e.g., Xuechunzi Bai et al., 2025).
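One simple way to score such associations (a minimal sketch with Hugging Face transformers; gpt2 and the label wording are placeholders, not the paper’s setup) is to compare the model’s next-token probabilities for the candidate labels:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def association_probs(prompt, labels):
    """Next-token probability of each label's first token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    return {lab: probs[tok.encode(" " + lab)[0]].item() for lab in labels}

print(association_probs("Complete the association: engineer ->", ["man", "woman"]))
print(association_probs("Complete the association: Jack ->", ["Black", "White"]))
```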