https://itay1itzhak.github.io/
@boknilev @GabiStanovsky!
Preprint: arxiv.org/abs/2507.07186
Webpage: itay1itzhak.github.io/planted-in-...
We’d love your thoughts, critiques, and ideas 📬
Let’s talk about building more interpretable and trustworthy LLMs!
#NLProc #Bias #CognitiveAI
Cognitive biases are not introduced during instruction tuning.
They’re planted in pretraining and only surfaced by finetuning.
If we want fairer models, we need to look deeper into the pretraining pipeline.
We swap instruction datasets between models with different pretraining.
Result: Biases follow the pretrained model!
PCA clearly shows models group by pretraining base, not by instruction.
The bias “signature” stays intact, no matter the finetuning!
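A minimal sketch of the kind of check this implies, assuming a matrix of per-model bias scores (all values, base/instruction labels, and shapes below are made up for illustration):

```python
# Sketch: project per-model bias-score vectors with PCA and check whether
# models cluster by pretraining base or by instruction data.
# bias_scores and the label lists are illustrative placeholders.
import numpy as np
from sklearn.decomposition import PCA

# rows = finetuned models, columns = bias benchmarks (hypothetical values)
bias_scores = np.array([
    [0.62, 0.10, 0.45, 0.30],   # base A + instruction set X
    [0.60, 0.12, 0.47, 0.28],   # base A + instruction set Y
    [0.21, 0.55, 0.15, 0.70],   # base B + instruction set X
    [0.19, 0.58, 0.13, 0.72],   # base B + instruction set Y
])
pretraining_base = ["A", "A", "B", "B"]
instruction_data = ["X", "Y", "X", "Y"]

# Project to 2D; if the bias "signature" is planted in pretraining,
# points sharing a base should land close together regardless of instructions.
coords = PCA(n_components=2).fit_transform(bias_scores)
for (x, y), base, inst in zip(coords, pretraining_base, instruction_data):
    print(f"base={base} inst={inst} -> PC1={x:+.2f} PC2={y:+.2f}")
```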
We finetune the same model 3× with different seeds.
Result: Bias scores vary somewhat across seeds, but the overall behavior pattern stays stable relative to the seed variance on MMLU.
✅ Aggregating across seeds reveals consistent trends.
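A rough sketch of that aggregation, assuming one bias-score vector and one MMLU accuracy per finetuning seed (all numbers are placeholders, not results from the paper):

```python
# Sketch: average bias scores over finetuning seeds and compare the
# seed-to-seed spread against the MMLU spread. Values are made up.
import numpy as np

# shape: (n_seeds, n_biases) -- one finetuned model per random seed
bias_by_seed = np.array([
    [0.61, 0.12, 0.44],
    [0.58, 0.15, 0.47],
    [0.63, 0.11, 0.43],
])
mmlu_by_seed = np.array([0.46, 0.49, 0.44])

bias_mean = bias_by_seed.mean(axis=0)   # aggregate across seeds
bias_std = bias_by_seed.std(axis=0)     # seed-induced variation per bias
print("mean bias per benchmark:", bias_mean.round(3))
print("std  bias per benchmark:", bias_std.round(3))
print("MMLU std across seeds:  ", mmlu_by_seed.std().round(3))
```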
We disentangle three possible sources:
- Pretraining
- Instruction tuning
- Training randomness
🍁 Bottom line: pretraining is the origin of bias. Finetuning? Just the messenger.
#CausalInference #TrustworthyAI #NLP
(Similar to how Llama 3 pretraining used quality scores from Llama 2 and RoBERTa.)