Itay Itzhak @ COLM 🍁
@itay-itzhak.bsky.social
NLProc, deep learning, and machine learning. Ph.D. student @ Technion and The Hebrew University.
https://itay1itzhak.github.io/
Huge thanks to my co-authors
@boknilev @GabiStanovsky!
Preprint: arxiv.org/abs/2507.07186
Webpage: itay1itzhak.github.io/planted-in-...
We’d love your thoughts, critiques, and ideas 📬
Let’s talk about building more interpretable and trustworthy LLMs!
#NLProc #Bias #CognitiveAI
Planted in Pretraining, Swayed by Finetuning: A Case Study on the...
Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across...
arxiv.org
July 15, 2025 at 1:38 PM
🧠 Takeaway:
Cognitive biases are not introduced during instruction tuning.
They’re planted in pretraining and only surfaced by finetuning.
If we want fairer models, we need to look deeper into the pretraining pipeline.
July 15, 2025 at 1:38 PM
🔄 Step 2: Cross-tuning.
We swap instruction datasets between models with different pretraining.
Result: Biases follow the pretrained model!

PCA clearly shows models cluster by pretraining base, not by instruction dataset.
The bias “signature” stays intact, no matter the finetuning!
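Roughly what the clustering check looks like in code. A toy sketch: the bias vectors, base labels, and numbers below are made up for illustration, not our actual results.

```python
import numpy as np
from sklearn.decomposition import PCA

# Each row is one finetuned model's vector of bias scores
# (toy numbers; real vectors would come from bias benchmarks).
X = np.array([
    [0.62, 0.41, 0.55, 0.30],  # base A + instruction set 1
    [0.60, 0.43, 0.54, 0.32],  # base A + instruction set 2
    [0.21, 0.75, 0.33, 0.61],  # base B + instruction set 1
    [0.23, 0.73, 0.35, 0.60],  # base B + instruction set 2
])
bases = ["A", "A", "B", "B"]

# Project to 2D: if biases originate in pretraining, points cluster
# by base model rather than by instruction dataset.
coords = PCA(n_components=2).fit_transform(X)
for base, (x, y) in zip(bases, coords):
    print(f"base {base}: PC1={x:+.2f}, PC2={y:+.2f}")
```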
July 15, 2025 at 1:38 PM
🎲 Step 1: Training randomness.
We finetune the same model 3× with different seeds.
Result: Bias scores vary somewhat across seeds, but the bias behavior patterns stay stable relative to the seed-to-seed variance in MMLU accuracy.
✅ Aggregating across seeds reveals consistent trends.
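In code, the aggregation step is simple. A minimal sketch with made-up scores (the real runs use the full evaluation suite):

```python
import numpy as np

# Bias scores per finetuning seed: rows = seeds, cols = bias benchmarks.
# Toy values for illustration only.
bias_scores = np.array([
    [0.62, 0.41, 0.55],  # seed 0
    [0.59, 0.44, 0.53],  # seed 1
    [0.64, 0.40, 0.57],  # seed 2
])
mmlu_acc = np.array([0.71, 0.65, 0.69])  # same three seeds

# The mean across seeds exposes the stable bias pattern;
# the std quantifies how much seed randomness moves each score.
print("mean bias per benchmark:", bias_scores.mean(axis=0))
print("bias std across seeds:  ", bias_scores.std(axis=0))
print("MMLU std across seeds:  ", mmlu_acc.std())
```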
July 15, 2025 at 1:38 PM
🧪 We introduce a two-step causal framework to disentangle the effects of:
- Pretraining
- Instruction tuning
- Training randomness

🍁 Bottom line: pretraining is the origin of bias. Finetuning? Just the messenger.
#CausalInference #TrustworthyAI #NLP
July 15, 2025 at 1:38 PM
Super interesting! Have you tested how LAP handles more diverse paraphrasing? For example, do you think it would also work for code functions with similar roles?
March 5, 2025 at 3:52 PM
Why not try the straightforward approach: label high-quality texts and train an LM to classify them? Of course this should be done separately for different types of texts - a great scientific paper ≠ a great novel.
(Similar to how Llama 3 pretraining used quality scores from Llama 2 and RoBERTa.)
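A toy sketch of what I mean, with TF-IDF + logistic regression standing in for a finetuned LM classifier (texts and labels are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus: 1 = high quality, 0 = low quality (illustrative).
texts = [
    "We present a rigorous analysis of convergence rates.",
    "click here 4 FREE stuff!!! best deals now",
    "The experiment controls for confounders via random assignment.",
    "lol idk just some random words words words",
]
labels = [1, 0, 1, 0]

# In practice you'd finetune an LM per text type (paper vs. novel);
# TF-IDF + logistic regression is the minimal stand-in here.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict_proba(["A novel method for efficient pretraining."]))
```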
December 10, 2024 at 9:32 AM