https://mainuliitkgp.github.io/
📊 CoT maintains CK alignment similar to standard prompting across all datasets, while also reducing PK alignment.
📊 Across the sequence steps, the gap between PK and CK is much larger for examples with hallucinated spans than for examples without them.
📊 Throughout most of the NLE generation, the model slightly prioritizes PK.
📊 Different knowledge interactions are poorly captured by a rank-1 projection subspace of the LLM's parameters.
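To see why a single direction can be too coarse, here is a minimal sketch (not the paper's implementation; all names are assumptions) of projecting a vector onto a rank-1 subspace: whatever lies orthogonal to the chosen direction is lost, so interactions spanning multiple directions cannot be captured.

```python
import numpy as np

def rank1_project(x, v):
    """Project x onto the rank-1 subspace spanned by direction v."""
    v = v / np.linalg.norm(v)
    return np.outer(v, v) @ x  # component of x along v

rng = np.random.default_rng(0)
x = rng.normal(size=8)   # a toy activation / parameter vector
v = rng.normal(size=8)   # a single (hypothetical) "knowledge" direction

proj = rank1_project(x, v)
residual = x - proj

# The residual is orthogonal to v: everything outside the single
# direction is discarded by the rank-1 projection.
print(abs(residual @ (v / np.linalg.norm(v))) < 1e-10)
```

The same limitation applies in higher dimensions: a rank-1 projector has one degree of freedom, so multiple distinct knowledge interactions generally cannot all live in its range.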
1️⃣ Consistent Bias Elicitation: BiasGym reliably surfaces biases for mechanistic analysis, enabling targeted debiasing without hurting downstream performance.
2️⃣ Strong Generalization: Debiasing generalizes to biases unseen during token-based fine-tuning.
3️⃣ Real & Fictional Bias Mitigation: Reduces both real-world stereotypes (e.g., “Italians are reckless drivers”) and fictional associations (e.g., “citizens of a fictional country have blue skin”), making it useful for both safety and interpretability research.
📄 Read the paper: arxiv.org/abs/2508.08855