Hiba Ahsan
hibaahsan.bsky.social
PhD student @ Northeastern University, Clinical NLP
https://hibaahsan.github.io/
she/her
Next, we test this in more realistic clinical tasks where the model must reason over clinical notes or scenarios. We find that the effect of ablating race latents is minimal; anti-bias prompting is more effective.
November 5, 2025 at 3:20 PM
Can such latents be used to mitigate bias? We first test this in a controlled setting: we sample patient vignettes for conditions for which the LLM exaggerates racial association. Ablating the latent reduces the proportion of Black patients in the sampled vignettes. This works better than an anti-bias prompt.
November 5, 2025 at 3:20 PM
Does the “Black” latent have a causal effect on outputs? We steer with the latent and find that increasing the “Black” latent activation increases the predicted risk of patient belligerence. None of the reasoning chains indicate this reliance on race.
November 5, 2025 at 3:20 PM
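A minimal sketch of what steering and ablating a single SAE latent can look like, on a toy activation vector in plain Python. It is not the setup from the thread: the function names, the ReLU encoder, and the toy weights (`w_enc`, `b_enc`, `w_dec`) are all assumptions for illustration, and a real experiment would apply these edits to an LLM's residual stream using trained SAE weights.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def latent_activation(x: Vector, w_enc: Vector, b_enc: float) -> float:
    # Assumed ReLU encoder for one latent: a = max(0, w_enc . x + b_enc)
    return max(0.0, dot(x, w_enc) + b_enc)


def ablate_latent(x: Vector, w_enc: Vector, b_enc: float, w_dec: Vector) -> Vector:
    # Remove this latent's contribution: x' = x - a * w_dec
    a = latent_activation(x, w_enc, b_enc)
    return [xi - a * di for xi, di in zip(x, w_dec)]


def steer_latent(x: Vector, w_dec: Vector, alpha: float) -> Vector:
    # Push the activation along the latent's decoder direction: x' = x + alpha * w_dec
    return [xi + alpha * di for xi, di in zip(x, w_dec)]


# Hypothetical toy values: the latent reads and writes the second coordinate.
x = [1.0, 2.0, 0.0]
w_enc = [0.0, 1.0, 0.0]
w_dec = [0.0, 1.0, 0.0]

ablated = ablate_latent(x, w_enc, 0.0, w_dec)
steered = steer_latent(x, w_dec, alpha=3.0)
```

Ablation subtracts whatever the latent currently contributes, while steering adds a chosen multiple of its direction regardless of the current activation.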
We focus on Black patients and use Gemma Scope SAEs. We find a latent that correlates with Black individuals. In clinical notes, the latent activates on “African-American” and on conditions prevalent in the Black population. But it also activates on stigmatizing concepts such as “incarceration” and “gunshot”.
November 5, 2025 at 3:20 PM
4. Finally, we look at how such interventions can be used to detect implicit biases in clinical tasks. We mechanistically control gender/race and find that Olmo considers females to be at higher risk of depression than males, and Black patients to be at higher risk than white patients.
February 22, 2025 at 4:18 AM
3. Race is more complicated. We find multiple patches and are able to intervene to a degree.
February 22, 2025 at 4:18 AM
2. These patches generalize to non-clinical domains!
February 22, 2025 at 4:18 AM
1. We perform activation patching in the context of clinical vignette generation and find that gender information is highly localized. Patching MLP activations in a single layer consistently alters patient gender.
February 22, 2025 at 4:18 AM
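For readers new to activation patching, here is a minimal sketch on a toy two-layer MLP in plain Python. It only illustrates the general technique: the `forward` helper, the weights, and the inputs are hypothetical, whereas the thread patches MLP activations inside an LLM.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, v)) for row in w]


def forward(
    x: Vector,
    layers: List[Matrix],
    patch: Optional[Tuple[int, Vector]] = None,
) -> Tuple[Vector, List[Vector]]:
    """Run the toy network; `patch=(i, act)` overwrites layer i's output with `act`."""
    cache: List[Vector] = []
    h = x
    for i, w in enumerate(layers):
        h = relu(matvec(w, h))
        if patch is not None and patch[0] == i:
            h = list(patch[1])  # swap in the activation cached from another run
        cache.append(h)
    return h, cache


# Hypothetical toy weights and inputs.
layers = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
source_input = [1.0, 0.0]
target_input = [0.0, 1.0]

source_out, source_cache = forward(source_input, layers)
target_out, _ = forward(target_input, layers)

# Patch layer 0 of the target run with the activation from the source run.
patched_out, _ = forward(target_input, layers, patch=(0, source_cache[0]))
```

If patching a single layer moves the target run's output to the source run's output, as it does in this toy case, that layer carries the information that distinguishes the two inputs.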