Zara Siddique
@zarasiddique.bsky.social
Working on ethics and bias in NLP @CardiffNLP #NLP #NLProc
Shoutout to my supervisors Liam Turner and Luis Espinosa-Anke and @cardiffnlp.bsky.social. I'm also interested in future collaborations on the topic, so please message me if you are interested :)
May 14, 2025 at 10:29 AM
I highly encourage people to play around; you can get started in just a few lines. Here's a Colab notebook:
tinyurl.com/yysmb45c
Note that the results from this Colab won't be the best, because it uses a smaller model to reduce loading times. I would recommend using at least a 7B model.
Dialz Tutorial - Zara Siddique - KnitTogether 2025.ipynb
May 14, 2025 at 10:29 AM
As part of our validation, we see if we can reduce stereotypicality in outputs from Mistral 7B, using GPT-4o as a judge. There is a notable reduction compared to baselines and prompting, which is cool.
May 14, 2025 at 10:29 AM
For those who are new to the topic: steering vectors are constructed from a set of paired sentences, where one sentence elicits a 'positive' activation pattern in the model's neurons and the other a 'negative' one. By taking the difference between the two, we isolate the activations responsible for a particular 'concept'.
May 14, 2025 at 10:29 AM
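The mean-difference construction described in the post can be sketched in a few lines of NumPy. This is a toy illustration, not the Dialz implementation; the function name, shapes, and toy data are assumptions for the example:

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector from contrastive pairs.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_dim) holding
    the hidden-layer activations for the 'positive' and 'negative'
    sentence of each pair. The per-pair differences are averaged so
    that pair-specific noise cancels and the shared 'concept'
    direction remains.
    """
    return (np.asarray(pos_acts) - np.asarray(neg_acts)).mean(axis=0)

# Toy example: 3 pairs of 4-dimensional activations, where the
# 'concept' mainly shifts the first dimension.
rng = np.random.default_rng(0)
neg = rng.normal(size=(3, 4))
pos = neg + np.array([1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(3, 4))
v = steering_vector(pos, neg)  # points mostly along dimension 0
```

In practice the activations come from a chosen transformer layer, and the resulting vector is added to (or subtracted from) that layer's activations at inference time to push outputs toward or away from the concept.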
Super interesting!
April 3, 2025 at 1:56 PM
Do it! When interviewers ask me about them it’s usually a good sign that it’s a nice workplace.
March 25, 2025 at 6:14 PM
The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that SVE is a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
March 13, 2025 at 11:44 AM
Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender.
March 13, 2025 at 11:44 AM
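The ensembling step itself is simple: an element-wise mean over the per-axis vectors. A minimal sketch with NumPy, using illustrative axis names and toy vectors rather than the paper's optimized ones:

```python
import numpy as np

# Hypothetical per-axis steering vectors, each optimized independently
# on its own contrastive-pair dataset (values here are made up).
axis_vectors = {
    "age":    np.array([0.8, -0.1,  0.3]),
    "race":   np.array([0.5,  0.4, -0.2]),
    "gender": np.array([0.2,  0.6,  0.1]),
}

# The Steering Vector Ensemble is the element-wise mean of the
# individual vectors; it is then applied exactly like a single
# steering vector.
sve = np.mean(list(axis_vectors.values()), axis=0)
```

Averaging keeps the cost of applying the ensemble identical to applying one vector, since the combination happens once, ahead of inference.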
When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.2%, 4.7%, and 3.2% over the baseline for Mistral, Llama, and Qwen, respectively.
March 13, 2025 at 11:44 AM
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes.
March 13, 2025 at 11:44 AM
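Modifying activations in the forward pass can be sketched with a PyTorch forward hook. This is a minimal illustration under assumed shapes (a linear layer standing in for a transformer block), not the paper's or Dialz's implementation; real code must also handle tuple outputs and token positions:

```python
import torch

def add_steering_hook(layer, vector, strength=1.0):
    """Register a forward hook that adds strength * vector to the
    layer's output activations on every forward pass."""
    v = torch.as_tensor(vector, dtype=torch.float32)

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the output.
        return output + strength * v

    return layer.register_forward_hook(hook)

# Toy demo: with zero input and no bias, the unsteered output is all
# zeros, so the steered output is exactly strength * vector.
layer = torch.nn.Linear(4, 4, bias=False)
handle = add_steering_hook(layer, torch.ones(4), strength=0.5)
steered = layer(torch.zeros(1, 4))
handle.remove()  # detach the hook when done steering
```

Because the intervention is a single vector addition per layer, it adds essentially no inference cost, which is where the computational efficiency of the approach comes from.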
Super interesting work!
March 3, 2025 at 9:56 PM