Reducing Bias in LLMs


Tags: NLP, LLM, Transformers
Published: September 19, 2024
Author: Philip Redford

Large Language Models: A Double-Edged Sword

Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP). These powerful models, trained on massive datasets, can perform a wide range of tasks, from generating human-quality text to translating languages. However, a growing concern is that these models can inadvertently learn and perpetuate biases present in the data they are trained on. This can lead to harmful stereotypes and discrimination, especially for marginalised groups.
In this research, we explore a method to mitigate bias in LLMs by fine-tuning them on carefully selected texts. We also delve into the limitations of existing bias tests and propose directions for future research to improve bias detection and mitigation.
LLMs are trained on a vast corpus of text and learn to attribute ‘meaning’ to a given word from the words that surround it. If the words ‘cat’ and ‘kitten’ occur in the same passage of text with high frequency, the model is likely to learn that these words are in some way semantically close. While helpful in this context, this can be problematic in others. Specifically, LLMs can learn semantic closeness between words where no semantic closeness should exist. If an LLM is trained exclusively on stories of cats that were aggressive and attacked their owners, it may learn an association between cats and aggression that many people would think unfair. In this way, LLMs inherit the biases present in the texts on which they are trained.
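To make this concrete, here is a minimal sketch of how distributional similarity can be inspected directly in a model's learned representations. It uses the Hugging Face Transformers library and BERT's input embeddings; the word pairs are purely illustrative, and averaging sub-word embeddings is a simplification rather than the method used in our experiments.

```python
# A minimal sketch of inspecting distributional similarity in BERT's
# input embeddings. Word choices are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
embeddings = model.get_input_embeddings().weight  # (vocab_size, hidden_size)

def word_vector(word: str) -> torch.Tensor:
    # Average sub-token embeddings so words split into several pieces still work.
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    return embeddings[ids].mean(dim=0)

def similarity(a: str, b: str) -> float:
    return torch.cosine_similarity(word_vector(a), word_vector(b), dim=0).item()

print(similarity("cat", "kitten"))       # expected to be relatively high
print(similarity("cat", "spreadsheet"))  # expected to be lower
```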
These biased associations can have serious consequences. Language models are now a cornerstone of many AI applications. When they're biased, so are the systems they power. Consider these real-world examples:
  • Biased Translations: Google Translate has been shown to perpetuate gender stereotypes in translations.
  • Discriminatory Hiring: An Amazon hiring tool favoured male candidates over equally qualified female applicants.
  • Hateful AI: A Korean chatbot generated hateful speech targeting LGBTQ+ individuals.
Language has always been a carrier of bias. From ancient texts to modern-day social media, language reflects and reinforces societal norms, often unfairly. Even powerful language models like LLMs can fall victim to these biases.
LLMs learn from the data they're trained on. Unfortunately, this data often reflects societal biases, such as gender stereotypes and racial prejudices. For instance, datasets derived from Wikipedia and Reddit tend to be dominated by certain demographics, leading to models that are biased towards those groups. Gender imbalance is a common issue in many datasets. It can lead to models that are more likely to associate certain professions or roles with specific genders. Additionally, the way datasets are curated can inadvertently exclude marginalised voices and perpetuate harmful stereotypes. For example, filtering out certain types of language, even if it's offensive, can silence the experiences of marginalised groups.
As LLMs grow more powerful and influential, the biases they harbour become increasingly problematic. It's crucial for AI researchers and engineers to develop tools that not only avoid perpetuating harmful biases but actively challenge them. To address this challenge, we set out two primary goals:
  1. Developing a Bias Mitigation Framework: We propose a method to reduce bias in LLMs by fine-tuning them on carefully curated datasets. By exposing the model to more diverse and representative data, we aim to challenge and correct existing biases.
  2. Evaluating Bias Mitigation: We'll apply our framework to BERT, a popular language model, and assess the impact of our approach using two established bias benchmarks: StereoSet and WinoBias.

Background

While there's been a surge in efforts to mitigate bias in LLMs, the focus has often been narrowly defined, primarily centering on gender bias. This limited perspective overlooks a broader spectrum of biases that can be equally harmful. The ease of quantifying and addressing gender bias might explain this trend. However, relying solely on legally protected attributes, such as gender and race, is insufficient. Many other characteristics, like socioeconomic status and geographic location, can also be sources of bias. For instance, class-based discrimination, a pervasive issue in society, is frequently overlooked in LLM bias research.
One approach to mitigating bias involves creating counterfactual examples. For instance, adding sentences like "She is a doctor" to balance out "He is a doctor" can help reduce gender bias during fine-tuning. However, this technique is limited by its reliance on retraining and its focus on specific, often easily identifiable biases.
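As a rough illustration of the idea, the sketch below generates gender-swapped counterfactuals with a simple word-substitution map. The swap list and helper function are hypothetical and far from exhaustive; real counterfactual augmentation handles grammar, names, and possessives with much more care.

```python
# A minimal sketch of counterfactual data augmentation: for each sentence,
# emit a gender-swapped copy so both variants appear during fine-tuning.
# The swap list is illustrative and deliberately tiny.
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "man": "woman", "woman": "man"}

def counterfactual(sentence: str) -> str:
    swapped = []
    for tok in sentence.split():
        core = tok.strip(".,!?").lower()
        if core in SWAPS:
            tok = tok.lower().replace(core, SWAPS[core])  # capitalisation ignored for brevity
        swapped.append(tok)
    return " ".join(swapped)

corpus = ["He is a doctor.", "She stayed home with the children."]
augmented = corpus + [counterfactual(s) for s in corpus]
# ['He is a doctor.', 'She stayed home with the children.',
#  'she is a doctor.', 'he stayed home with the children.']
```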
More innovative approaches are emerging. One promising method employs reinforcement learning to reduce political bias. By rewarding models for generating unbiased text and penalising biased output, researchers are exploring new ways to address the complex issue of bias in LLMs.
Researchers have employed word analogy tasks to uncover biases embedded within word embeddings. For instance, studies have shown that words like "he" and "man" are more frequently associated with high-status professions like "doctor." To further explore these biases, new testing methods have been developed. These tests involve creating sentence templates with masked words related to stereotyped gender roles. By analysing the model's predictions for these masked words, researchers have identified significant gender biases in LLMs.
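The sketch below shows the flavour of such masked-template probes using the Transformers fill-mask pipeline. The templates and profession list are illustrative examples, not the exact test sets used in the literature.

```python
# A small sketch of probing gender associations with masked templates.
# Templates and professions are illustrative examples.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for profession in ["doctor", "nurse", "engineer", "teacher"]:
    template = f"[MASK] worked as a {profession}."
    # Restrict the candidates to the two pronouns and compare their scores.
    scores = {r["token_str"]: r["score"] for r in fill(template, targets=["he", "she"])}
    print(profession, scores)
```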
While these tests often focus on single-axis biases, it's important to recognise that LLMs can harbour biases along multiple dimensions. Recent research has highlighted the prevalence of racial bias in LLMs, revealing that individuals belonging to intersectional minority groups, such as Black women, may experience even greater discrimination than their constituent groups.

Methods

Fine-tuning the language model

For our experiments, we utilised the BERT-base-uncased model from the Hugging Face Transformers library. We fine-tuned the model in two steps: first on the Next Sentence Prediction (NSP) task and then on the Masked Language Model (MLM) task, after which we transferred the shared encoder weights from the MLM model back into the NSP model, leaving each task-specific output layer untouched.
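A minimal sketch of the weight-transfer step, under the assumption that the shared encoder is copied between the two task-specific models while each output head is left untouched; the training loops themselves are omitted for brevity.

```python
# Sketch of sharing the fine-tuned encoder between the MLM and NSP models.
from transformers import BertForMaskedLM, BertForNextSentencePrediction

mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# ... fine-tune nsp_model on the NSP objective, then mlm_model on the MLM objective ...

# Copy the shared encoder weights into the NSP model. strict=False because the
# MLM variant does not instantiate a pooler; the NSP model keeps its own pooler
# and classification head, and the MLM vocabulary head is likewise untouched.
nsp_model.bert.load_state_dict(mlm_model.bert.state_dict(), strict=False)

# ... evaluate or continue fine-tuning nsp_model with the NSP objective ...
```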

Datasets

To address the potential biases in the LLM's training data, we curated a fine-tuning dataset comprising a diverse range of texts. This dataset included autobiographies and young adult fiction written by both male and female authors, as well as works by right-wing and liberal political commentators. Our goal was to introduce the model to perspectives and voices that may have been underrepresented in its initial training.
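As an illustration, the sketch below shows one way such a corpus might be assembled into fixed-length token chunks for fine-tuning. The directory layout and chunk length are placeholders, not the exact preprocessing used in our experiments.

```python
# A rough sketch of assembling one class of texts into fixed-length training
# chunks for MLM fine-tuning. File paths are placeholders.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_LEN = 128

def build_chunks(book_dir: str) -> list[list[int]]:
    # Concatenate all books in the class, then split into MAX_LEN-token chunks.
    # (In practice you would stream and tokenise book by book.)
    text = " ".join(p.read_text(encoding="utf-8") for p in sorted(Path(book_dir).glob("*.txt")))
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i:i + MAX_LEN] for i in range(0, len(ids), MAX_LEN)]

female_author_chunks = build_chunks("data/autobiographies/female_authors")
```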

Bias Assessment

To evaluate the effectiveness of our bias mitigation techniques, we employed two established benchmarks: StereoSet and WinoBias.
StereoSet is designed to assess a language model's susceptibility to stereotypical biases across various dimensions, including gender, profession, race, and religion. It utilises Context Association Tests (CATs) to measure how likely a model is to associate certain groups with stereotypical or anti-stereotypical traits. The model is presented with a context word (e.g., "Muslim" or "Russian") and asked to predict the likelihood of target words or sentences, which can be stereotypical, anti-stereotypical, or unrelated to the given context. StereoSet reports two metrics: the Language Model (LM) score and the Stereotype Score (SS). A higher LM score indicates that the model prefers meaningful associations over unrelated ones, while an SS close to 50% indicates no systematic preference for stereotypes over anti-stereotypes.
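The sketch below conveys the kind of comparison StereoSet makes for a single example: score the stereotypical, anti-stereotypical, and unrelated completions and see which the model prefers. Pseudo-log-likelihood scoring is one common choice for masked models; the official StereoSet protocol differs in its details, and the example sentences here are illustrative.

```python
# A hedged sketch of a StereoSet-style comparison using pseudo-log-likelihood.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    # Mask each token in turn and sum the log-probability of the true token.
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

candidates = {
    "stereotype": "The nurse said she would be back soon.",
    "anti-stereotype": "The nurse said he would be back soon.",
    "unrelated": "The nurse said banana would be back soon.",
}
scores = {k: pseudo_log_likelihood(v) for k, v in candidates.items()}
print(max(scores, key=scores.get), scores)
```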
WinoBias focuses on gender bias in coreference resolution. It presents sentences with gendered pronouns and occupation-based descriptions (e.g., "the nurse," "the doctor") and assesses the model's ability to correctly link pronouns to their referents. A gender-biased model would tend to associate male pronouns with stereotypically male occupations and female pronouns with stereotypically female occupations. A perfect WinoBias score requires equal performance on both stereotypical and anti-stereotypical coreference tasks.
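The summary statistic itself is simple: the gap between accuracy on pro-stereotypical and anti-stereotypical examples. A minimal sketch, assuming the per-example coreference results have already been produced by some scorer:

```python
# Sketch of the WinoBias-style summary statistic: the accuracy gap between
# pro-stereotypical and anti-stereotypical coreference examples.
def bias_gap(pro_correct: list[bool], anti_correct: list[bool]) -> float:
    """Difference in accuracy between pro- and anti-stereotypical examples."""
    pro_acc = sum(pro_correct) / len(pro_correct)
    anti_acc = sum(anti_correct) / len(anti_correct)
    return pro_acc - anti_acc

# e.g. 90% accuracy when pronouns follow the stereotype, 60% when they do not
print(bias_gap([True] * 9 + [False], [True] * 6 + [False] * 4))  # ~0.3
```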
Experiments & Results Three experiments were defined to explore how fine-tuning can affect the bias in BERT. We undertook a fourth experiment exploring the sensitivity of the two tests. For each class of text, ten books were concatenated to minimise the impact of individual writing styles on the bias assessment.

Experiments and Results

Experiment 1: The Impact of Author Gender on Bias Mitigation
In our first experiment, we focused on autobiographies to leverage the power of first-person narratives. By training our model on a diverse collection of autobiographies written by both male and female authors, we aimed to counterbalance the underrepresentation of women in the original training data.
The rationale behind this approach is that stereotypes often stem from generalisations about groups of people. By exposing the model to a wider range of individual experiences, we can challenge these harmful generalisations. While autobiographies may not always explicitly reveal the gender of the narrator, they offer valuable insights into the author's perspective, their interactions with others, and their unique worldview.
By incorporating more female-authored autobiographies into the fine-tuning process, we sought to mitigate the gender bias inherent in the original LLM. This approach not only addresses the imbalance in representation but also enriches the model's understanding of diverse perspectives and experiences.
Experiment 1: Autobiographies, gender of the author
For the female authors, StereoSet shows all biases decreasing with the exception of religion at the inter-sentence level. More of a mixed picture is shown for the male authors. For gender bias, specifically, we see that the collection of works by female authors reduced bias more than the male texts. The largest reduction in bias was in religious bias after fine-tuning on the female-author texts.
Experiment 2: The Impact of Protagonist Gender on Bias Mitigation
In our second experiment, we delved into the realm of young adult fiction, specifically focusing on the gender of the protagonists. Our hypothesis was that by exposing the model to stories featuring strong female protagonists, we could challenge traditional gender stereotypes and promote more equitable representations.
Young adult fiction, often narrated in the third person, offers a clear and explicit connection between the gender of the protagonist and their experiences. By analysing the differences between stories featuring male and female protagonists, we aimed to identify any biases that the model may perpetuate.
Through this experiment, we sought to explore the potential of using literature to not only reflect the world as it is but also to envision a more just and equitable future.
Experiment 2: Gender of the protagonist
Almost all biases decrease after fine-tuning the model on young adult fiction, both for texts with a female protagonist and for texts with a male protagonist. The largest reduction was again in religious bias after fine-tuning on the female-protagonist texts, as measured at the inter-sentence level. The one bias that increased was also religious bias after fine-tuning on the female-protagonist texts, but measured at the intra-sentence level.
Experiment 3: The Impact of Racial Bias and Social Justice
Our third experiment delved into the complex issue of racial bias by comparing two distinct sets of texts:
  • White Nationalist and White Rights Texts: This group of texts focused on promoting white supremacy and advocating for white rights.
  • Black Experiences and Social Justice Texts: This group of texts highlighted the experiences of Black individuals and explored themes of racial inequality and social justice.
By exposing the model to these contrasting perspectives, we aimed to challenge existing racial biases and promote a more equitable understanding of race relations. This experiment sought to investigate whether the model could be influenced to adopt more inclusive and tolerant viewpoints.
Experiment 3: Racial Bias and Social Justice
Fine-tuning the model on the Black experiences and social justice texts reduces all biases except racial bias at the intra-sentence level, which increases by about 8%. For the white nationalist and white rights texts, religious bias increases substantially at the intra-sentence level but decreases by an even larger amount at the inter-sentence level.
Experiment 4a: Training on Stereotypical Sentences
We trained the model on a dataset consisting solely of pro-stereotypical sentences extracted from the StereoSet test set. Our expectation was that this would lead to an increase in stereotypical biases, as reflected in a higher Stereotype Score (SS). As predicted, the model's SS score for gender increased significantly. However, we observed inconsistent results in the WinoBias scores, suggesting a potential lack of alignment between the two metrics.
Experiment 4b: Fine-Tuning with a Single Sentence
In this experiment, we assessed the model's sensitivity to minimal amounts of training data by fine-tuning it on a single, carefully selected sentence. While we anticipated minimal impact on overall performance, the results revealed surprising fluctuations in both StereoSet and WinoBias scores. This highlights the potential for even small amounts of training data to significantly influence the model's behaviour.
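For context, a single-sentence probe of this kind can be as simple as a handful of masked-language-model gradient steps on one example before re-running the benchmarks. The sketch below is illustrative only: the sentence, masked position, and hyperparameters are assumptions, not the exact setup of Experiment 4b.

```python
# A hedged sketch of a single-sentence MLM fine-tuning probe.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

sentence = "The engineer explained her design to the team."
enc = tokenizer(sentence, return_tensors="pt")

# Deterministically mask the pronoun so every step trains on the same signal.
mask_position = 4  # index of "her" in the tokenised sentence (illustrative)
inputs = enc["input_ids"].clone()
inputs[0, mask_position] = tokenizer.mask_token_id
labels = enc["input_ids"].clone()
labels[inputs != tokenizer.mask_token_id] = -100  # only score the masked slot

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(10):  # a few gradient steps on one example
    loss = model(input_ids=inputs, attention_mask=enc["attention_mask"], labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# ...re-run StereoSet and WinoBias on `model` to measure the shift...
```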
These findings underscore the need for caution when interpreting the results of bias evaluation metrics. It is crucial to consider the limitations of these metrics and to employ a multifaceted approach to assess bias in language models.

Discussion

While our research has shown promising results in mitigating bias in LLMs through fine-tuning on underrepresented datasets, we acknowledge the limitations of our approach and the challenges inherent in evaluating bias. One significant challenge is the trade-off between fairness and model performance: in many cases, reducing bias led to a decrease in overall language model performance. This suggests that a careful balance must be struck between these two competing objectives.
Furthermore, the sensitivity of the bias evaluation metrics, particularly StereoSet and WinoBias, raises concerns about their reliability. Our experiments demonstrated that even minor changes to the training data can significantly impact the model's performance on these tests. This highlights the need for more robust and nuanced evaluation methods.
Our research has highlighted several limitations in the StereoSet and WinoBias benchmarks, which can significantly impact the reliability of bias evaluation.
StereoSet Limitations:
  • Uneven Distribution of Target Words: The dataset includes a wide range of target words, some of which are extremely rare in the training data. This can lead to disproportionate influence on the overall bias score, as even minor changes in the model's predictions for these rare words can have a significant impact.
  • Inconsistent Categorisation: The categorisation of target words into racial and religious groups is inconsistent. For example, the inclusion of geographic locations like "Vietnam" and "Afghanistan" alongside racial categories like "Persian people" and "Arab" raises questions about the intended classification scheme.
  • Grammatical Errors and Inconsistent Formatting: The presence of grammatical errors, spelling mistakes, and inconsistent pluralisation can introduce noise into the dataset and potentially bias the evaluation results.
The reliance on crowd-sourcing platforms like Amazon Mechanical Turk to collect bias data presents several significant challenges:
  • Demographic Bias: The demographics of Mechanical Turk workers do not necessarily reflect the diversity of the general population. This can lead to biases in the collected data, as the workers' own biases may influence their responses.
  • Quality Control Issues: The pay-per-task model can incentivise workers to prioritise speed over accuracy, leading to a decrease in the quality of the collected data. This can result in grammatical errors, inconsistencies, and biased responses.
  • Cultural and Linguistic Nuances: The biases that are prevalent in one language or culture may not be directly transferable to others. Therefore, it is crucial to consider the cultural and linguistic context when evaluating bias in language models.
WinoBias Limitations:
  • Narrow Focus on Gender and Professional Roles: WinoBias primarily focuses on gender bias in the context of professional roles, neglecting other forms of bias, such as racial, ethnic, or socioeconomic bias.
  • Oversimplification of Bias: The test assumes that gender bias can be accurately measured solely through coreference resolution, which may not capture more subtle forms of bias embedded in language.
  • Neglect of Intersectional Bias: WinoBias does not account for intersectional biases, which can arise from the intersection of multiple social identities (e.g., gender, race, ethnicity).
  • Conflation of Bias and Model Understanding: It can be difficult to distinguish between instances where a model exhibits gender bias and instances where it simply misunderstands the context or makes a random prediction.
  • Potential for Misinterpretation: A low difference between pro-stereotypical and anti-stereotypical scores does not necessarily indicate a lack of bias. A model that randomly predicts pronouns would also achieve a low difference score, yet it has no real grasp of coreference, so a small gap can reflect poor performance rather than genuine fairness.

Conclusion

Our research has delved into the complex issue of bias in large language models and explored the potential of fine-tuning to mitigate these biases. While we have made significant strides in understanding the problem, we have also uncovered the limitations of current bias evaluation methods.
Both StereoSet and WinoBias, despite their widespread use, suffer from several methodological flaws. StereoSet's inconsistent categorisation and poor quality data can lead to unreliable results, while WinoBias's narrow focus on gender bias and coreference resolution fails to capture the full spectrum of linguistic biases.
The development of a comprehensive and rigorous bias evaluation framework remains a significant challenge. Given the subjective nature of bias, it may be impossible to create a perfect solution. However, we believe that by acknowledging the limitations of existing methods and exploring innovative approaches, we can make significant progress in this area.
Key Considerations for Future Research:
  • Diverse and Representative Datasets: To mitigate bias, it is crucial to train LLMs on diverse and representative datasets that reflect the complexities of human language and society.
  • Robust Bias Evaluation Metrics: Developing more robust and nuanced bias evaluation metrics is essential to accurately assess the impact of bias mitigation techniques.
  • Intersectional Bias: Future research should consider the intersectionality of biases, as individuals may be marginalised due to multiple factors, such as gender, race, and socioeconomic status.
  • Ethical Considerations: The development and deployment of LLMs must be guided by ethical principles to ensure that these technologies are used responsibly and equitably.
By addressing these challenges and embracing a multi-faceted approach to bias mitigation, we can work towards creating language models that are fair, unbiased, and beneficial to society.