P3 - Dataset Bias Flashcards
What is dataset bias?
Dataset bias occurs when the training data disproportionately represents certain groups, contexts, or patterns, skewing the model’s behavior.
Can you provide an example of dataset bias in a chatbot?
For example, a chatbot trained only on English text from Western sources may perform poorly with non-Western users or dialects.
What is historical bias in the context of dataset bias?
Historical bias refers to prejudices embedded in the original data, such as gendered job descriptions (e.g., “nurse” being predominantly associated with women).
Why is dataset bias problematic for chatbots?
It can lead to unfair or inaccurate behavior (such as misunderstandings or offensive responses), reduce trust and usability among diverse users, and perpetuate societal stereotypes.
What does linguistic bias mean?
Linguistic bias is the overrepresentation of certain languages, dialects, or styles of speech (formal vs. informal) in the training data.
What is sampling bias?
Sampling bias happens when specific user groups or demographics are overrepresented or underrepresented in the training dataset.
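One simple way to spot sampling bias is to compare each group’s share of the training data with its share of the intended user population. The sketch below is illustrative only; the group labels, reference shares, and counts are assumptions, not figures from the source.

```python
from collections import Counter

def representation_gap(sample_groups, population_share):
    """Compare each group's share of the training sample with a reference
    population share; positive gap = overrepresented, negative = underrepresented."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    return {group: counts.get(group, 0) / total - expected
            for group, expected in population_share.items()}

# Illustrative (assumed) dialect labels for a chatbot training corpus
sample = ["US-English"] * 800 + ["Indian-English"] * 150 + ["Nigerian-English"] * 50
reference = {"US-English": 0.40, "Indian-English": 0.35, "Nigerian-English": 0.25}

print(representation_gap(sample, reference))
# ≈ {'US-English': 0.40, 'Indian-English': -0.20, 'Nigerian-English': -0.20}
```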
How can dataset bias be mitigated in chatbot training?
Mitigation strategies include ensuring diverse representation in the training dataset, using data augmentation to add examples of underrepresented groups, and regularly auditing the chatbot’s behavior for fairness and inclusivity.
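As a minimal illustration of the “regular auditing” point, the sketch below computes the chatbot’s success rate per user group so that gaps become visible. The group names and the audit log are illustrative assumptions, not data from the source.

```python
def per_group_success_rate(records):
    """records: iterable of (group, handled_correctly) pairs.
    Returns the share of correctly handled conversations per group."""
    totals, successes = {}, {}
    for group, ok in records:
        totals[group] = totals.get(group, 0) + 1
        successes[group] = successes.get(group, 0) + int(ok)
    return {g: successes[g] / totals[g] for g in totals}

# Illustrative (assumed) audit log of chatbot conversations
log = ([("formal-English", True)] * 95 + [("formal-English", False)] * 5
       + [("dialect-speaker", True)] * 70 + [("dialect-speaker", False)] * 30)

rates = per_group_success_rate(log)
print(rates)  # {'formal-English': 0.95, 'dialect-speaker': 0.7}
print("gap:", max(rates.values()) - min(rates.values()))  # flag if the gap exceeds a chosen threshold
```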
Historical bias - example in law enforcement
Predictive policing tools like PredPol have been shown to disproportionately target racial minorities because the historical crime data they are trained on reflect biased law enforcement practices.
Historical bias - example in healthcare
Hurley et al. found that the Framingham Risk Score, used to predict cardiovascular risk, was less accurate for ethnic minorities because it was developed primarily on data from white cohorts.
Historical bias - mortgage underwriting example
In an audit of mortgage-underwriting recommendations, Bowen et al. (2024) found that multiple LLMs systematically recommended more denials and higher interest rates for Black applicants than for otherwise identical white applicants, especially for high-risk loans. Simply instructing the LLM to make unbiased decisions reduced these disparities significantly.
How does cultural context affect model performance according to Zhu et al. (2024)?
Models trained on predominantly English datasets struggle with topics specific to other cultures (e.g., Traditional Chinese Medicine). Chinese LLMs like Qwen-max perform better in such contexts because their training data include more culturally relevant examples.
What are examples of sampling bias in AI systems?
- A US healthcare algorithm used past healthcare costs as a proxy for medical need; because less had historically been spent on Black patients with comparable needs, the algorithm underestimated their needs and favored white patients for extra care.
- The COMPAS recidivism algorithm produced roughly twice as many false positives for Black defendants as for white defendants (a sketch of how such a per-group disparity can be measured follows this list).
- Amazon’s experimental hiring algorithm penalized women because it was trained on historical résumé data from a male-dominated applicant pool.
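The COMPAS finding above is usually expressed as a gap in false positive rates between groups. The sketch below shows how such a per-group false positive rate can be computed; the labels and group assignments are made-up illustrative data, not ProPublica’s analysis.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = false positives / actual negatives (1 = predicted/actual reoffence)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    negatives = sum(1 for t in y_true if t == 0)
    return fp / negatives if negatives else 0.0

def fpr_by_group(y_true, y_pred, groups):
    """Compute the false positive rate separately for each group."""
    return {g: false_positive_rate([t for t, grp in zip(y_true, groups) if grp == g],
                                   [p for p, grp in zip(y_pred, groups) if grp == g])
            for g in set(groups)}

# Illustrative (assumed) data: group A ends up with twice the FPR of group B
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(fpr_by_group(y_true, y_pred, groups))  # A: 0.5, B: 0.25 (order may vary)
```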
What mitigation strategy can help reduce dataset bias?
Data augmentation can mitigate bias by generating synthetic data to balance underrepresented groups. For instance, Sharma et al. (2020) label each example with a protected attribute, flip that attribute to create an “ideal world” counterfactual dataset, and then combine these synthetic examples with the real data.
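A minimal sketch of that idea, assuming tabular examples stored as dicts with a single protected attribute; the field names, values, and records are illustrative, and this is not Sharma et al.’s implementation.

```python
import copy

def flip_protected_attribute(dataset, attr, values):
    """Create a counterfactual twin of every example by flipping the protected
    attribute while keeping the label, then merge the twins with the real data.
    Keeping the label unchanged encodes the 'ideal world' assumption that the
    outcome should not depend on the protected attribute."""
    a, b = values
    twins = []
    for example in dataset:
        twin = copy.deepcopy(example)
        twin[attr] = b if example[attr] == a else a
        twins.append(twin)
    return dataset + twins

# Illustrative (assumed) records
data = [
    {"text": "experienced nurse seeking role", "gender": "female", "label": "hire"},
    {"text": "senior engineer, 10 yrs exp",    "gender": "male",   "label": "hire"},
]
balanced = flip_protected_attribute(data, "gender", ("female", "male"))
print(len(balanced))  # 4: each record now appears with both attribute values
```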