P3 - Dataset Bias Flashcards

1
Q

What is dataset bias?

A

Dataset bias occurs when the training data disproportionately represents certain groups, contexts, or patterns, skewing the model’s behavior.

2
Q

Can you provide an example of dataset bias in a chatbot?

A

For example, a chatbot trained only on English text from Western sources may perform poorly with non-Western users or dialects.

3
Q

What is historical bias in the context of dataset bias?

A

Historical bias refers to prejudices embedded in the original data, such as gendered occupational associations (e.g., “nurse” being associated predominantly with women).
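One way such associations can be surfaced is by counting how often gendered words co-occur with occupation terms in the training text. The sketch below is a minimal illustration with a made-up toy corpus and hand-picked term lists (both are assumptions for demonstration, not drawn from any real dataset):

```python
from collections import Counter

# Toy corpus standing in for real training text (illustrative only).
corpus = [
    "the nurse said she would check on the patient",
    "the nurse told me she works night shifts",
    "the engineer said he fixed the build",
    "the engineer explained his design decisions",
]

FEMALE_TERMS = {"she", "her", "hers"}
MALE_TERMS = {"he", "him", "his"}

def gender_cooccurrence(sentences, occupation):
    """Count gendered pronouns appearing in sentences that mention an occupation."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if occupation in tokens:
            counts["female"] += sum(t in FEMALE_TERMS for t in tokens)
            counts["male"] += sum(t in MALE_TERMS for t in tokens)
    return counts

for job in ("nurse", "engineer"):
    print(job, dict(gender_cooccurrence(corpus, job)))
# A strong skew (e.g., "nurse" co-occurring almost only with female pronouns)
# is one symptom of historical bias inherited from the source texts.
```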

4
Q

Why is dataset bias problematic for chatbots?

A

It can lead to unfair or inaccurate behavior (such as misunderstandings or offensive responses), reduce trust and usability among diverse users, and perpetuate societal stereotypes.

5
Q

What does linguistic bias mean?

A

Linguistic bias is the overrepresentation of certain languages, dialects, or styles of speech (formal vs. informal) in the training data.

6
Q

What is sampling bias?

A

Sampling bias happens when specific user groups or demographics are overrepresented or underrepresented in the training dataset.
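A minimal sketch of how over- or underrepresentation might be checked, assuming each training example carries a demographic label and that reference proportions for the intended user population are known (both are illustrative assumptions):

```python
from collections import Counter

# Hypothetical group labels attached to training examples (illustrative only).
training_groups = ["group_a"] * 800 + ["group_b"] * 150 + ["group_c"] * 50

# Assumed share of each group in the population the chatbot should serve.
reference_share = {"group_a": 0.55, "group_b": 0.30, "group_c": 0.15}

counts = Counter(training_groups)
total = sum(counts.values())

for group, expected in reference_share.items():
    observed = counts[group] / total
    print(f"{group}: observed {observed:.2f}, expected {expected:.2f}, "
          f"representation ratio {observed / expected:.2f}")
# Ratios well below 1 flag underrepresented groups; well above 1, overrepresented ones.
```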

7
Q

How can dataset bias be mitigated in chatbot training?

A

Mitigation strategies include ensuring diverse representation in the training dataset, using data augmentation to add examples of underrepresented groups, and regularly auditing the chatbot's behavior for fairness and inclusivity.
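As a rough illustration of the auditing step, the sketch below compares response accuracy across user groups on a labelled test set; the group names and records are invented for demonstration:

```python
from collections import defaultdict

# Hypothetical evaluation records: (user group, whether the chatbot's response
# was judged correct). In practice these would come from labelled test logs.
eval_records = [
    ("standard_english", True), ("standard_english", True), ("standard_english", False),
    ("regional_dialect", True), ("regional_dialect", False), ("regional_dialect", False),
]

def accuracy_by_group(records):
    """Aggregate response accuracy per user group to expose performance gaps."""
    correct, totals = defaultdict(int), defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

print(accuracy_by_group(eval_records))
# A large gap between groups (here roughly 0.67 vs 0.33) would be a signal to
# add data for the weaker group or rebalance the training mix.
```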

8
Q

Historical bias - example in law enforcement

A

Predictive policing tools such as PredPol have been shown to disproportionately target racial minorities because their historical training data reflect biased law enforcement practices.

9
Q

Historical bias - example in healthcare

A

Hurley et al. found that the Framingham Risk Score, used to predict cardiovascular risk, was less accurate for ethnic minorities because it was derived primarily from data on white cohorts.

10
Q

Historical bias - mortgage underwriting example

A

Bowen et al. (2024) found that multiple LLMs systematically recommended more denials and higher interest rates for Black applicants than for otherwise identical white applicants, especially for high-risk loans. Simply instructing the LLM to make unbiased decisions significantly reduced these disparities.
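The kind of paired-applicant comparison behind this finding can be sketched as below. The `query_model` function is a hypothetical placeholder for a real LLM call, and the application template is invented for illustration; neither reflects the authors' actual setup:

```python
def query_model(application_text: str) -> str:
    """Hypothetical stand-in for an LLM call; returns 'approve' or 'deny'.
    Placeholder only, so the sketch runs end to end."""
    return "approve"

# One application template in which only the protected attribute varies.
TEMPLATE = ("Applicant race: {race}. Credit score: 640. Income: $55,000. "
            "Requested loan: $250,000. Should this mortgage be approved?")

def paired_audit(races=("white", "Black"), n_trials=50):
    """Compare approval rates for otherwise identical applications."""
    approvals = {race: 0 for race in races}
    for _ in range(n_trials):
        for race in races:
            decision = query_model(TEMPLATE.format(race=race))
            approvals[race] += int(decision.strip().lower() == "approve")
    return {race: count / n_trials for race, count in approvals.items()}

print(paired_audit())
# With a real model in place of the stub, a systematic gap in approval rates
# between the paired groups would indicate bias of the kind reported above.
```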

11
Q

How does cultural context affect model performance according to Zhu et al. (2024)?

A

Models trained on predominantly English datasets struggle with topics specific to other cultures (e.g., Traditional Chinese Medicine). Chinese LLMs like Qwen-max perform better in such contexts because their training data include more culturally relevant examples.

12
Q

What are examples of sampling bias in AI systems?

A
  1. A US healthcare algorithm flagged white patients for extra care more often than Black patients, because it used cost history as a proxy for healthcare need and costs were on average lower for Black patients.
  2. The COMPAS recidivism algorithm produced roughly twice as many false positives for Black offenders as for white offenders.
  3. Amazon’s hiring algorithm was biased against women because it relied on historical resume data.
13
Q

What mitigation strategy can help reduce dataset bias?

A

Data augmentation can mitigate bias by generating synthetic data to balance underrepresented groups. For instance, Sharma et al. (2020) label examples with a protected attribute, flip that attribute to create an “ideal world” counterpart for each example, and then integrate these synthetic examples with the real data.
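A minimal sketch of the flipping idea under simplifying assumptions: tabular records with an explicit protected attribute and invented field names. The actual method in Sharma et al. (2020) is more involved than this.

```python
# Toy labelled records with an explicit protected attribute (fields are illustrative).
real_data = [
    {"gender": "female", "occupation": "nurse", "hired": 1},
    {"gender": "male", "occupation": "engineer", "hired": 1},
    {"gender": "female", "occupation": "engineer", "hired": 0},
]

FLIP = {"female": "male", "male": "female"}

def ideal_world_augment(records, protected_key="gender"):
    """Create a counterpart of each record with the protected attribute flipped,
    keeping every other field (including the label) identical."""
    synthetic = []
    for record in records:
        counterpart = dict(record)
        counterpart[protected_key] = FLIP[record[protected_key]]
        synthetic.append(counterpart)
    return synthetic

augmented_data = real_data + ideal_world_augment(real_data)
print(len(augmented_data))
# In the augmented set every (occupation, label) combination now appears with
# both attribute values, so the protected attribute carries no predictive signal.
```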
