Dataset Management & Bias Mitigation Flashcards

Question 1

Q

Why is the quality of the training dataset critical for a chatbot’s performance?

Answer

A

The chatbot learns from the examples in its dataset. If those examples are too few or skewed, its understanding and accuracy will suffer.

Question 2

Q

What are the key characteristics of a high-quality dataset for chatbot training?

Answer

A

It should be large, accurate, well-annotated, classified/structured, readable, domain-specific, and relevant to the chatbot’s application.

Question 3

Q

What does it mean for a dataset to be ‘large’ in the context of chatbot training?

Answer

A

It means having enough examples to cover the variety of ways users might ask questions, which helps the model generalize and handle different phrasings.

Question 4

Q

What is meant by a dataset being ‘accurate and well-annotated’?

Answer

A

It means that any labels (like intent tags or entity labels) are correct and consistent, ensuring that the chatbot learns the right patterns without introducing labeling bias.
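One way to spot inconsistent annotation is to look for the same utterance carrying more than one label. The sketch below is only an illustration: the function name `find_conflicting_labels`, the `(text, intent)` tuple layout, and the sample intents are assumptions, not part of any particular toolkit.

```python
from collections import defaultdict

def find_conflicting_labels(examples):
    """Return utterances that appear with more than one intent label.

    `examples` is assumed to be an iterable of (text, intent) pairs.
    """
    labels_by_text = defaultdict(set)
    for text, intent in examples:
        # Normalize case and whitespace so trivial variants are compared together.
        key = " ".join(text.lower().split())
        labels_by_text[key].add(intent)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }

# Hypothetical sample data: the first two rows are the same question, labeled differently.
examples = [
    ("Where is my order?", "track_order"),
    ("where is my  order?", "order_status"),
    ("Cancel my order", "cancel_order"),
]
conflicts = find_conflicting_labels(examples)
```

Any utterance returned here is a candidate for review by annotators; the check catches inconsistency, not labels that are consistently wrong.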

Question 5

Q

What does ‘classified/structured’ imply for training data?

Answer

A

It implies that the data is organized and preprocessed: cleaned of typos, standardized, and properly split into training, validation, and test sets.
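The split into training, validation, and test sets can be sketched with the standard library alone. The function name `split_dataset`, the 80/10/10 proportions, and the sample data are assumptions chosen for illustration.

```python
import random

def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle and split examples into (train, val, test) lists."""
    # Drop exact duplicates first so the same example cannot land in two splits.
    unique = list(dict.fromkeys(examples))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_test = int(len(unique) * test_fraction)
    n_val = int(len(unique) * val_fraction)
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test

# Hypothetical data: 20 distinct (text, intent) pairs.
examples = [(f"utterance {i}", "intent_a" if i % 2 else "intent_b") for i in range(20)]
train, val, test = split_dataset(examples)
```

Removing duplicates before splitting is a design choice: it keeps an identical example from appearing in both the training and test sets, which would inflate test scores.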

Question 6

Q

How does the ‘readable’ quality of a dataset affect chatbot training?

Answer

A

A readable dataset is in a format and of a quality suitable for training: it reflects actual user language and is free of garbled or irrelevant content.

Question 7

Q

What does ‘domain-specific and relevant’ mean for a training dataset?

Answer

A

It means the data covers the specific topics, terminology, and conversation styles relevant to the chatbot’s intended field, ensuring it can handle queries accurately.

Question 8

Q

What is dataset bias and why is it problematic for chatbots?

Answer

A

Dataset bias refers to systematic prejudices or skews in the training data, which can lead to unfair, unbalanced, or incorrect behavior in the chatbot.

Question 9

Q

What is confirmation bias in the context of training datasets?

Answer

A

Confirmation bias occurs when the data predominantly reflects a particular viewpoint or expected outcome, which can cause the chatbot to consistently favor that viewpoint.

Question 10

Q

What is historical bias in training datasets?

Answer

A

Historical bias happens when the dataset contains outdated norms or information, leading the chatbot to offer advice that may no longer be applicable or fair by current standards.

Question 11

Q

What is labeling bias in training data?

Answer

A

Labeling bias arises when labels are influenced by subjective judgments or inconsistencies, causing the chatbot to learn and replicate incorrect patterns.

Question 12

Q

What is linguistic bias in a dataset and how can it affect a chatbot?

Answer

A

Linguistic bias occurs when the dataset favors a particular language style or dialect, which may result in the chatbot performing poorly with informal language or different dialects.

Question 13

Q

What is sampling bias in the context of training datasets?

Answer

A

Sampling bias happens when the dataset isn’t representative of the entire user population, leading to good performance on well-represented scenarios and poor performance on under-represented ones.

Question 14

Q

What is selection bias in training datasets?

Answer

A

Selection bias is introduced when data is collected through a filter, such as specific inclusion criteria, rather than sampled broadly, so natural variety is lost and important edge cases may be missing.

Question 15

Q

How can developers identify biases in a training dataset?

Answer

A

By performing statistical analysis, reviewing random samples of chat logs, and conducting bias tests with curated test cases to check for systematic skews.
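The statistical-analysis step can start with something as simple as a label distribution. In this sketch the function names `label_shares` and `under_represented`, the 10% threshold, and the sample labels are assumptions for illustration.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def under_represented(labels, min_share=0.1):
    """Return labels whose share falls below `min_share` (an assumed threshold)."""
    shares = label_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)

# Hypothetical intent labels from a support chatbot dataset.
labels = ["billing"] * 60 + ["shipping"] * 35 + ["refund"] * 5
rare = under_represented(labels)
```

A skewed distribution is only a signal, not proof of bias; the flagged labels are the ones worth checking against manual sample reviews and curated test cases.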

Question 16

Q

What is one method to mitigate sampling bias in a training dataset?

Answer

A

Augmenting or re-balancing the dataset by adding more data to cover under-represented cases can reduce sampling bias.
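A minimal re-balancing sketch is random oversampling, which repeats existing minority-class examples until every class matches the largest one. This is a stand-in for collecting genuinely new data, since duplicates add no new phrasings. The function name `oversample` and the sample data are assumptions.

```python
import random
from collections import defaultdict

def oversample(examples, seed=0):
    """Duplicate minority-class examples until all classes match the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the largest class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical data: four "billing" examples and a single "shipping" example.
examples = [(f"billing question {i}", "billing") for i in range(4)]
examples.append(("where is my parcel", "shipping"))
balanced = oversample(examples)
```

Oversampling should be applied to the training split only, otherwise repeated examples can leak into validation or test data.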

Question 17

Q

How does data anonymization and scrubbing help mitigate bias-related issues?

Answer

A

By removing or masking personal identifiers, it prevents the chatbot from learning personal information from the training data and inadvertently revealing it.
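Masking identifiers can be sketched with regular expressions. The two patterns below are deliberately simple assumptions that cover common email and phone formats only; names, addresses, and account numbers would need more than this, and the placeholder tokens `<EMAIL>` and `<PHONE>` are hypothetical.

```python
import re

# Simplified patterns (assumptions): they will miss unusual formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text):
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

scrubbed = scrub("Reach me at jane.doe@example.com or +1 555-123-4567.")
# scrubbed == "Reach me at <EMAIL> or <PHONE>."
```

Replacing identifiers with typed placeholders, rather than deleting them, keeps the sentence structure intact so the chatbot can still learn how users phrase such requests.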

Question 18

Q

What are bias-aware training techniques?

Answer

A

They are methods applied during model training to counteract bias, such as re-weighting examples so that minority classes count more, or adversarial training that makes the model less dependent on sensitive attributes.
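Re-weighting can be illustrated with inverse-frequency class weights, computed as n_samples / (n_classes * class_count). This is the same heuristic scikit-learn uses for `class_weight="balanced"`; the function name `class_weights` and the sample labels are assumptions.

```python
from collections import Counter

def class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rare classes get weights above 1, common classes get weights below 1.
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }

# Hypothetical labels: "refund" is the minority class.
labels = ["billing"] * 8 + ["refund"] * 2
weights = class_weights(labels)
# weights == {"billing": 0.625, "refund": 2.5}
```

These weights would typically multiply each example's loss term during training, so mistakes on the minority class cost more than mistakes on the majority class.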

Question 19

Q

How can synthetic data generation improve the training dataset?

Answer

A

Synthetic data generation creates new training examples through paraphrasing, machine translation, simulation, or AI-generated content to supplement real data and cover rare or under-represented cases.
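Of the approaches listed, simulation is the easiest to sketch: filling hand-written templates with slot values. The function name `generate_from_templates`, the `{item}` placeholder, and the sample templates are assumptions; paraphrasing or machine translation would require an external model and is not shown.

```python
def generate_from_templates(templates, slot_values, intent):
    """Fill each template's {item} slot with each value and label the result."""
    examples = []
    for template in templates:
        for value in slot_values:
            examples.append((template.format(item=value), intent))
    return examples

# Hypothetical templates for a rarely seen "return_item" intent.
templates = [
    "Can I return my {item}?",
    "How do I send back the {item}?",
]
synthetic = generate_from_templates(templates, ["laptop", "headphones"], "return_item")
```

Template output tends to be repetitive, so it is usually mixed with real utterances rather than used as the only source for an intent.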

Question 20

Q

What is the overall goal of managing and augmenting the training dataset for a chatbot?

Answer

A

To ensure the chatbot learns from high-quality, balanced examples, thereby improving its accuracy, fairness, and ability to handle diverse queries without inheriting biases.