Dataset Management & Bias Mitigation Flashcards
Why is the quality of the training dataset critical for a chatbot’s performance?
Because the chatbot learns from the examples in its dataset, and if these examples are insufficient or skewed, the chatbot’s understanding and accuracy will be poor.
What are the key characteristics of a high-quality dataset for chatbot training?
It should be large, accurate, well-annotated, classified/structured, readable, domain-specific, and relevant to the chatbot’s application.
What does it mean for a dataset to be ‘large’ in the context of chatbot training?
It means having enough examples to cover the variety of ways users might ask questions, which helps the model generalize and handle different phrasings.
What is meant by a dataset being ‘accurate and well-annotated’?
It means that any labels (like intent tags or entity labels) are correct and consistent, ensuring that the chatbot learns the right patterns without introducing labeling bias.
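One way to spot inconsistent annotation is to look for the same utterance carrying more than one label. The sketch below is only illustrative: the function name `find_conflicting_labels`, the `(text, intent)` tuple layout, and the sample intents are assumptions, not part of any particular toolkit.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Return utterances that appear with more than one intent label."""
    labels_by_text = defaultdict(set)
    for text, intent in examples:
        # Normalize case and whitespace so trivial variants are compared together.
        key = " ".join(text.lower().split())
        labels_by_text[key].add(intent)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical annotated examples: (utterance, intent label).
examples = [
    ("Where is my order?", "track_order"),
    ("where is my order?", "order_status"),
    ("Cancel my order", "cancel_order"),
]
conflicts = find_conflicting_labels(examples)
print(conflicts)
```

Any utterance reported here would need a human decision about which label is correct.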
What does ‘classified/structured’ imply for training data?
It implies that the data is organized and preprocessed—cleaned of typos, standardized, and properly split into training, validation, and test sets.
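The split into training, validation, and test sets can be sketched as follows. This is a minimal standard-library version that splits per label so each set keeps roughly the same class mix; the function name `stratified_split`, the 10% fractions, and the sample data are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (text, label) pairs into train/val/test, label by label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    train, val, test = [], [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test


# Hypothetical dataset: 20 greetings and 10 refund requests.
examples = [(f"greeting {i}", "greet") for i in range(20)]
examples += [(f"refund {i}", "refund") for i in range(10)]
train, val, test = stratified_split(examples)
print(len(train), len(val), len(test))
```

Very small classes can end up with no validation or test examples under this rounding, so the fractions may need adjusting for rare labels.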
How does the ‘readable’ quality of a dataset affect chatbot training?
A readable dataset is in a format the training pipeline can parse, and its text reflects actual user language without garbled or irrelevant content.
What does ‘domain-specific and relevant’ mean for a training dataset?
It means the data covers the specific topics, terminology, and conversation styles relevant to the chatbot’s intended field, ensuring it can handle queries accurately.
What is dataset bias and why is it problematic for chatbots?
Dataset bias refers to systematic prejudices or skews in the training data, which can lead to unfair, unbalanced, or incorrect behavior in the chatbot.
What is confirmation bias in the context of training datasets?
Confirmation bias occurs when the data predominantly reflects a particular viewpoint or expected outcome, potentially causing the chatbot to always side with that viewpoint.
What is historical bias in training datasets?
Historical bias happens when the dataset contains outdated norms or information, leading the chatbot to offer advice that may no longer be applicable or fair by current standards.
What is labeling bias in training data?
Labeling bias arises when labels are influenced by subjective judgments or inconsistencies, causing the chatbot to learn and replicate incorrect patterns.
What is linguistic bias in a dataset and how can it affect a chatbot?
Linguistic bias occurs when the dataset favors a particular language style or dialect, which may result in the chatbot performing poorly with informal language or different dialects.
What is sampling bias in the context of training datasets?
Sampling bias happens when the dataset isn’t representative of the entire user population, leading to good performance on well-represented scenarios and poor performance on under-represented ones.
What is selection bias in training datasets?
Selection bias is introduced when the data is collected or filtered by specific criteria rather than sampled broadly, so the dataset loses natural variety and may miss important edge cases.
How can developers identify biases in a training dataset?
By performing statistical analysis, reviewing random samples of chat logs, and conducting bias tests with curated test cases to check for systematic skews.
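A simple form of the statistical analysis mentioned above is to compute each label's share of the dataset and flag the labels that fall below a threshold. The sketch below assumes a flat list of labels; the function names, the 10% cutoff, and the sample labels are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.1):
    """List labels whose share falls below min_share (an assumed cutoff)."""
    shares = label_distribution(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Hypothetical label column from a support-chat dataset.
labels = ["billing"] * 60 + ["shipping"] * 35 + ["accessibility"] * 5
print(underrepresented(labels))
```

The same counting can be applied to any attribute recorded with the data, such as language or channel, not only to intent labels.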
What is one method to mitigate sampling bias in a training dataset?
Augmenting or re-balancing the dataset by adding more data to cover under-represented cases can reduce sampling bias.
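Re-balancing can be sketched with random oversampling. Note that this repeats existing minority examples instead of collecting new ones, so it is a stand-in for the data gathering described above, not a replacement for it. The function name `oversample` and the sample data are assumptions.

```python
import random
from collections import Counter, defaultdict


def oversample(examples, seed=0):
    """Repeat minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


# Hypothetical imbalanced dataset: 8 billing questions, 2 complaints.
examples = [(f"billing question {i}", "billing") for i in range(8)]
examples += [(f"complaint {i}", "complaint") for i in range(2)]
balanced = oversample(examples)
print(Counter(label for _, label in balanced))
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into validation or test data.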
How do data anonymization and scrubbing help mitigate bias-related issues?
By removing or masking personal identifiers, it prevents the chatbot from learning and inadvertently revealing personal information from the training data.
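Masking identifiers can be sketched with regular expressions. The two patterns below are deliberately simple assumptions that catch common email and phone formats only; real scrubbing usually needs broader coverage (names, addresses, account numbers) and manual review.

```python
import re

# Simplified, assumed patterns; they will miss unusual formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(scrub("Contact me at jane.doe@example.com or +1 (555) 010-9999."))
# prints: Contact me at [EMAIL] or [PHONE].
```

Using placeholder tokens instead of deleting the text keeps the sentence structure intact for training.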
What are bias-aware training techniques?
They are training methods that counteract skew in the data, such as re-weighting examples so that minority classes carry more importance, or using adversarial training to reduce the model's reliance on sensitive attributes.
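Re-weighting can be sketched with a common inverse-frequency heuristic, where each class weight is total / (number of classes × class count). The function name `class_weights` and the sample labels are assumptions; how the weights enter the loss depends on the training framework in use.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical label column: 80 billing examples, 20 complaints.
labels = ["billing"] * 80 + ["complaint"] * 20
weights = class_weights(labels)
print(weights)
# prints: {'billing': 0.625, 'complaint': 2.5}
```

With these weights, each class contributes the same total weight to training (80 × 0.625 = 20 × 2.5 = 50).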
How can synthetic data generation improve the training dataset?
Synthetic data generation creates new training examples through paraphrasing, machine translation, simulation, or AI-generated content to supplement real data and cover rare or under-represented cases.
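The simplest form of synthetic generation is filling templates with slot values, which is a rough stand-in for the paraphrasing and simulation approaches named above. The templates, slot names, and function name in this sketch are all hypothetical.

```python
from itertools import product


def generate_from_templates(templates, slots):
    """Fill every template with every combination of slot values."""
    names = sorted(slots)
    utterances = []
    for template in templates:
        for values in product(*(slots[name] for name in names)):
            utterances.append(template.format(**dict(zip(names, values))))
    return utterances


# Hypothetical templates and slot values for a retail support bot.
templates = ["I want to {action} my {item}", "How do I {action} my {item}?"]
slots = {"action": ["return", "exchange"], "item": ["order", "subscription"]}
synthetic = generate_from_templates(templates, slots)
print(len(synthetic), synthetic[0])
```

Template output tends to be repetitive, so it is usually mixed with real user utterances and not used on its own.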
What is the overall goal of managing and augmenting the training dataset for a chatbot?
To ensure the chatbot learns from high-quality, balanced examples, thereby improving its accuracy, fairness, and ability to handle diverse queries without inheriting biases.