Datasets Flashcards
Real Data
Data collected from real-life situations.
Synthetic Data
Data artificially generated using algorithms or simulations and designed to mimic the statistical properties of real data.
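As a rough illustration of "mimicking statistical properties", the sketch below fits a mean and standard deviation to a small set of real measurements and then draws synthetic values from a normal distribution with those parameters. The measurements, the function name `make_synthetic`, and the choice of a normal distribution are all assumptions made for this example; real generators are usually far more sophisticated.

```python
import random
import statistics

def make_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values.

    Assumption for illustration: the real data is roughly normally distributed.
    """
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical real measurements (not taken from any actual dataset).
real_values = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
synthetic_values = make_synthetic(real_values, n=1000)

print(round(statistics.mean(synthetic_values), 2))
print(round(statistics.stdev(synthetic_values), 2))
```

The synthetic values share the mean and spread of the real ones but correspond to no actual measurement, which is also why they can miss anomalies that the real data contains.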
3 advantages of real data
- Authentic and accurately reflects real life, which can make models trained on it more accurate
- Complex in the sense that it captures natural variability and anomalies
- Insights from real data are more trustworthy
3 disadvantages of real data
- Can be expensive and/or time-consuming to gather
- Real data may require extensive cleaning and preprocessing to ensure quality
- Access to real data may be difficult due to legal regulations and privacy laws.
3 advantages of synthetic data
- It is cost-effective since there is no collection process
- Synthetic data does not represent real individuals, so it avoids most privacy concerns
- It can be customized to be more balanced (for example, by generating extra examples of rare cases)
3 disadvantages of synthetic data
- Lack of realism
- Generating high-quality synthetic data requires a complex, resource-intensive process
- Skepticism from stakeholders: certain industries, such as healthcare and finance, may not accept models trained on synthetic data
Confirmation bias in datasets and solution
This occurs when the dataset favors a particular viewpoint, leading to skewed model predictions.
Ensure the training data is diverse and representative of all possible viewpoints.
Historical bias in datasets and solution
This occurs when the training data contains outdated information that no longer reflects current conditions.
Regularly update the training data to include recent information and trends.
Labeling bias in datasets and solution
This occurs when the labels applied to data are subjective, inaccurate, or incomplete, affecting the model’s performance.
To fix this, implement a detailed and consistent labeling process, and use tools to detect and correct labeling inconsistencies.
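One simple way to detect labeling inconsistencies is to have several annotators label the same items and flag every item on which they disagree. The sketch below assumes annotations arrive as `(item_id, label)` pairs; the function name `find_label_conflicts` and the sample data are hypothetical.

```python
from collections import Counter

def find_label_conflicts(annotations):
    """Return {item_id: Counter of labels} for items with more than one distinct label."""
    labels_by_item = {}
    for item_id, label in annotations:
        labels_by_item.setdefault(item_id, []).append(label)

    conflicts = {}
    for item_id, labels in labels_by_item.items():
        if len(set(labels)) > 1:  # annotators disagreed on this item
            conflicts[item_id] = Counter(labels)
    return conflicts

# Hypothetical annotations from multiple annotators.
annotations = [
    ("img1", "cat"), ("img1", "cat"),
    ("img2", "dog"), ("img2", "cat"),
    ("img3", "dog"),
]
conflicts = find_label_conflicts(annotations)
print(list(conflicts))  # ['img2']
```

Flagged items can then be sent back for review against the labeling guidelines rather than being used for training as they are.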
Linguistic bias in datasets and solution
This occurs when the dataset is biased towards specific linguistic features. For example, it may favor formal language and neglect informal styles or dialects.
Include diverse linguistic styles and dialects as part of the training data.
Sampling bias in datasets and solution
This occurs when the training dataset is not representative of the entire population, leading to reduced model performance.
Ensure the training dataset is representative of the entire target population. One way to do this is stratified sampling, which preserves each demographic group's share of the data.
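Stratified sampling can be sketched as follows: split the population into groups, then draw the same fraction from each group so that every group keeps its share. The helper `stratified_sample`, the `region` field, and the population itself are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction of records from every group defined by `key`."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for members in groups.values():
        # Keep at least one record per group so small groups are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 records from region "A" and 20 from region "B".
population = [{"id": i, "region": "A"} for i in range(80)]
population += [{"id": i, "region": "B"} for i in range(80, 100)]

sample = stratified_sample(population, key="region", fraction=0.1)
counts = Counter(record["region"] for record in sample)
print(counts["A"], counts["B"])  # 8 2
```

The 80/20 split of the population is preserved in the sample, whereas a small purely random draw could by chance contain few or no records from region "B".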
Selection bias in datasets and solution
This occurs when the training data is not selected randomly but is instead chosen based on specific criteria, leading to skewed model behavior.
Use random sampling techniques to select training data.
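A minimal sketch of simple random sampling, assuming the candidate records are available as a list: every record has the same chance of being chosen, so no selection criterion can skew the result. The names `random_sample`, `candidates`, and `training_ids` are hypothetical.

```python
import random

def random_sample(records, k, seed=None):
    """Pick k distinct records uniformly at random, without replacement."""
    rng = random.Random(seed)  # pass a seed to make the selection reproducible
    return rng.sample(records, k)

# Hypothetical pool of 1000 candidate record IDs.
candidates = list(range(1000))
training_ids = random_sample(candidates, k=100, seed=42)

print(len(training_ids))  # 100
```

Fixing the seed makes the selection repeatable for auditing; leaving it as `None` gives a different sample on each run.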