Datasets Flashcards

1
Q

Real Data

A

Data collected from real-life situations and events.

2
Q

Synthetic Data

A

Data generated artificially by algorithms or simulations. It is designed to mimic the statistical properties of real data.
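
As a minimal sketch of what "mimicking statistical properties" can mean, the example below fits a normal distribution to a small real sample and draws synthetic values from it. The function name `synthesize`, the sample values, and the choice of a normal distribution are illustrative assumptions; real generators are usually far more elaborate.

```python
import random
import statistics


def synthesize(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values.

    Assumption: the real data is roughly normal, so matching its mean and
    standard deviation is a reasonable (if crude) imitation.
    """
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" measurements, used only for illustration.
real_heights_cm = [158.0, 162.5, 170.1, 175.3, 168.4, 181.2, 165.0, 172.8]
synthetic_heights_cm = synthesize(real_heights_cm, n=1000)
```

The synthetic values share the mean and spread of the real sample but correspond to no real individual.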

3
Q

3 advantages of real data

A
  • Authentic: it reflects real-life conditions, which can make models trained on it more accurate
  • Complex, in the sense that it captures natural variability and anomalies
  • Insights drawn from real data are more trustworthy
4
Q

3 disadvantages of real data

A
  • It can be expensive and/or time-consuming to gather
  • It may require extensive cleaning and preprocessing to ensure quality
  • Access to it may be difficult due to legal regulations and privacy laws
5
Q

3 advantages of synthetic data

A
  • It is often cost-effective, since there is no collection process
  • It does not represent real individuals, so it raises fewer privacy concerns
  • It can be customized, for example to be more balanced or to control how often rare cases appear
5
Q

3 disadvantages of synthetic data

A
  • It may lack realism and miss subtleties of real data
  • Generating high-quality synthetic data requires a complex, resource-intensive process
  • Stakeholders may be skeptical: certain industries, such as healthcare and finance, may not accept models trained on synthetic data
6
Q

Confirmation bias in datasets and solution

A

This occurs when the dataset favors a particular viewpoint, leading to skewed model predictions.

Ensure the training data is diverse and representative of all possible viewpoints.

7
Q

Historical bias in datasets and solution

A

This occurs when the training data contains outdated information.

Regularly update the training data to include recent information and trends.

8
Q

Labeling bias in datasets and solution

A

This occurs when the labels applied to data are subjective, inaccurate, or incomplete, affecting the model's performance.

To fix this, implement a detailed and consistent labeling process. Also use tools to detect and correct labeling inconsistencies.
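
One simple way to detect labeling inconsistencies, sketched here under the assumption that some items are labeled by more than one annotator, is to flag every item whose labels disagree. The function name `find_label_conflicts` and the sample annotations are hypothetical.

```python
from collections import Counter, defaultdict


def find_label_conflicts(annotations):
    """Return items that received more than one distinct label.

    annotations: iterable of (item_id, label) pairs, one per annotator vote.
    The result maps each conflicting item_id to a Counter of its labels.
    """
    labels_by_item = defaultdict(list)
    for item_id, label in annotations:
        labels_by_item[item_id].append(label)
    return {
        item_id: Counter(labels)
        for item_id, labels in labels_by_item.items()
        if len(set(labels)) > 1
    }


# Made-up annotations: item "t2" was labeled differently by two annotators.
annotations = [
    ("t1", "positive"),
    ("t1", "positive"),
    ("t2", "negative"),
    ("t2", "positive"),
    ("t3", "neutral"),
]
conflicts = find_label_conflicts(annotations)
```

Flagged items can then be sent back for review or resolved with a written labeling guideline.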

9
Q

Linguistic bias in datasets and solution

A

This occurs when the dataset is biased towards specific linguistic features. For example, it may favor formal language and neglect other linguistic styles.

Solution:
Include diverse linguistic styles and dialects as part of the training data.

10
Q

Sampling bias in datasets and solution

A

This occurs when the training dataset is not representative of the entire population, leading to reduced model performance.

Ensure the training dataset is representative of the entire target population. For instance, this can be done through stratified sampling to maintain diversity across various demographics.
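
Stratified sampling can be sketched as follows: group the records by a demographic attribute, then sample the same fraction from each group so the sample keeps the population's proportions. The function name `stratified_sample`, the `region` attribute, and the toy population are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction of records from each stratum.

    key: function mapping a record to its stratum (e.g. a demographic group).
    At least one record is taken from every non-empty stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy population: 80 urban records and 20 rural records.
population = [
    {"id": i, "region": "urban" if i < 80 else "rural"} for i in range(100)
]
sample = stratified_sample(population, key=lambda r: r["region"], fraction=0.1)
```

With a 10% fraction, the sample contains 8 urban and 2 rural records, matching the 80/20 split of the population.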

11
Q

Selection bias in datasets and solution

A

This occurs when the training data is not selected randomly but is instead chosen based on specific criteria, leading to skewed model behavior.

Use random sampling techniques to select training data.
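
A minimal sketch of simple random sampling, assuming the full pool of candidate records is available in memory. The function name `random_sample` and the candidate IDs are hypothetical.

```python
import random


def random_sample(records, k, seed=0):
    """Pick k records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, k)


# Hypothetical pool of 1000 candidate record IDs.
candidates = list(range(1000))
training_ids = random_sample(candidates, k=100)
```

Every candidate has the same chance of being chosen, so no selection criterion can skew the training set. Fixing the seed keeps the draw reproducible.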
