Datasets Flashcards

Question 1

Q

Real Data

Answer

A

Data collected from real-life situations

Question 2

Q

Synthetic Data

Answer

A

Data generated artificially by algorithms or simulations. It is designed to mimic the statistical properties of real data.
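
As a rough illustration of "mimicking statistical properties", the sketch below fits a normal distribution to a handful of real measurements and samples new values from it. The helper `synthesize` and the height values are hypothetical, and the normal-distribution assumption is a simplification; real synthetic-data generators are usually far more elaborate.

```python
import random
import statistics


def synthesize(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values.

    Assumption: the real data is roughly normally distributed.
    """
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (heights in cm), invented for this example.
real_heights_cm = [158.0, 162.5, 170.1, 171.3, 175.8, 180.2, 166.4, 169.9]
synthetic_heights_cm = synthesize(real_heights_cm, n=5000)

# The synthetic sample should have a similar mean and spread to the real one.
print(statistics.mean(real_heights_cm), statistics.mean(synthetic_heights_cm))
print(statistics.stdev(real_heights_cm), statistics.stdev(synthetic_heights_cm))
```

No synthetic value corresponds to a real individual, yet the sample as a whole keeps the mean and standard deviation of the original data.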

Question 3

Q

3 advantages of real data

Answer

A

Authentic: it accurately reflects real life, which can make models trained on it more accurate
Complex, in the sense that it captures natural variability and anomalies
Insights drawn from real data are more trustworthy

Question 4

Q

3 disadvantages of real data

Answer

A

Can be expensive and/or time-consuming to gather
Real data may require extensive cleaning and preprocessing to ensure quality
Access to real data may be difficult due to legal regulations and privacy laws.

Question 5

Q

3 advantages of synthetic data

Answer

A

It is cost-effective, since there is no real-world collection process
Synthetic data does not represent real individuals, so it raises fewer privacy concerns
It can be customized, for example to balance classes or to control how often rare cases appear

Question 6

Q

3 disadvantages of synthetic data

Answer

A

It may lack realism and miss the natural variability of real data
Generating high-quality synthetic data requires a complex, resource-intensive process
Stakeholders may be skeptical, and certain industries, such as healthcare and finance, may not accept models trained on synthetic data.

Question 7

Q

Confirmation bias in datasets and solution

Answer

A

This occurs when the dataset favors a particular viewpoint, leading to skewed model predictions.

Ensure the training data is diverse and representative of all possible viewpoints.

Question 8

Q

Historical bias in datasets and solution

Answer

A

This occurs when the training data contains outdated information that no longer reflects current conditions.

Regularly update the training data to include recent information and trends.

Question 9

Q

Labelling bias in datasets and solution

Answer

A

This occurs when the labels applied to data are subjective, inaccurate, or incomplete, affecting the model’s performance.

To fix this, implement a detailed and consistent labeling process. Also use tools to detect and correct labeling inconsistencies.
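
One simple way to detect labeling inconsistencies is to have several annotators label the same items and flag every item on which they disagree. The sketch below assumes annotations arrive as `(item_id, label)` pairs; the helper `find_label_conflicts` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def find_label_conflicts(annotations):
    """Return {item_id: Counter of labels} for items with more than one distinct label."""
    labels_by_item = defaultdict(Counter)
    for item_id, label in annotations:
        labels_by_item[item_id][label] += 1
    return {
        item_id: counts
        for item_id, counts in labels_by_item.items()
        if len(counts) > 1
    }


# Hypothetical annotations: each item may be labeled by several annotators.
annotations = [
    ("review_1", "positive"), ("review_1", "positive"),
    ("review_2", "negative"), ("review_2", "positive"), ("review_2", "negative"),
    ("review_3", "neutral"),
]

conflicts = find_label_conflicts(annotations)
for item_id, counts in conflicts.items():
    # A majority vote is one possible correction; manual review is another.
    majority_label, _ = counts.most_common(1)[0]
    print(item_id, dict(counts), "majority:", majority_label)
```

Here only `review_2` is flagged, because it is the only item whose annotators disagree.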

Question 10

Q

Linguistic bias in datasets and solution

Answer

A

This occurs when the dataset is biased towards specific linguistic features. For example, it may favor formal language and neglect other linguistic styles, such as informal speech or dialects.

Solution:
Include diverse linguistic styles and dialects as part of the training data.

Question 11

Q

Sampling Bias in datasets and solution

Answer

A

This occurs when the training dataset is not representative of the entire population, leading to reduced model performance.

Ensure the training dataset is representative of the entire target population. For instance, this can be done through stratified sampling to maintain diversity across various demographics.
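
Stratified sampling can be sketched as follows: group the records by a demographic attribute, then draw the same fraction from each group so the sample keeps the population's proportions. The helper `stratified_sample`, the `age_group` field, and the population below are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for group in strata.values():
        # Keep at least one record per stratum so small groups are not dropped.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 60% aged 18-29, 30% aged 30-49, 10% aged 50+.
population = (
    [{"id": i, "age_group": "18-29"} for i in range(60)]
    + [{"id": i, "age_group": "30-49"} for i in range(60, 90)]
    + [{"id": i, "age_group": "50+"} for i in range(90, 100)]
)

sample = stratified_sample(population, key=lambda r: r["age_group"], fraction=0.2)
print(Counter(r["age_group"] for r in sample))
```

The 20-record sample contains 12, 6, and 2 records from the three age groups, matching the 60/30/10 split of the population.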

Question 12

Q

Selection bias in datasets and solution

Answer

A

This occurs when the training data is not selected at random but is instead chosen based on specific criteria, leading to skewed model behavior.

Use random sampling techniques to select training data.
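
A minimal sketch of the difference, using only the standard library: taking "the first 100 records" is selection by a criterion (arrival order), while `random.Random.sample` gives every record the same chance of being chosen. The helper `random_sample` and the candidate list are hypothetical.

```python
import random


def random_sample(records, k, seed=0):
    """Pick k distinct records uniformly at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(records, k)


# Hypothetical pool of 1000 candidate records, identified by index.
candidates = list(range(1000))

# Criterion-based selection: only the earliest records are ever chosen.
biased_selection = candidates[:100]

# Random selection: every record is equally likely to be chosen.
random_selection = random_sample(candidates, k=100)

print(max(biased_selection), max(random_selection))
```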

(12 cards)