Lecture 5 Flashcards

1
Q

What is survivorship bias?

A

The tendency to focus on surviving examples while ignoring those that didn’t make it, leading to skewed conclusions.

2
Q

Why is it important not to take data at face value?

A

Because data might be biased or incomplete, affecting conclusions and predictions.

3
Q

What is Anscombe’s quartet?

A

A set of four datasets with nearly identical summary statistics (means, variances, correlation, and fitted regression line) that nevertheless look very different when plotted.
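
As a rough illustration, the sketch below computes a few summary statistics for the four datasets using only the Python standard library. The values are the commonly published Anscombe (1973) numbers; the helper names `pearson`, `summarize`, and `SUMMARY` are made up for this example.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarize(xs, ys):
    """Collect the summary statistics that the four datasets (almost) share."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows agree to about two decimal places, yet a scatter plot of each dataset shows a linear trend, a curve, a line with one outlier, and a vertical cluster with one extreme point.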

4
Q

Why should we always visualize data before analysis?

A

Because summary statistics alone can be misleading.

5
Q

What is a scatter plot matrix?

A

A grid of scatter plots showing pairwise relationships between multiple features.

6
Q

What are common issues in raw data?

A

Missing values, outliers, and class imbalance.

7
Q

Why do missing values occur in datasets?

A

They can be due to data collection issues, human errors, or non-responses.

8
Q

What are missing labels in datasets?

A

Situations where the target variable is not provided for some instances.

9
Q

What are the simplest ways to handle missing data?

A

Removing the feature, removing instances, or imputing missing values.
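
The three options can be sketched in plain Python as follows. The toy table, its column names, and the helper functions are hypothetical, and `None` is assumed to mark a missing value; in practice a dataframe library would usually do this work.

```python
from statistics import median

# Hypothetical toy table: each dict is one instance, None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Oslo"},
    {"age": None, "income": 61000, "city": "Bergen"},
    {"age": 29, "income": None, "city": "Oslo"},
    {"age": 41, "income": 58000, "city": None},
]


def drop_feature(rows, feature):
    """Option 1: remove one feature (column) from every instance."""
    return [{k: v for k, v in row.items() if k != feature} for row in rows]


def drop_incomplete(rows):
    """Option 2: remove every instance (row) with at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_median(rows, feature):
    """Option 3: fill a numerical feature with the median of its observed values."""
    fill = median(row[feature] for row in rows if row[feature] is not None)
    return [
        {**row, feature: fill if row[feature] is None else row[feature]}
        for row in rows
    ]


print(len(drop_incomplete(rows)), "complete instance(s) remain after row removal")
print(impute_median(rows, "age"))
```

On this toy table, removing incomplete instances discards three of the four rows, which shows how quickly row removal can shrink a dataset when the gaps are spread across features.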

10
Q

What should you consider before removing missing data?

A

Whether the missing values are spread uniformly across the data or concentrated in certain groups; in the latter case, removing them biases the dataset.
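
One simple way to check this is to compare the fraction of missing values per group. The sketch below assumes hypothetical survey records with a `group` column and an `income` column in which `None` marks a missing value; the function name is also made up.

```python
from collections import defaultdict

# Hypothetical survey records; None marks a missing income.
records = [
    {"group": "A", "income": 40000},
    {"group": "A", "income": 42000},
    {"group": "A", "income": None},
    {"group": "A", "income": 39000},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
    {"group": "B", "income": 55000},
    {"group": "B", "income": None},
]


def missing_rate_by_group(records, group_key, feature):
    """Return, for each group, the fraction of instances where `feature` is missing."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        total[record[group_key]] += 1
        if record[feature] is None:
            missing[record[group_key]] += 1
    return {group: missing[group] / total[group] for group in total}


rates = missing_rate_by_group(records, "group", "income")
print(rates)
```

Here group B is missing far more often than group A, so dropping incomplete rows would leave group B underrepresented.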

11
Q

What is imputation?

A

The process of filling in missing values with estimated values.

12
Q

What are common imputation techniques for categorical data?

A

Using the most frequent category (mode) or predicting the missing category.
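
Mode imputation can be sketched in a few lines. The function name `impute_mode` and the sample values are hypothetical, and `None` is assumed to mark a missing category; the sketch also assumes at least one value is observed.

```python
from collections import Counter


def impute_mode(values):
    """Replace missing categories (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    # most_common(1) returns a list holding one (category, count) pair.
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


colors = ["red", "blue", None, "red", None, "green"]
filled = impute_mode(colors)
print(filled)
```

Predicting the missing category instead would mean training a classifier on the instances where the category is known, using the other features as inputs.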

13
Q

What are common imputation techniques for numerical data?

A

Using the mean, median, or predicting values using a regression model.

14
Q

What should be done if missing values are expected in production?

A

Train models that can handle missing values, and keep the missing values in the test set so that evaluation reflects production conditions.

15
Q

What is the best way to handle missing labels in training data?

A

Either train only on labeled data or use semi-supervised learning if many labels are missing.

16
Q

Why should missing labels in test data not be ignored?

A

Because ignoring them can lead to biased performance evaluation.

17
Q

What is an outlier?

A

A data point that is significantly different from the rest of the data.

18
Q

What are unnatural outliers?

A

Errors or corrupted values that should be removed or treated as missing data.

19
Q

What are natural outliers?

A

Extreme but valid values that should not be removed unless they distort the analysis.

20
Q

How can outliers impact models that assume normality?

A

They can heavily skew estimates such as the mean and variance, so the model fits the bulk of the data poorly.

21
Q

What is an example of a real-world natural outlier?

A

Bill Gates’ income in a dataset of personal earnings.

22
Q

What are the key questions to ask about outliers?

A

Are they mistakes? Can we expect them in production?

23
Q

What is the impact of assuming normally distributed data?

A

It can lead to incorrect conclusions if the actual data distribution is skewed.

24
Q

What is mean imputation?

A

Replacing missing values with the mean of the existing data.

25
Q

Why is mean imputation problematic in some cases?

A

Because extreme values (outliers) can heavily distort the mean.

26
Q

What is median imputation?

A

Replacing missing values with the median, which is more robust to outliers.
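
The difference between the two fill values is easy to see on a small sample containing one extreme value. The incomes below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical observed incomes; the last value is an extreme but valid outlier.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000_000]

mean_fill = mean(incomes)      # value used by mean imputation
median_fill = median(incomes)  # value used by median imputation

print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")
```

The mean is pulled far above every typical income by the single extreme value, while the median stays in the middle of the ordinary range, so it is usually the safer fill value for skewed features.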

27
Q

What is mean absolute error (MAE)?

A

A metric that measures the average absolute difference between predicted and actual values.

28
Q

Why might MAE be preferable to squared error?

A

Because it is less sensitive to extreme values: each error counts linearly rather than quadratically.
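
A minimal sketch comparing the two metrics on predictions with one large error. The function names and the data are hypothetical, and both functions assume non-empty sequences of equal length.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all instances."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical targets; the last one is an extreme value the model misses badly.
actual = [10, 12, 11, 13, 100]
predicted = [11, 11, 12, 12, 12]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(f"MAE = {mae:.1f}, MSE = {mse:.1f}")
```

Four of the five errors have size 1 and one has size 88. Squaring turns that single error into 7744, so it accounts for almost all of the squared error, whereas in MAE it contributes in proportion to its size.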

29
Q

What is squared error minimization?

A

A method that estimates the central value by choosing the number that minimizes the sum of squared differences to the data; that number is the mean.
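
A short derivation shows why the minimizer is the mean. The symbols are assumptions introduced here: $x_1, \dots, x_n$ are the observed values and $c$ is the candidate central value.

```latex
\begin{aligned}
c^{*} &= \arg\min_{c} \sum_{i=1}^{n} (x_i - c)^2 \\
\frac{d}{dc} \sum_{i=1}^{n} (x_i - c)^2 &= -2 \sum_{i=1}^{n} (x_i - c) = 0 \\
\Rightarrow \quad c^{*} &= \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\end{aligned}
```

By contrast, minimizing the sum of absolute differences $\sum_{i} |x_i - c|$ yields the median, which is why absolute error is the more outlier-robust choice.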

30
Q

What is the key takeaway for data preprocessing?

A

Always investigate missing values, outliers, and distribution assumptions before training models.