Lecture 5 Flashcards
What is survivorship bias?
The tendency to focus on surviving examples while ignoring those that didn’t make it, leading to skewed conclusions.
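A small sketch of the effect, using made-up numbers: the fund returns and the -10% closure cutoff below are hypothetical and only illustrate how averaging the survivors alone can overstate performance.

```python
from statistics import mean

# Hypothetical annual returns (%) for ten funds; values are invented for illustration.
all_returns = [12.0, 8.0, -15.0, 5.0, -22.0, 9.0, -30.0, 7.0, 11.0, -5.0]

# Assumption: funds that lost more than 10% were closed and vanished from the records.
survivors = [r for r in all_returns if r > -10.0]

mean_all = mean(all_returns)        # average over every fund that ever existed
mean_survivors = mean(survivors)    # average over the funds we can still observe

print(f"all funds: {mean_all:.2f}%  survivors only: {mean_survivors:.2f}%")
```

Looking only at the surviving funds suggests a positive average return, while the full population has a negative one.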
Why is it important not to take data at face value?
Because data might be biased or incomplete, affecting conclusions and predictions.
What is Anscombe’s quartet?
A set of four datasets with nearly identical summary statistics (means, variances, correlation, and fitted regression line) that look very different when plotted.
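The quartet can be checked numerically. The sketch below uses the commonly reproduced values of Anscombe's four datasets and a hand-written `pearson` helper (a name chosen here, not taken from the lecture) to show that the summary statistics agree to about two decimal places.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x values for sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Per dataset: (mean of x, mean of y, sample variance of x, correlation).
summary = {
    name: (mean(xs), mean(ys), variance(xs), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, vx, r) in summary.items():
    print(f"{name:>3}: mean_x={mx:.2f} mean_y={my:.2f} var_x={vx:.2f} r={r:.3f}")
```

The printed rows are almost indistinguishable, which is why plotting the points is needed to see how different the four datasets are.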
Why should we always visualize data before analysis?
Because summary statistics alone can be misleading.
What is a scatter plot matrix?
A grid of scatter plots showing pairwise relationships between multiple features.
What are common issues in raw data?
Missing values, outliers, and class imbalance.
Why do missing values occur in datasets?
They can be due to data collection issues, human errors, or non-responses.
What are missing labels in datasets?
Situations where the target variable is not provided for some instances.
What are the simplest ways to handle missing data?
Removing the feature, removing instances, or imputing missing values.
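The three options can be sketched on a tiny table. The rows and column names below are hypothetical, `None` stands for a missing value, and mean imputation is used only as one possible way to fill the gap.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": 61000},
    {"age": 47, "income": 58000},
]

# Option 1: remove the feature that contains missing values (here "age").
without_age = [{k: v for k, v in r.items() if k != "age"} for r in rows]

# Option 2: remove every instance that has at least one missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option 3: impute, here by filling missing ages with the mean of the observed ages.
observed_ages = [r["age"] for r in rows if r["age"] is not None]
age_fill = mean(observed_ages)
imputed_rows = [dict(r, age=age_fill if r["age"] is None else r["age"]) for r in rows]
```

Option 1 keeps all rows but loses a column, option 2 keeps all columns but loses a row, and option 3 keeps both at the cost of an estimated value.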
What should you consider before removing missing data?
Whether the values are missing uniformly across the data or are concentrated in certain groups, since dropping them in the latter case biases the dataset.
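One way to check this is to compare the fraction of missing values per group. This is a minimal sketch: the records, the `group` and `income` keys, and the helper name `missing_rate_by_group` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing income.
records = [
    {"group": "A", "income": 40000},
    {"group": "A", "income": 42000},
    {"group": "A", "income": None},
    {"group": "A", "income": 39000},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
    {"group": "B", "income": 55000},
    {"group": "B", "income": None},
]

def missing_rate_by_group(records, group_key, value_key):
    """Return {group: fraction of records in that group whose value is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        total[rec[group_key]] += 1
        if rec[value_key] is None:
            missing[rec[group_key]] += 1
    return {g: missing[g] / total[g] for g in total}

rates = missing_rate_by_group(records, "group", "income")
print(rates)
```

A large gap between groups, as in this invented data, suggests that simply dropping the incomplete rows would under-represent one group.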
What is imputation?
The process of filling in missing values with estimated values.
What are common imputation techniques for categorical data?
Using the most frequent category (mode) or predicting the missing category.
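Mode imputation for a categorical column might look like the following sketch. The `colors` list is made-up data and `None` marks a missing entry.

```python
from collections import Counter

# Hypothetical categorical column; None marks a missing entry.
colors = ["red", "blue", None, "red", "green", None, "red"]

# Find the most frequent category among the observed values only.
observed = [c for c in colors if c is not None]
most_frequent = Counter(observed).most_common(1)[0][0]

# Replace each missing entry with that category.
imputed = [c if c is not None else most_frequent for c in colors]
```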
What are common imputation techniques for numerical data?
Using the mean, median, or predicting values using a regression model.
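As a sketch of the regression option, the code below fits a one-variable least-squares line on the complete rows and uses it to predict the missing value. The experience and salary figures are invented, and a single-predictor linear model is an assumption; the lecture does not specify which regression model to use.

```python
from statistics import mean

# Hypothetical (years_experience, salary_in_thousands) pairs; None marks a missing salary.
data = [(1, 30.0), (2, 35.0), (3, None), (4, 45.0), (5, 50.0)]

# Fit y = intercept + slope * x on the rows where the salary is observed.
complete = [(x, y) for x, y in data if y is not None]
mx = mean([x for x, _ in complete])
my = mean([y for _, y in complete])
slope = (sum((x - mx) * (y - my) for x, y in complete)
         / sum((x - mx) ** 2 for x, _ in complete))
intercept = my - slope * mx

# Fill each missing salary with the model's prediction for that row.
filled = [(x, y if y is not None else intercept + slope * x) for x, y in data]
```

Unlike mean or median imputation, the filled value depends on the other feature in the same row.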
What should be done if missing values are expected in production?
Train models that can handle missing values, and keep missing values in the test set so that evaluation reflects production conditions.
What is the best way to handle missing labels in training data?
Either train only on labeled data or use semi-supervised learning if many labels are missing.
Why should missing labels in test data not be ignored?
Because ignoring them can lead to biased performance evaluation.
What is an outlier?
A data point that is significantly different from the rest of the data.
What are unnatural outliers?
Errors or corrupted values that should be removed or treated as missing data.
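Treating unnatural outliers as missing data can be sketched as follows. The ages are made up, and the plausible range of 0 to 120 is an assumption that would need domain knowledge in practice.

```python
# Hypothetical ages; -3 and 999 are impossible values, likely entry errors.
ages = [34, 29, -3, 41, 999, 52]

# Assumed plausible range for a human age.
MIN_AGE, MAX_AGE = 0, 120

# Values outside the range are marked as missing (None) rather than kept or guessed.
cleaned = [a if MIN_AGE <= a <= MAX_AGE else None for a in ages]
```

The resulting `None` entries can then go through whatever missing-value handling is used for the rest of the data.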
What are natural outliers?
Extreme but valid values that should not be removed unless they distort the analysis.
How can outliers impact models that assume normality?
They can pull estimates such as the mean and variance away from the bulk of the data, so the fitted model describes typical values poorly.
What is an example of a real-world natural outlier?
Bill Gates’ income in a dataset of personal earnings.
What are the key questions to ask about outliers?
Are they mistakes? Can we expect them in production?
What is the impact of assuming normally distributed data?
It can lead to incorrect conclusions if the actual distribution is skewed or contains extreme values.
What is mean imputation?
Replacing missing values with the mean of the existing data.
Why is mean imputation problematic in some cases?
Because extreme values (outliers) can heavily distort the mean.
What is median imputation?
Replacing missing values with the median, which is more robust to outliers.
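The difference between the two fill values shows up clearly when the column contains an extreme value. The incomes below are invented for illustration, in thousands, with `None` marking the missing entry.

```python
from statistics import mean, median

# Hypothetical incomes in thousands; one extreme earner and one missing value.
incomes = [30, 35, 40, None, 45, 50, 10000]

observed = [v for v in incomes if v is not None]
mean_fill = mean(observed)      # dragged upward by the extreme value
median_fill = median(observed)  # stays close to the typical incomes

mean_imputed = [mean_fill if v is None else v for v in incomes]
median_imputed = [median_fill if v is None else v for v in incomes]

print(f"mean fill: {mean_fill}, median fill: {median_fill}")
```

With this data the mean fill is far above every typical income, while the median fill sits in the middle of them.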
What is mean absolute error (MAE)?
A metric that measures the average absolute difference between predicted and actual values.
Why might MAE be preferable to squared error?
Because errors are not squared, a single extreme value affects it far less.
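A short comparison on made-up predictions, where one actual value is extreme. The function names and the data are chosen here for illustration only.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all pairs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all pairs."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical targets and predictions; the last target is an extreme value.
actual = [10, 12, 11, 13, 100]
predicted = [11, 11, 12, 12, 12]

mae_with = mean_absolute_error(actual, predicted)
mse_with = mean_squared_error(actual, predicted)

# Same metrics with the extreme pair left out.
mae_without = mean_absolute_error(actual[:-1], predicted[:-1])
mse_without = mean_squared_error(actual[:-1], predicted[:-1])

print(f"MAE: {mae_without} -> {mae_with}")
print(f"MSE: {mse_without} -> {mse_with}")
```

Both metrics start at the same value without the extreme pair, but the squared error grows far more once it is included.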
What is squared error minimization?
Estimating a central value by choosing the one that minimizes the sum of squared differences to the data; the minimizer is the mean.
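The link between the error measure and the resulting central value can be checked with a brute-force search. This sketch uses invented data and a grid of candidate values in steps of 0.1, which is an illustrative shortcut and not how the minimizer is normally computed.

```python
from statistics import mean, median

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 100]

def sum_squared_error(c):
    """Sum of squared differences between each data point and candidate c."""
    return sum((x - c) ** 2 for x in data)

def sum_absolute_error(c):
    """Sum of absolute differences between each data point and candidate c."""
    return sum(abs(x - c) for x in data)

# Candidate central values 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]

best_squared = min(candidates, key=sum_squared_error)
best_absolute = min(candidates, key=sum_absolute_error)

print(f"squared error minimizer:  {best_squared} (mean = {mean(data)})")
print(f"absolute error minimizer: {best_absolute} (median = {median(data)})")
```

On this data the squared-error minimizer coincides with the mean and is pulled toward the extreme value, while the absolute-error minimizer coincides with the median.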
What is the key takeaway for data preprocessing?
Always investigate missing values, outliers, and distribution assumptions before training models.