lecture 5 Flashcards by V.I.N E.S.H

What is survivorship bias?

The tendency to focus on surviving examples while ignoring those that didn’t make it, leading to skewed conclusions.

Why is it important not to take data at face value?

Because data might be biased or incomplete, affecting conclusions and predictions.

What is Anscombe’s quartet?

A set of four datasets with nearly identical summary statistics (means, variances, correlation, and fitted regression line) but very different distributions when visualized.
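
A minimal sketch of this card, using the first two of the four datasets and only the Python standard library. The `pearson` helper is written here for illustration and is not part of the lecture. The printed summaries come out almost the same, although the first dataset is roughly linear and the second is a curve.

```python
from statistics import mean, variance

# First two datasets of Anscombe's quartet (they share the same x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

# Print mean, sample variance, and correlation with x for each dataset.
for name, y in (("I", y1), ("II", y2)):
    print(name, round(mean(y), 2), round(variance(y), 2), round(pearson(x, y), 2))
```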

Why should we always visualize data before analysis?

Because summary statistics alone can be misleading.

What is a scatter plot matrix?

A grid of scatter plots showing pairwise relationships between multiple features.

What are common issues in raw data?

Missing values, outliers, and class imbalance.

Why do missing values occur in datasets?

They can be due to data collection issues, human errors, or non-responses.

What are missing labels in datasets?

Situations where the target variable is not provided for some instances.

What are the simplest ways to handle missing data?

Removing the feature, removing instances, or imputing missing values.

What should you consider before removing missing data?

Whether the missing values are spread uniformly across the data or concentrated in certain groups, since removing them in the latter case biases the dataset.
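
One way to check this is to compare the missing rate per group before dropping anything. The sketch below is illustrative only: the `records` data, the field names, and the `missing_rate_by_group` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical survey records; None marks a missing income value.
records = [
    {"group": "A", "income": 42000},
    {"group": "A", "income": 51000},
    {"group": "A", "income": None},
    {"group": "A", "income": 39000},
    {"group": "B", "income": None},
    {"group": "B", "income": None},
    {"group": "B", "income": 58000},
    {"group": "B", "income": None},
]

def missing_rate_by_group(rows, group_key, value_key):
    """Return {group: fraction of rows whose value is missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {g: missing[g] / total[g] for g in total}

rates = missing_rate_by_group(records, "group", "income")
print(rates)
```

Here group B loses far more rows than group A, so simply dropping incomplete rows would leave B underrepresented.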

What is imputation?

The process of filling in missing values with estimated values.

What are common imputation techniques for categorical data?

Using the most frequent category (mode) or predicting the missing category.
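
A small sketch of the mode option, assuming missing categories are stored as `None`. The `impute_mode` helper and the `colors` list are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    # most_common(1) returns [(category, count)] for the top category.
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

colors = ["red", "blue", None, "red", None, "green"]
filled = impute_mode(colors)
print(filled)
```

Predicting the missing category with a classifier is the heavier alternative the card mentions; it is not shown here.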

What are common imputation techniques for numerical data?

Using the mean, median, or predicting values using a regression model.

What should be done if missing values are expected in production?

Train models that can handle missing values, and keep missing values in the test set so that evaluation reflects production conditions.

What is the best way to handle missing labels in training data?

Either train only on labeled data or use semi-supervised learning if many labels are missing.

Why should missing labels in test data not be ignored?

Because ignoring them can lead to biased performance evaluation.

What is an outlier?

A data point that is significantly different from the rest of the data.

What are unnatural outliers?

Errors or corrupted values that should be removed or treated as missing data.

What are natural outliers?

Extreme but valid values that should not be removed unless they distort the analysis.

How can outliers impact models that assume normality?

They can heavily skew results, making models perform poorly.

What is an example of a real-world natural outlier?

Bill Gates’ income in a dataset of personal earnings.

What are the key questions to ask about outliers?

Are they mistakes? Can we expect them in production?

What is the impact of assuming normally distributed data?

It can lead to incorrect conclusions if the actual data distribution is skewed.

What is mean imputation?

Replacing missing values with the mean of the existing data.

Why is mean imputation problematic in some cases?

Because extreme values (outliers) can heavily distort the mean.

What is median imputation?

Replacing missing values with the median, which is more robust to outliers.
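
The difference between mean and median imputation shows up as soon as one extreme value is present. A minimal sketch, assuming missing entries are stored as `None`; the `impute` helper and the `incomes` values are made up for illustration.

```python
from statistics import mean, median

def impute(values, strategy):
    """Fill None entries with strategy(observed values), e.g. mean or median."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical incomes (in thousands) with one missing value and one extreme value.
incomes = [30, 35, 40, 45, None, 10_000]

mean_filled = impute(incomes, mean)      # missing entry becomes 2030
median_filled = impute(incomes, median)  # missing entry becomes 40
print(mean_filled[4], median_filled[4])
```

The single extreme income pulls the mean far above every typical value, while the median stays among them.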

What is mean absolute error (MAE)?

A metric that measures the average absolute difference between predicted and actual values.

Why might MAE be preferable to squared error?

Because it penalizes errors linearly rather than quadratically, so it is less sensitive to extreme values.
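
A short sketch comparing the two metrics on made-up numbers. The `mae` and `mse` helpers and both prediction lists are hypothetical and written only for this illustration.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 12, 11, 13]
all_close = [11, 11, 12, 12]   # every prediction off by 1
one_miss = [10, 12, 11, 33]    # three exact predictions, one off by 20

print(mae(actual, all_close), mse(actual, all_close))  # 1.0 1.0
print(mae(actual, one_miss), mse(actual, one_miss))    # 5.0 100.0
```

One large miss raises MAE by a factor of 5 but squared error by a factor of 100, which is why squared error is dominated by extreme values.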

What is squared error minimization?

A method that estimates the central value by choosing the one that minimizes the sum of squared differences from the data points; that minimizer is the mean.
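
The link between the loss and the central value can be checked numerically. The sketch below is a brute-force search over a grid of candidates, not an efficient method; the `data` values and the grid are arbitrary choices for illustration. It finds that the squared loss is smallest at the mean, while the absolute loss is smallest at the median.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]  # hypothetical sample with one extreme value

def squared_loss(c):
    """Sum of squared differences between each point and candidate c."""
    return sum((x - c) ** 2 for x in data)

def absolute_loss(c):
    """Sum of absolute differences between each point and candidate c."""
    return sum(abs(x - c) for x in data)

# Candidate central values 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_sq = min(candidates, key=squared_loss)
best_abs = min(candidates, key=absolute_loss)

print(best_sq, mean(data))     # 22.0 22
print(best_abs, median(data))  # 3.0 3
```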

What is the key takeaway for data preprocessing?

Always investigate missing values, outliers, and distribution assumptions before training models.

(30 cards)